Understanding Univariate and Multivariate Data in R

R Programming: Understanding Data Types – Univariate, Multivariate, Categorical & Quantitative

Topic 1: Understanding Univariate and Multivariate Data

In data analysis, understanding the difference between univariate and multivariate data is fundamental. Univariate data involves analysis of a single variable, focusing on describing its patterns and characteristics. This type of analysis helps us understand the distribution, central tendency, and spread of individual variables. Common univariate analyses include calculating mean, median, mode, standard deviation, and creating histograms or box plots.

Multivariate data, on the other hand, involves analyzing multiple variables simultaneously to understand relationships between them. This approach helps identify correlations, patterns, and interactions that might not be apparent when examining variables individually. Multivariate analysis is crucial for understanding complex real-world phenomena where multiple factors interact.

Key Definitions:

Univariate Analysis: Analysis of one variable at a time
Multivariate Analysis: Analysis of multiple variables simultaneously (the two-variable case is often called bivariate analysis)

    # Univariate analysis examples
    # Analyzing a single variable – student ages
    student_ages <- c(18, 19, 20, 21, 22, 19, 20, 23, 18, 21)

    # Univariate statistics
    mean_age <- mean(student_ages)
    median_age <- median(student_ages)
    sd_age <- sd(student_ages)

    # Multivariate analysis examples
    # Analyzing relationship between age and test scores
    student_data <- data.frame(
        age = c(18, 19, 20, 21, 22),
        test_score = c(85, 78, 92, 88, 95),
        study_hours = c(10, 8, 15, 12, 18)
    )

    # Multivariate analysis – correlation
    correlation_matrix <- cor(student_data)
    print(correlation_matrix)
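The univariate measures mentioned earlier include the mode, which base R does not compute directly: the built-in mode() function reports an object's storage mode, not its most frequent value. The following is a minimal sketch of one way to obtain it. The name stat_mode is a hypothetical helper, not a base R function.

```r
# Hypothetical helper: returns the most frequent value(s) of a vector.
stat_mode <- function(x) {
  counts <- table(x)
  # Keep every value that reaches the highest count (this handles ties)
  modes <- names(counts)[counts == max(counts)]
  # table() stores values as character names, so convert back for numeric input
  if (is.numeric(x)) as.numeric(modes) else modes
}

student_ages <- c(18, 19, 20, 21, 22, 19, 20, 23, 18, 21)
age_modes <- stat_mode(student_ages)
print(age_modes)
```

With these ages, four values tie for the highest count, which is why the helper returns a vector and not a single number.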

Practical Example:

Imagine you’re analyzing customer data for an e-commerce store. Univariate analysis would look at individual metrics like “average purchase amount” or “most common product category.” Multivariate analysis would examine how “purchase amount” relates to “customer age,” “browsing time,” and “number of previous purchases” simultaneously.
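The e-commerce scenario can be sketched in a few lines. The data frame below is invented purely for illustration, and its column names (purchase_amount, customer_age, browsing_minutes, previous_purchases) are assumptions, not fields from any real store.

```r
# Hypothetical customer records, invented purely for illustration
customers <- data.frame(
  purchase_amount    = c(35.5, 120.0, 64.2, 18.9, 210.4, 75.0, 42.3, 150.8),
  customer_age       = c(23, 41, 35, 19, 52, 38, 27, 46),
  browsing_minutes   = c(5, 22, 12, 3, 30, 15, 8, 25),
  previous_purchases = c(1, 8, 4, 0, 12, 5, 2, 9)
)

# Univariate view: one metric at a time
avg_purchase <- mean(customers$purchase_amount)
cat("Average purchase amount:", avg_purchase, "\n")

# Multivariate view: how purchase amount relates to every other variable
purchase_cors <- cor(customers)["purchase_amount", ]
print(round(purchase_cors, 2))
```

The first result describes a single column in isolation. The second reads one row of the correlation matrix, so it only makes sense when several variables are considered together.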

Aspect	Univariate Analysis	Multivariate Analysis
Variables	One variable	Multiple variables
Purpose	Describe individual variables	Understand relationships between variables
Common Techniques	Mean, median, histogram, box plot	Correlation, regression, PCA, clustering
Complexity	Low	High
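Of the multivariate techniques listed in the table, regression is probably the quickest to try. As a rough sketch, the model below uses the built-in mtcars dataset and assumes that horsepower (hp) and weight (wt) are reasonable predictors of mpg. That choice is for illustration only and is not a recommended model.

```r
data(mtcars)

# Multiple linear regression: mpg explained by two predictors at once
fit <- lm(mpg ~ hp + wt, data = mtcars)

# Each coefficient is the expected change in mpg per unit of that predictor,
# holding the other predictor fixed
print(coef(fit))

# Share of the variance in mpg that the two predictors account for
r_squared <- summary(fit)$r.squared
cat("R-squared:", r_squared, "\n")
```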

Practice Exercise: Univariate vs Multivariate Analysis

Using the built-in mtcars dataset in R, perform the following tasks:

1. Conduct univariate analysis on the mpg (miles per gallon) variable
2. Calculate mean, median, and standard deviation for mpg
3. Create a histogram of mpg distribution
4. Perform multivariate analysis by examining the relationship between mpg and hp (horsepower)
5. Create a scatter plot showing mpg vs hp and calculate their correlation

Answer and Solutions

    # Load the mtcars dataset
    data(mtcars)

    # 1. Univariate analysis on mpg
    # 2. Calculate basic statistics
    mpg_mean <- mean(mtcars$mpg)
    mpg_median <- median(mtcars$mpg)
    mpg_sd <- sd(mtcars$mpg)

    cat("MPG Statistics:\n")
    cat("Mean:", mpg_mean, "\n")
    cat("Median:", mpg_median, "\n")
    cat("Standard Deviation:", mpg_sd, "\n")

    # 3. Create histogram
    hist(mtcars$mpg,
         main = "Distribution of Miles Per Gallon",
         xlab = "MPG",
         col = "lightblue",
         border = "black")

    # 4. & 5. Multivariate analysis: mpg vs hp
    plot(mtcars$hp, mtcars$mpg,
         main = "MPG vs Horsepower",
         xlab = "Horsepower",
         ylab = "Miles Per Gallon",
         pch = 19,
         col = "red")

    # Calculate correlation
    mpg_hp_cor <- cor(mtcars$mpg, mtcars$hp)
    cat("Correlation between MPG and HP:", mpg_hp_cor, "\n")

    # Add the fitted regression line to the scatter plot
    abline(lm(mpg ~ hp, data = mtcars), col = "blue")

Interpretation: The correlation between mpg and hp is negative (about -0.78), which indicates that cars with higher horsepower tend to have lower fuel efficiency (mpg).
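A single correlation coefficient does not show how much uncertainty surrounds the estimate. One possible follow-up, sketched here with base R's cor.test() and its default Pearson method, reports a confidence interval and p-value alongside the coefficient.

```r
data(mtcars)

# Pearson correlation test between fuel efficiency and horsepower
test_result <- cor.test(mtcars$mpg, mtcars$hp)

cat("Estimated correlation:", test_result$estimate, "\n")
cat("95% confidence interval:", test_result$conf.int, "\n")
cat("p-value:", test_result$p.value, "\n")
```

If the whole confidence interval lies below zero, the negative relationship is unlikely to be an accident of this particular sample. It still says nothing about causation.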

Topic 2: Understanding Categorical and Quantitative Data

Data can be broadly classified into two main types: categorical and quantitative. Categorical data (also called qualitative data) represents characteristics or qualities that can be grouped into categories. This type of data describes qualities rather than quantities and is typically non-numeric. Categorical data can be further divided into nominal (no inherent order) and ordinal (natural order).

Quantitative data represents numerical measurements and can be counted or measured. This type of data deals with numbers and supports arithmetic operations. Quantitative data is further classified as discrete (countable values, such as the number of children in a family) or continuous (any value within a range, such as height).

Key Definitions:

Categorical Data: Data that can be grouped into categories (e.g., colors, types, brands)
Quantitative Data: Numerical data that can be measured (e.g., height, weight, temperature)

    # Categorical data examples
    # Nominal categorical data (no order)
    car_brands <- c("Toyota", "Honda", "Ford", "Toyota", "BMW", "Honda")
    car_brands_factor <- factor(car_brands)

    # Ordinal categorical data (has order)
    satisfaction_levels <- c("Low", "Medium", "High", "Medium", "High")
    satisfaction_factor <- factor(satisfaction_levels,
        levels = c("Low", "Medium", "High"),
        ordered = TRUE)

    # Quantitative data examples
    # Discrete quantitative data
    number_of_children <- c(0, 1, 2, 3, 1, 0, 2, 4, 1)

    # Continuous quantitative data
    student_heights <- c(165.2, 170.5, 175.1, 162.8, 180.3, 168.9)

    # Analyzing categorical data
    brand_frequency <- table(car_brands)
    brand_proportions <- prop.table(brand_frequency)

    # Analyzing quantitative data
    height_summary <- summary(student_heights)
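The distinction between nominal and ordinal data changes what R will let you do with a factor. The sketch below uses small example vectors made up for illustration: an ordered factor supports comparisons, while the same comparison on an unordered factor yields NA.

```r
# Ordinal data: levels have a meaningful order
satisfaction <- factor(c("Low", "Medium", "High", "Medium", "High"),
                       levels = c("Low", "Medium", "High"),
                       ordered = TRUE)

# Comparisons respect the declared level order
at_least_medium <- satisfaction >= "Medium"
print(at_least_medium)

# max() also works on ordered factors
highest <- max(satisfaction)
print(highest)

# Nominal data: no order, so a comparison is not meaningful.
# R returns NA (with a warning, suppressed here to keep the output clean).
brands <- factor(c("Toyota", "Honda", "Ford"))
nominal_comparison <- suppressWarnings(brands < "Honda")
print(nominal_comparison)
```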

Practical Example:

In a customer survey, categorical data includes responses like “product color” (red, blue, green) or “satisfaction level” (very satisfied, satisfied, neutral, dissatisfied). Quantitative data includes “age,” “annual income,” or “number of purchases.” The analysis methods differ: for categorical data, we use frequency tables and bar charts; for quantitative data, we use measures of central tendency and dispersion.

Characteristic	Categorical Data	Quantitative Data
Nature	Qualitative, descriptive	Numerical, measurable
Examples	Gender, color, brand	Height, weight, temperature
Analysis Methods	Frequency tables, mode, chi-square	Mean, median, standard deviation
Visualization	Bar charts, pie charts	Histograms, scatter plots
Mathematical Operations	Counting, grouping	Arithmetic (sums, differences, averages)
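The table lists chi-square among the methods for categorical data. As an illustrative sketch, the test below checks whether two categorical variables are associated. It uses the built-in HairEyeColor table from the datasets package, collapsed over sex, which was chosen only because it ships with R.

```r
# HairEyeColor is a 3-way table (Hair x Eye x Sex); sum over Sex
hair_eye <- margin.table(HairEyeColor, c(1, 2))
print(hair_eye)

# Chi-square test of independence between hair color and eye color
chi_result <- chisq.test(hair_eye)
print(chi_result)
```

A small p-value suggests that the two classifications are not independent. The test operates on counts, which is why it suits categorical data and not raw measurements.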

Pro Tip: Always check your data types using class() or str() functions in R before analysis. Converting between data types appropriately is crucial for accurate analysis.
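One reason this check matters is that categorical variables are sometimes stored as numbers. A brief sketch using the built-in mtcars dataset, where the am column codes transmission type as 0 (automatic) or 1 (manual); the copy named cars is introduced here only for illustration:

```r
data(mtcars)

# Every column is numeric, including ones that are really categories
column_classes <- sapply(mtcars[, c("mpg", "cyl", "am")], class)
print(column_classes)

# Work on a copy so the original dataset stays untouched
cars <- mtcars
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("automatic", "manual"))

# After conversion, frequency counts are the natural summary
am_counts <- table(cars$am)
print(am_counts)
```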

Practice Exercise: Categorical vs Quantitative Data Analysis

Using the built-in iris dataset in R, perform the following tasks:

1. Identify which variables are categorical and which are quantitative
2. For categorical variable(s), create a frequency table and bar plot
3. For quantitative variables, calculate summary statistics (mean, median, sd)
4. Create histograms for two quantitative variables
5. Compare the distribution of a quantitative variable across different categories

Answer and Solutions

    # Load the iris dataset
    data(iris)

    # 1. Identify variable types
    str(iris)
    cat("\nVariable Types:\n")
    cat("Sepal.Length: Quantitative (continuous)\n")
    cat("Sepal.Width: Quantitative (continuous)\n")
    cat("Petal.Length: Quantitative (continuous)\n")
    cat("Petal.Width: Quantitative (continuous)\n")
    cat("Species: Categorical (nominal)\n")

    # 2. Analyze categorical variable (Species)
    species_freq <- table(iris$Species)
    cat("Species Frequency Table:\n")
    print(species_freq)

    # Bar plot for species
    barplot(species_freq,
            main = "Frequency of Iris Species",
            xlab = "Species",
            ylab = "Frequency",
            col = c("lightcoral", "lightblue", "lightgreen"))

    # 3. Summary statistics for quantitative variables
    quantitative_vars <- iris[, 1:4]  # First 4 columns are quantitative
    cat("\nSummary Statistics for Quantitative Variables:\n")
    for (col_name in names(quantitative_vars)) {
        cat("\n", col_name, ":\n")
        cat("  Mean:", mean(quantitative_vars[[col_name]]), "\n")
        cat("  Median:", median(quantitative_vars[[col_name]]), "\n")
        cat("  SD:", sd(quantitative_vars[[col_name]]), "\n")
    }

    # 4. Histograms for two quantitative variables
    par(mfrow = c(1, 2))  # Create 1×2 plot layout
    hist(iris$Sepal.Length,
         main = "Distribution of Sepal Length",
         xlab = "Sepal Length (cm)",
         col = "lightblue")
    hist(iris$Petal.Length,
         main = "Distribution of Petal Length",
         xlab = "Petal Length (cm)",
         col = "lightgreen")
    par(mfrow = c(1, 1))  # Reset plot layout

    # 5. Compare quantitative variable across categories
    # Compare Sepal.Length across different Species
    boxplot(Sepal.Length ~ Species,
            data = iris,
            main = "Sepal Length by Iris Species",
            xlab = "Species",
            ylab = "Sepal Length (cm)",
            col = c("lightcoral", "lightblue", "lightgreen"))

Interpretation: The box plot shows clear differences in sepal length across different iris species, with setosa having the shortest sepals and virginica the longest. This demonstrates how categorical variables (species) can help us understand patterns in quantitative variables (sepal length).
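The visual comparison can be backed with numbers. One way to do this, sketched with base R's tapply(), is to compute a summary statistic of the quantitative variable within each category.

```r
data(iris)

# Mean and median sepal length within each species
mean_by_species <- tapply(iris$Sepal.Length, iris$Species, mean)
median_by_species <- tapply(iris$Sepal.Length, iris$Species, median)

print(round(mean_by_species, 2))
print(median_by_species)
```

The group means follow the same ordering as the box plot: setosa lowest, versicolor in the middle, virginica highest.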

Topic 3: Data Type Conversion and Practical Handling

In real-world data analysis, you’ll often need to convert between different data types and handle mixed data types appropriately. Understanding how to work with different data types is crucial for effective data analysis in R. Data type conversion ensures that your analysis methods match the nature of your data.

Common conversion scenarios include converting character data to factors for categorical analysis, converting factors to numeric for calculations, and handling date-time data. Proper data type handling prevents errors in analysis and ensures that statistical methods are applied correctly.
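Date handling is one of the conversion scenarios mentioned here. As a minimal sketch with made-up date strings, base R's as.Date() turns text into Date objects that support arithmetic.

```r
# Dates stored as text (ISO format: year-month-day)
date_strings <- c("2024-01-15", "2024-03-01", "2024-12-31")
dates <- as.Date(date_strings)
print(class(dates))

# Once converted, date arithmetic works: days between consecutive dates
days_between <- as.numeric(diff(dates))
print(days_between)

# Other layouts need an explicit format string
eu_date <- as.Date("15/01/2024", format = "%d/%m/%Y")
print(eu_date == dates[1])
```

While the values remain character strings, subtraction or sorting by calendar order is unreliable, so convert them before any time-based analysis.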

    # Common data type conversions in R
    # Creating sample mixed data
    mixed_data <- data.frame(
        id = 1:6,
        age = c("25", "30", "35", "28", "32", "29"),  # Character numbers
        score = c(85, 92, 78, 88, 95, 82),
        grade = c("A", "B", "A", "C", "B", "A"),
        passed = c("TRUE", "TRUE", "FALSE", "TRUE", "TRUE", "FALSE")
    )

    # Check current data types
    str(mixed_data)

    # Convert character to numeric
    mixed_data$age <- as.numeric(mixed_data$age)

    # Convert character to factor (categorical)
    mixed_data$grade <- as.factor(mixed_data$grade)

    # Convert character to logical
    mixed_data$passed <- as.logical(mixed_data$passed)

    # Check converted data types
    str(mixed_data)

    # Working with ordered factors
    satisfaction <- c("Low", "High", "Medium", "Low", "High")
    satisfaction_ordered <- factor(satisfaction,
        levels = c("Low", "Medium", "High"),
        ordered = TRUE)

    # Check if ordered
    is.ordered(satisfaction_ordered)

Practical Example:

When importing data from CSV files, numeric columns may be read as character strings if any value contains non-numeric characters, such as currency symbols, thousands separators, or a missing-value code like “N/A” that R does not recognize by default. You’ll need to clean and convert these columns to appropriate numeric types. Similarly, categorical variables are imported as character strings by default (since R 4.0) and should be converted to factors for statistical modeling.
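The import problem can be reproduced without a file on disk. In the sketch below, the CSV content is invented for illustration and passed to read.csv() through its text argument. The "N/A" missing-value code and the currency formatting are assumptions about what a messy export might contain.

```r
# Hypothetical CSV content with currency symbols, commas, and an "N/A" code
csv_lines <- c(
  'id,income,region',
  '1,"$52,000",North',
  '2,N/A,South',
  '3,"$61,500",North',
  '4,"$48,250",East'
)

# Tell read.csv which strings mean "missing"
raw_data <- read.csv(text = paste(csv_lines, collapse = "\n"),
                     na.strings = c("", "NA", "N/A"))

# income arrives as character because of the "$" and "," characters
income_class_before <- class(raw_data$income)
print(income_class_before)

# Strip the symbols, then convert; the missing entry stays NA
raw_data$income <- as.numeric(gsub("[$,]", "", raw_data$income))

# region is categorical, so store it as a factor
raw_data$region <- factor(raw_data$region)

str(raw_data)
```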

Important Considerations:

Always check data types after importing data
Use factors for categorical variables in statistical models
Be careful when converting factors to numeric: as.numeric() on a factor returns the internal level codes, so use as.numeric(as.character(x)) to recover the original values
Handle missing values before type conversion
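The factor-to-numeric point in the list above is easy to get wrong, so a short demonstration may help. The year values below are made up for illustration.

```r
# Numbers that ended up stored as a factor
year_factor <- factor(c("2021", "2023", "2022", "2023"))

# Direct conversion returns the internal level codes, not the years
codes <- as.numeric(year_factor)
print(codes)

# Going through character first recovers the original values
years <- as.numeric(as.character(year_factor))
print(years)
```

The first result is silently wrong: R raises no error or warning, which is why this mistake often goes unnoticed until the summary statistics look implausible.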

Practice Exercise: Data Type Conversion and Analysis

Create a dataset with mixed data types and perform the following operations:

1. Create a data frame with character, numeric, and logical data
2. Convert appropriate columns to correct data types
3. Create a categorical variable with natural ordering and convert to ordered factor
4. Perform appropriate analysis based on the converted data types
5. Create visualizations that match the data types

Answer and Solutions

    # 1. Create dataset with mixed data types
    student_data <- data.frame(
        student_id = 1:8,
        name = c("Alice", "Bob", "Charlie", "Diana", "Eve", "Frank", "Grace", "Henry"),
        age = c("21", "22", "20", "23", "21", "22", "20", "24"),
        gpa = c(3.8, 3.2, 3.9, 3.5, 3.7, 3.1, 3.6, 3.4),
        major = c("CS", "Math", "CS", "Physics", "Math", "CS", "Physics", "Math"),
        graduation_year = c("2024", "2024", "2025", "2024", "2025", "2024", "2025", "2024"),
        scholarship = c("TRUE", "FALSE", "TRUE", "TRUE", "FALSE", "TRUE", "FALSE", "TRUE"),
        performance = c("Excellent", "Good", "Excellent", "Average", "Good", "Average", "Good", "Excellent")
    )

    # Check initial structure
    cat("Initial data structure:\n")
    str(student_data)

    # 2. Convert to appropriate data types
    student_data$age <- as.numeric(student_data$age)
    student_data$major <- as.factor(student_data$major)
    student_data$graduation_year <- as.factor(student_data$graduation_year)
    student_data$scholarship <- as.logical(student_data$scholarship)

    # 3. Create ordered factor for performance
    # ("Poor" is a valid level even though no student currently has it)
    student_data$performance <- factor(student_data$performance,
        levels = c("Poor", "Average", "Good", "Excellent"),
        ordered = TRUE)

    # Check converted structure
    cat("\nConverted data structure:\n")
    str(student_data)

    # 4. Perform analysis based on data types
    # Quantitative analysis for GPA
    cat("\nGPA Summary:\n")
    print(summary(student_data$gpa))  # print() so output appears when sourced
    cat("Standard Deviation:", sd(student_data$gpa), "\n")

    # Categorical analysis for major
    cat("\nMajor Distribution:\n")
    major_table <- table(student_data$major)
    print(major_table)

    # 5. Create appropriate visualizations
    par(mfrow = c(2, 2))

    # Histogram for quantitative data (GPA)
    hist(student_data$gpa,
         main = "Distribution of GPA",
         xlab = "GPA",
         col = "lightblue")

    # Bar plot for categorical data (Major)
    barplot(major_table,
            main = "Students by Major",
            xlab = "Major",
            ylab = "Count",
            col = "lightgreen")

    # Box plot comparing GPA across majors
    boxplot(gpa ~ major,
            data = student_data,
            main = "GPA by Major",
            xlab = "Major",
            ylab = "GPA",
            col = c("lightcoral", "lightblue", "lightgreen"))

    # Bar plot for ordered categorical data
    performance_table <- table(student_data$performance)
    barplot(performance_table,
            main = "Performance Levels",
            xlab = "Performance",
            ylab = "Count",
            col = "gold")

    par(mfrow = c(1, 1))  # Reset layout

Key Learning: Proper data type conversion ensures that you can apply appropriate statistical methods and visualizations. Quantitative data benefits from measures of central tendency and dispersion, while categorical data is best analyzed with frequency counts and proportions.