Understanding Univariate and Multivariate Data in R

R Programming: Understanding Data Types – Univariate, Multivariate, Categorical & Quantitative

Topic 1: Understanding Univariate and Multivariate Data

In data analysis, understanding the difference between univariate and multivariate data is fundamental. Univariate data consists of observations on a single variable, and univariate analysis describes that variable's patterns and characteristics: its distribution, central tendency, and spread. Common univariate analyses include calculating the mean, median, mode, and standard deviation, and drawing histograms or box plots.

Multivariate analysis, on the other hand, examines two or more variables simultaneously to understand the relationships between them (the two-variable case is often called bivariate analysis). This approach helps identify correlations, patterns, and interactions that are not apparent when variables are examined individually. It is crucial for understanding complex real-world phenomena in which multiple factors interact.

Key Definitions:
  • Univariate Analysis: Analysis of one variable at a time
  • Multivariate Analysis: Analysis of multiple variables simultaneously
# Univariate analysis examples
# Analyzing a single variable – student ages
student_ages <- c(18, 19, 20, 21, 22, 19, 20, 23, 18, 21)

# Univariate statistics
mean_age <- mean(student_ages)
median_age <- median(student_ages)
sd_age <- sd(student_ages)

# Multivariate analysis examples
# Analyzing relationship between age and test scores
student_data <- data.frame(
  age = c(18, 19, 20, 21, 22),
  test_score = c(85, 78, 92, 88, 95),
  study_hours = c(10, 8, 15, 12, 18)
)

# Multivariate analysis – correlation
correlation_matrix <- cor(student_data)
print(correlation_matrix)
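
The univariate statistics above leave out the mode. Base R has no built-in function for the statistical mode (the mode() function reports an object's storage mode instead), so one common workaround is to tabulate the values. The sketch below is a minimal illustration; stat_mode is a hypothetical helper name introduced here, not a base R function, and it returns every value tied for the highest frequency.

```r
# Same ages as in the univariate example above
student_ages <- c(18, 19, 20, 21, 22, 19, 20, 23, 18, 21)

# Hypothetical helper: returns all values tied for the highest frequency
stat_mode <- function(x) {
  counts <- table(x)                               # frequency of each distinct value
  top <- names(counts)[as.vector(counts) == max(counts)]
  as.numeric(top)                                  # table() names are character strings
}

age_modes <- stat_mode(student_ages)
print(age_modes)
```

These ages are multimodal (several values share the highest count), which is why the helper returns a vector rather than a single number.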
Practical Example:

Imagine you’re analyzing customer data for an e-commerce store. Univariate analysis would look at individual metrics like “average purchase amount” or “most common product category.” Multivariate analysis would examine how “purchase amount” relates to “customer age,” “browsing time,” and “number of previous purchases” simultaneously.

Aspect            | Univariate Analysis               | Multivariate Analysis
Variables         | One variable                      | Multiple variables
Purpose           | Describe individual variables     | Understand relationships between variables
Common Techniques | Mean, median, histogram, box plot | Correlation, regression, PCA, clustering
Complexity        | Low                               | High
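
As a rough illustration of three of the multivariate techniques named in the table, the sketch below applies correlation, multiple regression, and PCA to the built-in mtcars dataset. The choice of the mpg, hp, and wt columns, and the variable name car_vars, are assumptions made for this example only.

```r
# Subset of the built-in mtcars data: fuel economy, horsepower, weight
car_vars <- mtcars[, c("mpg", "hp", "wt")]

# Correlation matrix: pairwise linear association between the three variables
cor_mat <- cor(car_vars)
print(round(cor_mat, 2))

# Multiple regression: model mpg using hp and wt together
fit <- lm(mpg ~ hp + wt, data = car_vars)
print(coef(fit))

# Principal component analysis on standardized variables
pca <- prcomp(car_vars, scale. = TRUE)
print(summary(pca))
```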

Practice Exercise: Univariate vs Multivariate Analysis

Using the built-in mtcars dataset in R, perform the following tasks:

  1. Conduct univariate analysis on the mpg (miles per gallon) variable
  2. Calculate mean, median, and standard deviation for mpg
  3. Create a histogram of mpg distribution
  4. Perform multivariate analysis by examining the relationship between mpg and hp (horsepower)
  5. Create a scatter plot showing mpg vs hp and calculate their correlation

Answer and Solutions

# Load the mtcars dataset
data(mtcars)

# 1. Univariate analysis on mpg
# 2. Calculate basic statistics
mpg_mean <- mean(mtcars$mpg)
mpg_median <- median(mtcars$mpg)
mpg_sd <- sd(mtcars$mpg)

cat("MPG Statistics:\n")
cat("Mean:", mpg_mean, "\n")
cat("Median:", mpg_median, "\n")
cat("Standard Deviation:", mpg_sd, "\n")

# 3. Create histogram
hist(mtcars$mpg,
     main = "Distribution of Miles Per Gallon",
     xlab = "MPG",
     col = "lightblue",
     border = "black")

# 4. & 5. Multivariate analysis: mpg vs hp
plot(mtcars$hp, mtcars$mpg,
     main = "MPG vs Horsepower",
     xlab = "Horsepower",
     ylab = "Miles Per Gallon",
     pch = 19,
     col = "red")

# Calculate correlation
mpg_hp_cor <- cor(mtcars$mpg, mtcars$hp)
cat("Correlation between MPG and HP:", mpg_hp_cor, "\n")

# Add the fitted linear regression line to the scatter plot
abline(lm(mpg ~ hp, data = mtcars), col = "blue")
Interpretation: The correlation between mpg and hp is about -0.78, a fairly strong negative linear association: cars with higher horsepower tend to have lower fuel efficiency (mpg).

Topic 2: Understanding Categorical and Quantitative Data

Data can be broadly classified into two main types: categorical and quantitative. Categorical data (also called qualitative data) represents characteristics or qualities that can be grouped into categories. This type of data describes qualities rather than quantities and is typically non-numeric. Categorical data can be further divided into nominal (no inherent order) and ordinal (natural order).

Quantitative data represents numerical measurements or counts. Because the values are numbers, arithmetic operations such as sums, differences, and averages are meaningful. Quantitative data is further classified as discrete (countable values, typically whole numbers such as a number of children) or continuous (measured values that can take any value within a range, such as height).

Key Definitions:
  • Categorical Data: Data that can be grouped into categories (e.g., colors, types, brands)
  • Quantitative Data: Numerical data that can be measured (e.g., height, weight, temperature)
# Categorical data examples
# Nominal categorical data (no order)
car_brands <- c("Toyota", "Honda", "Ford", "Toyota", "BMW", "Honda")
car_brands_factor <- factor(car_brands)

# Ordinal categorical data (has order)
satisfaction_levels <- c("Low", "Medium", "High", "Medium", "High")
satisfaction_factor <- factor(satisfaction_levels,
                              levels = c("Low", "Medium", "High"),
                              ordered = TRUE)

# Quantitative data examples
# Discrete quantitative data
number_of_children <- c(0, 1, 2, 3, 1, 0, 2, 4, 1)

# Continuous quantitative data
student_heights <- c(165.2, 170.5, 175.1, 162.8, 180.3, 168.9)

# Analyzing categorical data
brand_frequency <- table(car_brands)
brand_proportions <- prop.table(brand_frequency)

# Analyzing quantitative data
height_summary <- summary(student_heights)
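
The practical difference between nominal and ordinal factors shows up in which operations R permits. A minimal sketch, reusing the satisfaction levels from the example above: comparison operators are meaningful for ordered factors, whereas R does not define them for unordered factors.

```r
satisfaction_levels <- c("Low", "Medium", "High", "Medium", "High")
satisfaction_factor <- factor(satisfaction_levels,
                              levels = c("Low", "Medium", "High"),
                              ordered = TRUE)

# Levels are stored in the order given, not alphabetically
print(levels(satisfaction_factor))

# Comparisons follow the level order for ordered factors
at_least_medium <- satisfaction_factor >= "Medium"
print(at_least_medium)

# The frequency table also follows the level order: Low, Medium, High
satisfaction_counts <- table(satisfaction_factor)
print(satisfaction_counts)

# Highest level observed in the data
highest_level <- max(satisfaction_factor)
print(highest_level)
```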
Practical Example:

In a customer survey, categorical data includes responses like “product color” (red, blue, green) or “satisfaction level” (very satisfied, satisfied, neutral, dissatisfied). Quantitative data includes “age,” “annual income,” or “number of purchases.” The analysis methods differ: for categorical data, we use frequency tables and bar charts; for quantitative data, we use measures of central tendency and dispersion.

Characteristic          | Categorical Data                   | Quantitative Data
Nature                  | Qualitative, descriptive           | Numerical, measurable
Examples                | Gender, color, brand               | Height, weight, temperature
Analysis Methods        | Frequency tables, mode, chi-square | Mean, median, standard deviation
Visualization           | Bar charts, pie charts             | Histograms, scatter plots
Mathematical Operations | Counting, grouping                 | All mathematical operations

Pro Tip: Always check your data types using class() or str() functions in R before analysis. Converting between data types appropriately is crucial for accurate analysis.
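
One way to apply this tip is to list the class of every column at once. The sketch below uses the built-in ToothGrowth dataset, chosen here only as an example; sapply() with class is one common idiom, not the only option.

```r
# Compact overview: one line per column with its type and first values
str(ToothGrowth)

# Class of each column as a named character vector
column_classes <- sapply(ToothGrowth, class)
print(column_classes)

# Split column names by type
quantitative_cols <- names(ToothGrowth)[sapply(ToothGrowth, is.numeric)]
categorical_cols <- names(ToothGrowth)[sapply(ToothGrowth, is.factor)]
print(quantitative_cols)
print(categorical_cols)
```

Note that a numeric class does not settle the question on its own: the dose column is stored as numeric but takes only a few distinct values, so depending on the analysis it may be more appropriate to treat it as a factor.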

Practice Exercise: Categorical vs Quantitative Data Analysis

Using the built-in iris dataset in R, perform the following tasks:

  1. Identify which variables are categorical and which are quantitative
  2. For categorical variable(s), create a frequency table and bar plot
  3. For quantitative variables, calculate summary statistics (mean, median, sd)
  4. Create histograms for two quantitative variables
  5. Compare the distribution of a quantitative variable across different categories

Answer and Solutions

# Load the iris dataset
data(iris)

# 1. Identify variable types
str(iris)
cat("\nVariable Types:\n")
cat("Sepal.Length: Quantitative (continuous)\n")
cat("Sepal.Width: Quantitative (continuous)\n")
cat("Petal.Length: Quantitative (continuous)\n")
cat("Petal.Width: Quantitative (continuous)\n")
cat("Species: Categorical (nominal)\n")

# 2. Analyze categorical variable (Species)
species_freq <- table(iris$Species)
cat("Species Frequency Table:\n")
print(species_freq)

# Bar plot for species
barplot(species_freq,
        main = "Frequency of Iris Species",
        xlab = "Species",
        ylab = "Frequency",
        col = c("lightcoral", "lightblue", "lightgreen"))

# 3. Summary statistics for quantitative variables
quantitative_vars <- iris[, 1:4] # First 4 columns are quantitative

cat("\nSummary Statistics for Quantitative Variables:\n")
for (col_name in names(quantitative_vars)) {
  cat("\n", col_name, ":\n")
  cat("  Mean:", mean(quantitative_vars[[col_name]]), "\n")
  cat("  Median:", median(quantitative_vars[[col_name]]), "\n")
  cat("  SD:", sd(quantitative_vars[[col_name]]), "\n")
}

# 4. Histograms for two quantitative variables
par(mfrow = c(1, 2)) # Create 1×2 plot layout

hist(iris$Sepal.Length,
     main = "Distribution of Sepal Length",
     xlab = "Sepal Length (cm)",
     col = "lightblue")

hist(iris$Petal.Length,
     main = "Distribution of Petal Length",
     xlab = "Petal Length (cm)",
     col = "lightgreen")

par(mfrow = c(1, 1)) # Reset plot layout

# 5. Compare quantitative variable across categories
# Compare Sepal.Length across different Species
boxplot(Sepal.Length ~ Species,
        data = iris,
        main = "Sepal Length by Iris Species",
        xlab = "Species",
        ylab = "Sepal Length (cm)",
        col = c("lightcoral", "lightblue", "lightgreen"))
Interpretation: The box plot shows that sepal length differs across iris species: setosa has the shortest sepals on average and virginica the longest, although the ranges of versicolor and virginica overlap. This demonstrates how a categorical variable (species) can help explain patterns in a quantitative variable (sepal length).
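
The visual comparison can be backed with numbers. A minimal sketch using tapply() and aggregate() from base R to summarize sepal length per species:

```r
# Mean sepal length for each species
mean_by_species <- tapply(iris$Sepal.Length, iris$Species, mean)
print(round(mean_by_species, 2))

# Median and standard deviation per species in one data frame
stats_by_species <- aggregate(Sepal.Length ~ Species, data = iris,
                              FUN = function(x) c(median = median(x), sd = sd(x)))
print(stats_by_species)
```

The group means increase from setosa to versicolor to virginica, matching the ordering seen in the box plot.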

Topic 3: Data Type Conversion and Practical Handling

In real-world data analysis, you’ll often need to convert between different data types and handle mixed data types appropriately. Understanding how to work with different data types is crucial for effective data analysis in R. Data type conversion ensures that your analysis methods match the nature of your data.

Common conversion scenarios include converting character data to factors for categorical analysis, converting factors to numeric for calculations, and handling date-time data. Proper data type handling prevents errors in analysis and ensures that statistical methods are applied correctly.

# Common data type conversions in R
# Creating sample mixed data
mixed_data <- data.frame(
  id = 1:6,
  age = c("25", "30", "35", "28", "32", "29"), # Numbers stored as character strings
  score = c(85, 92, 78, 88, 95, 82),
  grade = c("A", "B", "A", "C", "B", "A"),
  passed = c("TRUE", "TRUE", "FALSE", "TRUE", "TRUE", "FALSE")
)

# Check current data types
str(mixed_data)

# Convert character to numeric
mixed_data$age <- as.numeric(mixed_data$age)

# Convert character to factor (categorical)
mixed_data$grade <- as.factor(mixed_data$grade)

# Convert character to logical
mixed_data$passed <- as.logical(mixed_data$passed)

# Check converted data types
str(mixed_data)

# Working with ordered factors
satisfaction <- c("Low", "High", "Medium", "Low", "High")
satisfaction_ordered <- factor(satisfaction,
                               levels = c("Low", "Medium", "High"),
                               ordered = TRUE)

# Check if ordered
is.ordered(satisfaction_ordered)
Practical Example:

When importing data from CSV files, a numeric column is read as character strings if any entry cannot be parsed as a number, for example values containing currency symbols or thousands separators, or missing values coded as text such as “N/A”. You’ll need to clean and convert such columns to appropriate numeric types. Similarly, categorical variables are often imported as character strings (the default since R 4.0.0) when they should be factors for statistical modeling.
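
A minimal sketch of this situation, using an inline CSV string in place of a real file. The column names and values are made up for illustration; text, na.strings, and stringsAsFactors are standard read.csv() arguments.

```r
# Hypothetical CSV content: "N/A" marks a missing price, and one price has a "$"
csv_text <- "item,price,category
pen,1.50,office
desk,N/A,furniture
lamp,$20,furniture"

# Without cleaning, the price column is read as character
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)
print(class(raw$price))

# Declare the missing-value code, strip the currency symbol, then convert
clean <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                  na.strings = c("NA", "N/A"))
clean$price <- as.numeric(gsub("$", "", clean$price, fixed = TRUE))
clean$category <- factor(clean$category)
str(clean)
```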

Important Considerations:
  • Always check data types after importing data
  • Use factors for categorical variables in statistical models
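  • Be careful when converting factors to numeric: use as.numeric(as.character(x)), because as.numeric(x) applied to a factor returns the internal level codes rather than the original values
  • Identify how missing values are coded before type conversion; strings that cannot be parsed as numbers become NA (with a warning) under as.numeric()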
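
The last two considerations can be seen in a short sketch. The vectors below are made-up illustrative values, not data from the examples above.

```r
# A factor whose labels look like numbers
year_factor <- factor(c("2024", "2025", "2024", "2026"))

# Pitfall: as.numeric() on a factor returns the internal level codes
codes <- as.numeric(year_factor)
print(codes)

# Safer: go through character first to recover the original values
years <- as.numeric(as.character(year_factor))
print(years)

# Strings that are not numbers become NA (R also emits a warning,
# suppressed here to keep the output clean)
messy <- c("25", "thirty", "35")
ages <- suppressWarnings(as.numeric(messy))
print(ages)
print(sum(is.na(ages)))  # number of values that failed to convert
```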

Practice Exercise: Data Type Conversion and Analysis

Create a dataset with mixed data types and perform the following operations:

  1. Create a data frame with character, numeric, and logical data
  2. Convert appropriate columns to correct data types
  3. Create a categorical variable with natural ordering and convert to ordered factor
  4. Perform appropriate analysis based on the converted data types
  5. Create visualizations that match the data types

Answer and Solutions

# 1. Create dataset with mixed data types
student_data <- data.frame(
  student_id = 1:8,
  name = c("Alice", "Bob", "Charlie", "Diana", "Eve", "Frank", "Grace", "Henry"),
  age = c("21", "22", "20", "23", "21", "22", "20", "24"),
  gpa = c(3.8, 3.2, 3.9, 3.5, 3.7, 3.1, 3.6, 3.4),
  major = c("CS", "Math", "CS", "Physics", "Math", "CS", "Physics", "Math"),
  graduation_year = c("2024", "2024", "2025", "2024", "2025", "2024", "2025", "2024"),
  scholarship = c("TRUE", "FALSE", "TRUE", "TRUE", "FALSE", "TRUE", "FALSE", "TRUE"),
  performance = c("Excellent", "Good", "Excellent", "Average", "Good", "Average", "Good", "Excellent")
)

# Check initial structure
cat("Initial data structure:\n")
str(student_data)

# 2. Convert to appropriate data types
student_data$age <- as.numeric(student_data$age)
student_data$major <- as.factor(student_data$major)
student_data$graduation_year <- as.factor(student_data$graduation_year)
student_data$scholarship <- as.logical(student_data$scholarship)

# 3. Create ordered factor for performance
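# "Poor" does not occur in this data but is kept as a valid level,
# so it appears with a count of zero in tables and bar plots
student_data$performance <- factor(student_data$performance,
                                   levels = c("Poor", "Average", "Good", "Excellent"),
                                   ordered = TRUE)

# Check converted structure
cat("\nConverted data structure:\n")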
str(student_data)

# 4. Perform analysis based on data types
# Quantitative analysis for GPA
cat("\nGPA Summary:\n")
print(summary(student_data$gpa)) # print() so the summary also appears when the script is run with source()
cat("Standard Deviation:", sd(student_data$gpa), "\n")

# Categorical analysis for major
cat("\nMajor Distribution:\n")
major_table <- table(student_data$major)
print(major_table)

# 5. Create appropriate visualizations
par(mfrow = c(2, 2))

# Histogram for quantitative data (GPA)
hist(student_data$gpa,
     main = "Distribution of GPA",
     xlab = "GPA",
     col = "lightblue")

# Bar plot for categorical data (Major)
barplot(major_table,
        main = "Students by Major",
        xlab = "Major",
        ylab = "Count",
        col = "lightgreen")

# Box plot comparing GPA across majors
boxplot(gpa ~ major,
        data = student_data,
        main = "GPA by Major",
        xlab = "Major",
        ylab = "GPA",
        col = c("lightcoral", "lightblue", "lightgreen"))

# Bar plot for ordered categorical data
performance_table <- table(student_data$performance)
barplot(performance_table,
        main = "Performance Levels",
        xlab = "Performance",
        ylab = "Count",
        col = "gold")

par(mfrow = c(1, 1)) # Reset layout
Key Learning: Proper data type conversion ensures that you can apply appropriate statistical methods and visualizations. Quantitative data benefits from measures of central tendency and dispersion, while categorical data is best analyzed with frequency counts and proportions.