R Programming: Understanding Data Types – Univariate, Multivariate, Categorical & Quantitative
Topic 1: Understanding Univariate and Multivariate Data
In data analysis, understanding the difference between univariate and multivariate data is fundamental. Univariate data involves analysis of a single variable, focusing on describing its patterns and characteristics. This type of analysis helps us understand the distribution, central tendency, and spread of individual variables. Common univariate analyses include calculating mean, median, mode, standard deviation, and creating histograms or box plots.
Multivariate data, on the other hand, involves two or more variables analyzed together to understand the relationships between them (the two-variable case is often called bivariate analysis). This approach helps identify correlations, patterns, and interactions that might not be apparent when examining variables individually. Multivariate analysis is crucial for understanding complex real-world phenomena where multiple factors interact.
- Univariate Analysis: Analysis of one variable at a time
- Multivariate Analysis: Analysis of multiple variables simultaneously
# Analyzing a single variable – student ages
student_ages <- c(18, 19, 20, 21, 22, 19, 20, 23, 18, 21)
# Univariate statistics
mean_age <- mean(student_ages)
median_age <- median(student_ages)
sd_age <- sd(student_ages)
# Multivariate analysis examples
# Analyzing relationship between age and test scores
student_data <- data.frame(
  age = c(18, 19, 20, 21, 22),
  test_score = c(85, 78, 92, 88, 95),
  study_hours = c(10, 8, 15, 12, 18)
)
# Multivariate analysis – correlation
correlation_matrix <- cor(student_data)
print(correlation_matrix)
Imagine you’re analyzing customer data for an e-commerce store. Univariate analysis would look at individual metrics like “average purchase amount” or “most common product category.” Multivariate analysis would examine how “purchase amount” relates to “customer age,” “browsing time,” and “number of previous purchases” simultaneously.
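The e-commerce scenario can be sketched in a few lines of R. The data frame below is hypothetical: the column names and every value are invented for illustration, and only base R functions are used.

```r
# Hypothetical customer records; every value below is invented for illustration.
customers <- data.frame(
  purchase_amount    = c(35.5, 120.0, 64.2, 18.9, 210.4, 75.0, 49.9, 132.6),
  customer_age       = c(23, 41, 35, 19, 52, 38, 27, 45),
  browsing_minutes   = c(5, 22, 12, 3, 30, 15, 8, 25),
  previous_purchases = c(1, 6, 3, 0, 9, 4, 2, 7),
  product_category   = c("Books", "Electronics", "Clothing", "Books",
                         "Electronics", "Books", "Books", "Electronics")
)

# Univariate: one variable at a time
avg_purchase <- mean(customers$purchase_amount)
category_counts <- table(customers$product_category)
most_common_category <- names(category_counts)[which.max(category_counts)]
print(avg_purchase)
print(most_common_category)

# Multivariate: several numeric variables considered together
numeric_cols <- c("purchase_amount", "customer_age",
                  "browsing_minutes", "previous_purchases")
purchase_cor <- cor(customers[, numeric_cols])
print(round(purchase_cor, 2))
```

Base R has no built-in function for the statistical mode, which is why the most common category is read off a frequency table here. Note that which.max() returns only the first maximum, so ties need separate handling.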
| Aspect | Univariate Analysis | Multivariate Analysis |
|---|---|---|
| Variables | One variable | Multiple variables |
| Purpose | Describe individual variables | Understand relationships between variables |
| Common Techniques | Mean, median, histogram, box plot | Correlation, regression, PCA, clustering |
| Complexity | Low | High |
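Two of the multivariate techniques named in the table, regression and PCA, can be tried with base R alone. The sketch below uses the built-in trees dataset; the dataset and the choice of variables are assumptions made for illustration, not part of the original lesson.

```r
# Built-in dataset with three numeric columns: Girth, Height, Volume
data(trees)

# Multiple linear regression: Volume modeled by Girth and Height together
fit <- lm(Volume ~ Girth + Height, data = trees)
print(coef(fit))

# Principal component analysis on all three columns.
# scale. = TRUE standardizes each variable so that units do not dominate.
pca <- prcomp(trees, scale. = TRUE)

# Share of total variance captured by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))
```

Both calls take several variables at once, which is what separates them from the univariate summaries in the left-hand column.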
Practice Exercise: Univariate vs Multivariate Analysis
Using the built-in mtcars dataset in R, perform the following tasks:
- Conduct univariate analysis on the mpg (miles per gallon) variable
- Calculate mean, median, and standard deviation for mpg
- Create a histogram of mpg distribution
- Perform multivariate analysis by examining the relationship between mpg and hp (horsepower)
- Create a scatter plot showing mpg vs hp and calculate their correlation
Answer and Solutions
data(mtcars)
# 1. Univariate analysis on mpg
# 2. Calculate basic statistics
mpg_mean <- mean(mtcars$mpg)
mpg_median <- median(mtcars$mpg)
mpg_sd <- sd(mtcars$mpg)
cat("MPG Statistics:\n")
cat("Mean:", mpg_mean, "\n")
cat("Median:", mpg_median, "\n")
cat("Standard Deviation:", mpg_sd, "\n")
# 3. Create histogram
hist(mtcars$mpg,
     main = "Distribution of Miles Per Gallon",
     xlab = "MPG",
     col = "lightblue",
     border = "black")
# 4. & 5. Multivariate analysis: mpg vs hp
plot(mtcars$hp, mtcars$mpg,
     main = "MPG vs Horsepower",
     xlab = "Horsepower",
     ylab = "Miles Per Gallon",
     pch = 19,
     col = "red")
# Calculate correlation
mpg_hp_cor <- cor(mtcars$mpg, mtcars$hp)
cat("Correlation between MPG and HP:", mpg_hp_cor, "\n")
# Add fitted regression line
abline(lm(mpg ~ hp, data = mtcars), col = "blue")
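As an optional follow-up to the solution, the correlation can be checked a little further. This is a sketch of one possible extension rather than part of the exercise: cor.test() adds a confidence interval, and the Spearman coefficient is a rank-based cross-check that may be useful if the scatter plot does not look like a straight line.

```r
data(mtcars)

# Pearson correlation with a significance test and confidence interval
pearson_test <- cor.test(mtcars$mpg, mtcars$hp)
print(pearson_test$estimate)   # same value as cor(mtcars$mpg, mtcars$hp)
print(pearson_test$conf.int)

# Rank-based alternative that only assumes a monotonic relationship
spearman_rho <- cor(mtcars$mpg, mtcars$hp, method = "spearman")
print(spearman_rho)
```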
Topic 2: Understanding Categorical and Quantitative Data
Data can be broadly classified into two main types: categorical and quantitative. Categorical data (also called qualitative data) represents characteristics or qualities that can be grouped into categories. This type of data describes qualities rather than quantities and is typically non-numeric. Categorical data can be further divided into nominal (no inherent order) and ordinal (natural order).
Quantitative data represents numerical measurements or counts. Because the values are numbers, arithmetic on them (sums, differences, averages) is meaningful. Quantitative data is further classified as discrete (countable values, such as whole-number counts) or continuous (any value within a range, limited only by measurement precision).
- Categorical Data: Data that can be grouped into categories (e.g., colors, types, brands)
- Quantitative Data: Numerical data that can be measured (e.g., height, weight, temperature)
# Nominal categorical data (no order)
car_brands <- c("Toyota", "Honda", "Ford", "Toyota", "BMW", "Honda")
car_brands_factor <- factor(car_brands)
# Ordinal categorical data (has order)
satisfaction_levels <- c("Low", "Medium", "High", "Medium", "High")
satisfaction_factor <- factor(satisfaction_levels,
                              levels = c("Low", "Medium", "High"),
                              ordered = TRUE)
# Quantitative data examples
# Discrete quantitative data
number_of_children <- c(0, 1, 2, 3, 1, 0, 2, 4, 1)
# Continuous quantitative data
student_heights <- c(165.2, 170.5, 175.1, 162.8, 180.3, 168.9)
# Analyzing categorical data
brand_frequency <- table(car_brands)
brand_proportions <- prop.table(brand_frequency)
# Analyzing quantitative data
height_summary <- summary(student_heights)
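The practical difference between nominal and ordinal factors shows up when values are compared. The sketch below rebuilds small versions of the vectors above so that it runs on its own; the variable names are shortened for illustration.

```r
satisfaction <- factor(c("Low", "Medium", "High", "Medium", "High"),
                       levels = c("Low", "Medium", "High"),
                       ordered = TRUE)
brands <- factor(c("Toyota", "Honda", "Ford", "Toyota", "BMW", "Honda"))

# Ordered factors support comparisons and min/max
at_least_medium <- satisfaction >= "Medium"
print(at_least_medium)
print(max(satisfaction))

# Unordered (nominal) factors do not: the comparison yields NA with a warning
nominal_compare <- suppressWarnings(brands > "Ford")
print(nominal_compare)

# Discrete vs continuous is a conceptual distinction; R stores both as numeric
children <- c(0, 1, 2, 3, 1, 0, 2, 4, 1)
heights <- c(165.2, 170.5, 175.1, 162.8, 180.3, 168.9)
print(c(class(children), class(heights)))
```

Declaring the level order explicitly matters: without the levels argument, R sorts levels alphabetically, which would place "High" before "Low".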
In a customer survey, categorical data includes responses like “product color” (red, blue, green) or “satisfaction level” (very satisfied, satisfied, neutral, dissatisfied). Quantitative data includes “age,” “annual income,” or “number of purchases.” The analysis methods differ: for categorical data, we use frequency tables and bar charts; for quantitative data, we use measures of central tendency and dispersion.
| Characteristic | Categorical Data | Quantitative Data |
|---|---|---|
| Nature | Qualitative, descriptive | Numerical, measurable |
| Examples | Gender, color, brand | Height, weight, temperature |
| Analysis Methods | Frequency tables, mode, chi-square | Mean, median, standard deviation |
| Visualization | Bar charts, pie charts | Histograms, scatter plots |
| Mathematical Operations | Counting, grouping (plus ranking for ordinal data) | Arithmetic operations (sums, differences, averages) |
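The categorical column of the table mentions mode and chi-square, neither of which appears in the code so far. A minimal sketch follows, using the built-in HairEyeColor table; the choice of dataset is an assumption made for illustration.

```r
# HairEyeColor is a built-in 3-way table (Hair x Eye x Sex);
# collapsing over Sex gives a two-way contingency table.
hair_eye <- margin.table(HairEyeColor, margin = c(1, 2))
print(hair_eye)

# Mode of a categorical variable: the category with the highest count
hair_counts <- margin.table(HairEyeColor, margin = 1)
modal_hair <- names(hair_counts)[which.max(hair_counts)]
print(modal_hair)

# Chi-square test of independence between hair color and eye color
chi_result <- chisq.test(hair_eye)
print(chi_result)
```

A small p-value here suggests that hair color and eye color are not independent in this sample. With sparse tables chisq.test() warns that the approximation may be unreliable, so cell counts are worth checking first.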
Practice Exercise: Categorical vs Quantitative Data Analysis
Using the built-in iris dataset in R, perform the following tasks:
- Identify which variables are categorical and which are quantitative
- For categorical variable(s), create a frequency table and bar plot
- For quantitative variables, calculate summary statistics (mean, median, sd)
- Create histograms for two quantitative variables
- Compare the distribution of a quantitative variable across different categories
Answer and Solutions
data(iris)
# 1. Identify variable types
str(iris)
cat("\nVariable Types:\n")
cat("Sepal.Length: Quantitative (continuous)\n")
cat("Sepal.Width: Quantitative (continuous)\n")
cat("Petal.Length: Quantitative (continuous)\n")
cat("Petal.Width: Quantitative (continuous)\n")
cat("Species: Categorical (nominal)\n")
# 2. Analyze categorical variable (Species)
species_freq <- table(iris$Species)
cat("Species Frequency Table:\n")
print(species_freq)
# Bar plot for species
barplot(species_freq,
        main = "Frequency of Iris Species",
        xlab = "Species",
        ylab = "Frequency",
        col = c("lightcoral", "lightblue", "lightgreen"))
# 3. Summary statistics for quantitative variables
quantitative_vars <- iris[, 1:4] # First 4 columns are quantitative
cat("\nSummary Statistics for Quantitative Variables:\n")
for (col_name in names(quantitative_vars)) {
  cat("\n", col_name, ":\n")
  cat("  Mean:", mean(quantitative_vars[[col_name]]), "\n")
  cat("  Median:", median(quantitative_vars[[col_name]]), "\n")
  cat("  SD:", sd(quantitative_vars[[col_name]]), "\n")
}
# 4. Histograms for two quantitative variables
par(mfrow = c(1, 2)) # Create 1×2 plot layout
hist(iris$Sepal.Length,
     main = "Distribution of Sepal Length",
     xlab = "Sepal Length (cm)",
     col = "lightblue")
hist(iris$Petal.Length,
     main = "Distribution of Petal Length",
     xlab = "Petal Length (cm)",
     col = "lightgreen")
par(mfrow = c(1, 1)) # Reset plot layout
# 5. Compare quantitative variable across categories
# Compare Sepal.Length across different Species
boxplot(Sepal.Length ~ Species,
        data = iris,
        main = "Sepal Length by Iris Species",
        xlab = "Species",
        ylab = "Sepal Length (cm)",
        col = c("lightcoral", "lightblue", "lightgreen"))
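The box plot in step 5 has a numeric counterpart: summary statistics computed within each category. One way to sketch this in base R is with tapply() and aggregate(); other approaches would work equally well.

```r
data(iris)

# Mean sepal length within each species
mean_by_species <- tapply(iris$Sepal.Length, iris$Species, mean)
print(mean_by_species)

# Several statistics at once, one row per species
stats_by_species <- aggregate(Sepal.Length ~ Species, data = iris,
                              FUN = function(x) c(mean = mean(x), sd = sd(x)))
print(stats_by_species)
```

This pattern, a quantitative variable summarized across the levels of a categorical one, is the simplest case where the two data types are analyzed together.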
Topic 3: Data Type Conversion and Practical Handling
In real-world data analysis, you’ll often need to convert between different data types and handle mixed data types appropriately. Understanding how to work with different data types is crucial for effective data analysis in R. Data type conversion ensures that your analysis methods match the nature of your data.
Common conversion scenarios include converting character data to factors for categorical analysis, converting factors to numeric for calculations, and handling date-time data. Proper data type handling prevents errors in analysis and ensures that statistical methods are applied correctly.
# Creating sample mixed data
mixed_data <- data.frame(
  id = 1:6,
  age = c("25", "30", "35", "28", "32", "29"), # Character numbers
  score = c(85, 92, 78, 88, 95, 82),
  grade = c("A", "B", "A", "C", "B", "A"),
  passed = c("TRUE", "TRUE", "FALSE", "TRUE", "TRUE", "FALSE")
)
# Check current data types
str(mixed_data)
# Convert character to numeric
mixed_data$age <- as.numeric(mixed_data$age)
# Convert character to factor (categorical)
mixed_data$grade <- as.factor(mixed_data$grade)
# Convert character to logical
mixed_data$passed <- as.logical(mixed_data$passed)
# Check converted data types
str(mixed_data)
# Working with ordered factors
satisfaction <- c("Low", "High", "Medium", "Low", "High")
satisfaction_ordered <- factor(satisfaction,
                               levels = c("Low", "Medium", "High"),
                               ordered = TRUE)
# Check if ordered
is.ordered(satisfaction_ordered)
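Date-time handling is listed among the common conversion scenarios but is not covered by the example above. A minimal sketch with base R's as.Date() follows; the dates are hypothetical values invented for illustration.

```r
# Hypothetical enrollment dates stored as text (values invented for illustration)
enrolled_chr <- c("2024-01-15", "2024-03-01", "2023-09-20")

# Convert character to Date; the format string describes the input layout
enrolled <- as.Date(enrolled_chr, format = "%Y-%m-%d")
print(class(enrolled))

# Dates support arithmetic and ordering once converted
days_between <- as.numeric(enrolled[2] - enrolled[1])
print(days_between)
print(sort(enrolled))

# Extract components as text with format()
print(format(enrolled, "%Y"))
```

While the dates remain character strings, subtraction fails and sorting is only alphabetical, which happens to work for this layout but not for formats such as day/month/year.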
When importing data from CSV files, numeric columns may be read as character strings if any entry contains non-numeric text, such as currency symbols, thousands separators, or a missing-value code like “N/A” that R does not recognize. You’ll need to clean these columns and convert them to numeric. Similarly, categorical variables are often imported as character strings (the default for read.csv() since R 4.0) when they should be factors for statistical modeling.
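The import problem can be reproduced without a file by passing text to read.csv(). The CSV content below is invented for illustration, and the sketch assumes R 4.0 or later, where strings are not converted to factors by default.

```r
# A small inline CSV standing in for a file; contents are invented for illustration.
csv_text <- "id,income,region
1,52000,North
2,N/A,South
3,\"61,500\",North
4,48000,West"

# Naive import: income is not numeric because of "N/A" and the comma
raw <- read.csv(text = csv_text)
str(raw)

# Treat "N/A" as missing at import, then strip thousands separators
clean <- read.csv(text = csv_text, na.strings = c("", "NA", "N/A"))
clean$income <- as.numeric(gsub(",", "", clean$income))
clean$region <- factor(clean$region)
str(clean)
```

Declaring missing-value codes through na.strings at import time is usually cleaner than patching them afterwards, because the NA values are then explicit before any conversion runs.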
- Always check data types after importing data
- Use factors for categorical variables in statistical models
- Be careful when converting factors to numeric: as.numeric() on a factor returns the internal level codes, so use as.numeric(as.character(x)) to recover the original values
- Handle missing values before type conversion
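The factor-to-numeric warning in the list above is easiest to understand by seeing it go wrong. The values in this sketch are invented for illustration.

```r
# Numbers that arrived as a factor (values invented for illustration)
scores_factor <- factor(c("10", "20", "10", "30"))

# Pitfall: as.numeric() on a factor returns the internal level codes
codes <- as.numeric(scores_factor)
print(codes)

# Safe route: go through character first
values <- as.numeric(as.character(scores_factor))
print(values)

# Unparseable entries become NA with a warning, so check for them explicitly
raw_ages <- c("25", "thirty", "35")
ages <- suppressWarnings(as.numeric(raw_ages))
print(which(is.na(ages)))
```

The first conversion raises no error or warning, which is what makes it dangerous: the result looks like valid numbers but reflects only the alphabetical position of each level.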
Practice Exercise: Data Type Conversion and Analysis
Create a dataset with mixed data types and perform the following operations:
- Create a data frame with character, numeric, and logical data
- Convert appropriate columns to correct data types
- Create a categorical variable with natural ordering and convert to ordered factor
- Perform appropriate analysis based on the converted data types
- Create visualizations that match the data types
Answer and Solutions
# 1. Create a data frame with mixed data types
student_data <- data.frame(
  student_id = 1:8,
  name = c("Alice", "Bob", "Charlie", "Diana", "Eve", "Frank", "Grace", "Henry"),
  age = c("21", "22", "20", "23", "21", "22", "20", "24"),
  gpa = c(3.8, 3.2, 3.9, 3.5, 3.7, 3.1, 3.6, 3.4),
  major = c("CS", "Math", "CS", "Physics", "Math", "CS", "Physics", "Math"),
  graduation_year = c("2024", "2024", "2025", "2024", "2025", "2024", "2025", "2024"),
  scholarship = c("TRUE", "FALSE", "TRUE", "TRUE", "FALSE", "TRUE", "FALSE", "TRUE"),
  performance = c("Excellent", "Good", "Excellent", "Average", "Good", "Average", "Good", "Excellent")
)
# Check initial structure
cat("Initial data structure:\n")
str(student_data)
# 2. Convert to appropriate data types
student_data$age <- as.numeric(student_data$age)
student_data$major <- as.factor(student_data$major)
student_data$graduation_year <- as.factor(student_data$graduation_year)
student_data$scholarship <- as.logical(student_data$scholarship)
# 3. Create ordered factor for performance
student_data$performance <- factor(student_data$performance,
                                   levels = c("Poor", "Average", "Good", "Excellent"),
                                   ordered = TRUE)
# Check converted structure
cat("\nConverted data structure:\n")
str(student_data)
# 4. Perform analysis based on data types
# Quantitative analysis for GPA
cat("\nGPA Summary:\n")
print(summary(student_data$gpa)) # print() so the output also appears when the script is sourced
cat("Standard Deviation:", sd(student_data$gpa), "\n")
# Categorical analysis for major
cat("\nMajor Distribution:\n")
major_table <- table(student_data$major)
print(major_table)
# 5. Create appropriate visualizations
par(mfrow = c(2, 2))
# Histogram for quantitative data (GPA)
hist(student_data$gpa,
     main = "Distribution of GPA",
     xlab = "GPA",
     col = "lightblue")
# Bar plot for categorical data (Major)
barplot(major_table,
        main = "Students by Major",
        xlab = "Major",
        ylab = "Count",
        col = "lightgreen")
# Box plot comparing GPA across majors
boxplot(gpa ~ major,
        data = student_data,
        main = "GPA by Major",
        xlab = "Major",
        ylab = "GPA",
        col = c("lightcoral", "lightblue", "lightgreen"))
# Bar plot for ordered categorical data
performance_table <- table(student_data$performance)
barplot(performance_table,
        main = "Performance Levels",
        xlab = "Performance",
        ylab = "Count",
        col = "gold")
par(mfrow = c(1, 1)) # Reset layout
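One detail of the solution is worth a closer look: the performance factor declares a "Poor" level that no student has, so the final bar plot includes an empty bar. The sketch below rebuilds just that column to show the behavior; whether to keep or drop the empty level is a judgment call that depends on the audience.

```r
# Same level set as the solution; "Poor" does not occur in the data
performance <- factor(c("Excellent", "Good", "Excellent", "Average",
                        "Good", "Average", "Good", "Excellent"),
                      levels = c("Poor", "Average", "Good", "Excellent"),
                      ordered = TRUE)

# table() reports every declared level, including empty ones
performance_counts <- table(performance)
print(performance_counts)

# droplevels() removes levels with no observations but keeps the ordering
performance_used <- droplevels(performance)
print(levels(performance_used))
print(is.ordered(performance_used))
```

Keeping the empty level makes it visible that the scale has a "Poor" category that nobody fell into; dropping it gives a more compact plot.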
