R Programming: DataFrames, Data Cleaning, and Plotting
This tutorial covers the fundamentals of working with data in R: creating dataframes, importing and exporting data, cleaning messy datasets, and building visualizations with base R functions. All examples use small, made-up datasets suited to beginners.
1. Creating DataFrames in R
A dataframe (data.frame in R) is R’s primary data structure for storing tabular data. Think of it as a spreadsheet with rows and columns: every value within a column has the same type, but different columns can hold different types (numbers, text, logical values).
Creating a Simple DataFrame
Let’s create a simple dataset of student information:
# Create vectors of data
student_id <- 1:5
student_names <- c("Alice", "Bob", "Charlie", "Diana", "Eve")
math_scores <- c(85, 92, 78, 96, 88)
science_scores <- c(90, 85, 92, 79, 95)
passed <- c(TRUE, TRUE, TRUE, TRUE, TRUE)
# Combine into a dataframe
students_df <- data.frame(
id = student_id,
name = student_names,
math = math_scores,
science = science_scores,
passed = passed
)
# View the dataframe
print(students_df)
This creates the following dataframe:
| id | name | math | science | passed |
|---|---|---|---|---|
| 1 | Alice | 85 | 90 | TRUE |
| 2 | Bob | 92 | 85 | TRUE |
| 3 | Charlie | 78 | 92 | TRUE |
| 4 | Diana | 96 | 79 | TRUE |
| 5 | Eve | 88 | 95 | TRUE |
Tip: Use the str() function to examine the structure of your dataframe, showing data types and a preview of the data.
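As a small sketch of the tip above, the following rebuilds a shortened version of the student dataframe and inspects it. The name demo_df and its values are used only for this illustration.

```r
# Rebuild a small version of the student dataframe (illustrative data only)
demo_df <- data.frame(
  id = 1:3,
  name = c("Alice", "Bob", "Charlie"),
  math = c(85, 92, 78),
  passed = c(TRUE, TRUE, TRUE)
)

# str() lists each column with its type and the first few values
str(demo_df)

# Related helpers: dimensions and the class of every column
dim(demo_df)
col_types <- sapply(demo_df, class)
print(col_types)
```

In R 4.0.0 and later the name column is reported as character; older versions convert it to a factor unless stringsAsFactors = FALSE is set.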
2. Importing and Exporting Data
In practice, you’ll usually import data from external files rather than creating dataframes manually.
Reading from CSV Files
# Read a CSV file
my_data <- read.csv("data_file.csv")
# header = TRUE (the default) treats the first row as column names; use header = FALSE if the file has none
my_data <- read.csv("data_file.csv", header = TRUE)
# If your CSV uses a different separator (like semicolon)
my_data <- read.csv("data_file.csv", sep = ";")
# Prevent strings from automatically converting to factors (already the default since R 4.0.0)
my_data <- read.csv("data_file.csv", stringsAsFactors = FALSE)
Writing to CSV Files
# Write dataframe to CSV
write.csv(students_df, file = "students_data.csv", row.names = FALSE)
Note: Setting row.names = FALSE prevents R from adding an extra column with row numbers, which is usually not needed in exported data.
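To see the effect of row.names, here is a minimal round-trip sketch. It writes a small, made-up dataframe twice to temporary files (so nothing is left in your working directory) and compares the column names after reading each file back. The variable names are chosen only for this example.

```r
# Small illustrative dataframe (hypothetical data)
demo_df <- data.frame(id = 1:3, name = c("Alice", "Bob", "Charlie"))

# Write to temporary files so nothing is left in the working directory
with_rows <- tempfile(fileext = ".csv")
without_rows <- tempfile(fileext = ".csv")
write.csv(demo_df, file = with_rows)                        # default: row.names = TRUE
write.csv(demo_df, file = without_rows, row.names = FALSE)

# Read both files back and compare the column names
names_with <- names(read.csv(with_rows))
names_without <- names(read.csv(without_rows))
print(names_with)     # includes an extra first column, which read.csv names "X"
print(names_without)  # only the original columns
```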
3. Data Cleaning and Preparation
Real-world data is often messy. Let’s create a dataset with common issues and learn how to fix them.
# Create a dataset with common data issues
messy_data <- data.frame(
id = 1:6,
name = c("Alice", "BOB", "charlie", "Diana", "EVE", "Frank"),
age = c(20, 25, NA, 22, 30, 35),
score = c("85", "92", "78", "ninety-six", "88", "95"),
grade = c("B", "A", "C", "A", "B", "A"),
date_joined = c("2023-01-15", "2023-02-20", "2023-01-10",
"2023-03-05", "2023-02-28", "2023-04-12")
)
print(messy_data)
Common Data Cleaning Tasks
# 1. Check for missing values
sum(is.na(messy_data)) # Total missing values
colSums(is.na(messy_data)) # Missing values by column
# 2. Fix inconsistent text (convert to proper case)
messy_data$name <- tolower(messy_data$name) # First make all lowercase
substr(messy_data$name, 1, 1) <- toupper(substr(messy_data$name, 1, 1)) # Capitalize first letter
# 3. Handle missing values
# Option A: Remove rows with missing values (the result is stored separately; messy_data itself is unchanged)
clean_data <- na.omit(messy_data)
# Option B: Fill missing values (with mean for numeric columns)
mean_age <- mean(messy_data$age, na.rm = TRUE)
messy_data$age[is.na(messy_data$age)] <- mean_age
# 4. Fix data types
# Convert score from character to numeric (non-numeric entries such as "ninety-six" become NA, with a warning)
messy_data$score <- as.numeric(messy_data$score)
# Convert date from character to Date type
messy_data$date_joined <- as.Date(messy_data$date_joined)
# Convert grade to factor (categorical variable)
messy_data$grade <- as.factor(messy_data$grade)
# 5. Check the cleaned data
str(messy_data)
summary(messy_data)
Tip: Always check your data after cleaning using str() and summary() to ensure all transformations worked as expected.
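One check worth making before converting a text column to numbers is which entries will fail. The sketch below uses a standalone vector with the same values as the score column; raw_score, num_score, and failed are names invented for this example.

```r
# Illustrative character vector with one entry that is not a number
raw_score <- c("85", "92", "78", "ninety-six", "88", "95")

# Convert; suppressWarnings() hides the "NAs introduced by coercion" warning
num_score <- suppressWarnings(as.numeric(raw_score))

# A value failed to convert if it was present before but is NA afterwards
failed <- !is.na(raw_score) & is.na(num_score)
print(raw_score[failed])   # the entries that need manual attention
print(which(failed))       # their positions in the vector
```

Listing the failed entries first lets you decide whether to correct them by hand (for example, replacing "ninety-six" with 96) or accept them as missing.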
4. Data Manipulation
Once your data is clean, you often need to transform it for analysis.
# Create a simple dataset for manipulation
sales_data <- data.frame(
month = c("Jan", "Feb", "Mar", "Apr", "May", "Jun"),
product_a = c(150, 200, 175, 220, 190, 210),
product_b = c(180, 160, 195, 170, 205, 185),
region = c("North", "South", "North", "South", "North", "South")
)
# 1. Add a new column (total sales)
sales_data$total_sales <- sales_data$product_a + sales_data$product_b
# 2. Create a conditional column
sales_data$performance <- ifelse(sales_data$total_sales > 380, "High", "Low")
# 3. Subset data (filter rows)
high_sales <- sales_data[sales_data$total_sales > 380, ]
north_region <- sales_data[sales_data$region == "North", ]
# 4. Select specific columns
product_data <- sales_data[, c("month", "product_a", "product_b")]
# 5. Sort data
sales_sorted <- sales_data[order(sales_data$total_sales, decreasing = TRUE), ]
print(sales_data)
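Two further manipulations follow naturally from the steps above: filtering on more than one condition, and summarizing by group. This is a minimal sketch that rebuilds the same sales figures under the name sales_demo so it can run on its own.

```r
# Rebuild the sales data from this section
sales_demo <- data.frame(
  month = c("Jan", "Feb", "Mar", "Apr", "May", "Jun"),
  product_a = c(150, 200, 175, 220, 190, 210),
  product_b = c(180, 160, 195, 170, 205, 185),
  region = c("North", "South", "North", "South", "North", "South")
)
sales_demo$total_sales <- sales_demo$product_a + sales_demo$product_b

# Combine two conditions with & (both must be TRUE for a row to be kept)
north_high <- sales_demo[sales_demo$region == "North" & sales_demo$total_sales > 380, ]
print(north_high)

# Total sales per region: aggregate() applies sum() within each region
region_totals <- aggregate(total_sales ~ region, data = sales_demo, FUN = sum)
print(region_totals)
```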
5. Plotting with Base R
R has powerful built-in plotting functions. Let’s explore the most common types of plots.
Creating Sample Data for Plotting
# Create sample data for plotting
set.seed(123) # For reproducible random numbers
plot_data <- data.frame(
category = rep(c("A", "B", "C", "D"), each = 10),
value = c(rnorm(10, 50, 10), rnorm(10, 60, 8),
rnorm(10, 55, 12), rnorm(10, 65, 9)),
group = rep(c("X", "Y"), 20),
time = 1:40
)
# Add some relationship for scatter plots
plot_data$related_var <- plot_data$value * 1.5 + rnorm(40, 0, 15)
Basic Plot Types
# 1. Histogram - shows distribution of a single variable
hist(plot_data$value,
main = "Distribution of Values",
xlab = "Value",
ylab = "Frequency",
col = "lightblue",
border = "black")
# 2. Boxplot - shows distribution by category
boxplot(value ~ category, data = plot_data,
main = "Values by Category",
xlab = "Category",
ylab = "Value",
col = c("lightcoral", "lightgreen", "lightyellow", "lightblue"))
# 3. Scatter plot - shows relationship between two variables
plot(plot_data$value, plot_data$related_var,
main = "Value vs Related Variable",
xlab = "Value",
ylab = "Related Variable",
pch = 16, # Type of point
col = "darkblue")
# Add a trend line
fit <- lm(related_var ~ value, data = plot_data)
abline(fit, lwd = 2)
# 4. BARPLOT: counts per category
grp_tab <- table(plot_data$category)
barplot(grp_tab, main = "Count by Category", ylab = "Count", xlab = "Category")
# 5. LINE PLOT: Value over Time (time-series)
# Order by Time first
ord <- order(plot_data$time)
plot(plot_data$time[ord], plot_data$value[ord],
type = "o", main = "Value over Time", xlab = "Time", ylab = "Value")
# 6. PAIRS: quick multi-plot to inspect relationships
pairs(plot_data[, c("value","related_var")], main = "Pairs plot (Numeric)")
# 7. PIE CHART: proportion of Groups
pie(table(plot_data$group), main = "Proportion of Groups")
Customizing Plots
# Create a customized plot with multiple elements
plot(plot_data$value, plot_data$related_var,
main = "Customized Scatter Plot",
xlab = "Main Variable",
ylab = "Related Variable",
pch = ifelse(plot_data$group == "X", 16, 17), # Different shapes for groups
col = ifelse(plot_data$group == "X", "blue", "red"), # Different colors for groups
cex = 1.2) # Point size
# Add a legend
legend("topleft",
legend = c("Group X", "Group Y"),
pch = c(16, 17),
col = c("blue", "red"),
title = "Groups")
# Add grid lines
grid()
# Save plot to file
# png("my_plot.png", width = 800, height = 600)
# plot(...)
# dev.off()
Tip: Use par(mfrow = c(2, 2)) to create a 2×2 grid of plots. This is useful for comparing multiple visualizations. Reset with par(mfrow = c(1, 1)).
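The tip and the commented-out png() lines can be combined into one runnable sketch. It assumes your R installation supports the png() device, uses made-up data, and writes to a temporary file rather than a fixed file name.

```r
# Illustrative data (hypothetical values)
set.seed(1)
x <- rnorm(50, mean = 50, sd = 10)
y <- x * 1.5 + rnorm(50, sd = 15)

# Open a PNG device; a temporary file is used here instead of a fixed file name
out_file <- tempfile(fileext = ".png")
png(out_file, width = 800, height = 600)

# Switch to a 2x2 grid and keep the previous settings so they can be restored
old_par <- par(mfrow = c(2, 2))
hist(x, main = "Histogram", xlab = "x", col = "lightblue")
boxplot(x, main = "Boxplot", ylab = "x")
plot(x, y, main = "Scatter", pch = 16)
plot(sort(x), type = "l", main = "Sorted values", ylab = "x")
par(old_par)   # restore the previous layout

# Close the device; the file is only complete after dev.off()
dev.off()
print(file.exists(out_file))
```

Storing the return value of par() and passing it back is an alternative to resetting with par(mfrow = c(1, 1)); it restores whatever layout was active before.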
6. Putting It All Together: Complete Example
Let’s walk through a complete example from data creation to visualization.
# Step 1: Create sample sales data
set.seed(456)
months <- month.name[1:6]
regions <- c("North", "South", "East", "West")
sales <- data.frame(
month = rep(months, each = 4),
region = rep(regions, 6),
revenue = runif(24, 1000, 5000),
expenses = runif(24, 500, 3000)
)
# Step 2: Calculate profit
sales$profit <- sales$revenue - sales$expenses
# Step 3: Add a performance indicator
sales$performance <- ifelse(sales$profit > 2000, "Excellent",
ifelse(sales$profit > 1000, "Good", "Needs Improvement"))
# Step 4: Convert to factors
sales$month <- factor(sales$month, levels = months)
sales$region <- as.factor(sales$region)
sales$performance <- factor(sales$performance,
levels = c("Needs Improvement", "Good", "Excellent"))
# Step 5: Create summary statistics by region
region_summary <- aggregate(profit ~ region, data = sales, mean)
# Step 6: Visualize the data
# Set up a 2x2 plot layout
par(mfrow = c(2, 2))
# Plot 1: Bar chart of average profit by region
barplot(region_summary$profit,
names.arg = region_summary$region,
main = "Average Profit by Region",
ylab = "Profit ($)",
col = "lightgreen")
# Plot 2: Boxplot of profit by performance category
boxplot(profit ~ performance, data = sales,
main = "Profit by Performance",
ylab = "Profit ($)",
col = c("lightcoral", "lightyellow", "lightgreen"))
# Plot 3: Revenue vs Expenses scatter plot
plot(sales$revenue, sales$expenses,
main = "Revenue vs Expenses",
xlab = "Revenue ($)",
ylab = "Expenses ($)",
pch = 16,
col = as.numeric(sales$region))
# Add a reference line for break-even
abline(a = 0, b = 1, lty = 2, col = "red")
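# Plot 4: Profit trend over months (by region)
north_data <- sales[sales$region == "North", ]
south_data <- sales[sales$region == "South", ]
# month is a factor, and plot(factor, numeric) would draw boxplots,
# so plot against the integer month codes and label the axis manually
plot(as.integer(north_data$month), north_data$profit,
type = "o",
xaxt = "n",
main = "Profit Trend: North vs South",
xlab = "Month",
ylab = "Profit ($)",
col = "blue",
ylim = range(sales$profit))
axis(1, at = seq_along(months), labels = substr(months, 1, 3))
lines(as.integer(south_data$month), south_data$profit, type = "o", col = "red")
legend("topright", legend = c("North", "South"), col = c("blue", "red"), lty = 1, pch = 1)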
# Reset plot layout
par(mfrow = c(1, 1))
# Step 7: Save the cleaned data
write.csv(sales, "cleaned_sales_data.csv", row.names = FALSE)
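Step 4 sets the factor levels for month explicitly. The short sketch below shows why: without levels, R sorts them alphabetically, which scrambles calendar order in tables and plots. The vector m is made up for this illustration; month.name is a built-in R constant.

```r
# Month names as plain text, deliberately out of order
m <- c("March", "January", "February", "January")

# Default factor: levels are sorted alphabetically
default_f <- factor(m)
print(levels(default_f))

# Explicit levels: calendar order is preserved
ordered_f <- factor(m, levels = month.name[1:3])
print(levels(ordered_f))

# table() (and plot axes) follow the level order
print(table(ordered_f))
```

The same applies after reading cleaned_sales_data.csv back in: a CSV file stores only the text, so the level order has to be set again with factor().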
Summary of Key R Functions
| Function | Purpose | Example |
|---|---|---|
| data.frame() | Create a dataframe | df <- data.frame(x=1:3, y=c("a","b","c")) |
| read.csv() | Import CSV file | data <- read.csv("file.csv") |
| write.csv() | Export to CSV | write.csv(df, "file.csv") |
| str() | Examine structure | str(df) |
| summary() | Summary statistics | summary(df) |
| is.na() | Find missing values | is.na(df$column) |
| na.omit() | Remove rows with NAs | clean_df <- na.omit(df) |
| as.numeric() | Convert to numeric | df$num <- as.numeric(df$char) |
| as.factor() | Convert to factor | df$cat <- as.factor(df$char) |
| plot() | Create various plots | plot(x, y) |
| hist() | Create histogram | hist(df$values) |
| boxplot() | Create boxplot | boxplot(values ~ group, data=df) |
Practice Tip: The best way to learn R is by doing. Try modifying the examples above: change the data, adjust the plots, and experiment with different functions. Don't worry about making mistakes; that's how we learn!

