R Programming — Data Import, Cleaning & Processing
Clear explanations, examples, and ready-to-run R code (CSV generation → import → clean → process).
R data import and cleaning are core skills. Below we generate sample data (15 rows, 5 columns), save as CSV, show how to import it, then demonstrate common cleaning steps: checking types, handling missing values, renaming, converting factors/numerics, and creating derived columns. Each step is explained with base R functions students will use in real tasks.
The example dataset will simulate a small experiment or sales record: an ID column, a categorical group, two numeric measurements, and a date. After import we’ll:
- inspect structure with str() and head(),
- treat missing values with simple imputation or removal (is.na()),
- coerce types (as.numeric(), as.factor(), as.Date()),
- rename columns, and
- create derived variables using arithmetic or conditional logic (e.g., categorise scores).
Detailed Explanation of Data Generation Code
Below is a clear breakdown of how each column of the dataset was generated in R.
These explanations help students understand why each function is used and what type of data it creates.
🔹 1. ID
ID = sprintf("S%02d", 1:n)
The sprintf() function formats text and numbers.
The pattern "S%02d" means:
- Start every ID with the letter S
- %02d = format numbers so they always have 2 digits (padded with 0 if needed)
This makes IDs readable and neatly aligned, which is helpful for data management.
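To see what the padding buys you, here is a minimal sketch. The specific numbers (1, 9, 15) and the 12-item comparison are illustrative choices, not part of the dataset code.

```r
# %02d pads whole numbers to width 2 with leading zeros
ids <- sprintf("S%02d", c(1, 9, 15))
print(ids)

# Compare padded and unpadded IDs: padded IDs all have the same width,
# so sorting them as text keeps them in numeric order
unpadded <- paste0("S", 1:12)
padded   <- sprintf("S%02d", 1:12)
print(nchar(unpadded))
print(nchar(padded))
print(sort(padded))
```

Unpadded IDs mix two and three characters, which is why text sorting tends to scramble them.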
🔹 2. Group
Group = sample(c("A","B","C"), n, replace = TRUE)
This line randomly assigns each row to one of three categories: A, B, or C.
- sample() picks random values from a vector.
- replace = TRUE allows the same category to appear multiple times.
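The role of replace = TRUE is easiest to see by trying the call without it. This is a small sketch; the seed value is an arbitrary assumption used only so repeated runs match.

```r
set.seed(1)  # arbitrary seed, only for repeatable output

# Draw 15 labels from 3 categories; repeats are required, so replace = TRUE
groups <- sample(c("A", "B", "C"), 15, replace = TRUE)
print(table(groups))  # how many rows landed in each category

# Without replacement, asking for 15 values from 3 is an error
res <- tryCatch(
  sample(c("A", "B", "C"), 15),
  error = function(e) "error"
)
print(res)
```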
🔹 3. Measure1
Measure1 = round(rnorm(n, mean=50, sd=10), 1)
The rnorm() function generates normally distributed random numbers.
- mean = 50 → center of distribution
- sd = 10 → spread/variation
- round(...,1) → round values to 1 decimal place
This is ideal for simulating measurement data such as exam scores or sensor readings.
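A quick way to convince yourself what mean and sd control is to draw a large sample and compare. The sample size of 10000 and the seed below are assumptions chosen for illustration.

```r
set.seed(1)  # arbitrary seed for repeatability

# A large sample should have mean near 50 and sd near 10
x <- rnorm(10000, mean = 50, sd = 10)
sample_mean <- mean(x)
sample_sd   <- sd(x)
print(c(mean = sample_mean, sd = sample_sd))

# Roughly 68% of normal values fall within one sd of the mean
frac_within_1sd <- mean(abs(x - 50) <= 10)
print(frac_within_1sd)

# Rounding to 1 decimal, as in the dataset code
small <- round(rnorm(5, mean = 50, sd = 10), 1)
print(small)
```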
🔹 4. Measure2
Measure2 = round(runif(n, 30, 80), 1)
The runif() function generates values from a uniform distribution between 30 and 80.
Example values: 31.4, 72.8, 58.3.
This is often used when values should be equally likely across a range, such as temperature or random test scores.
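"Equally likely across a range" can be checked by counting how many draws fall into equal-width bins. This is a sketch with an assumed sample size of 1000 and 10-unit bins.

```r
set.seed(1)  # arbitrary seed for repeatability

u <- runif(1000, min = 30, max = 80)
print(range(u))  # every value stays inside the requested interval

# Count values per 10-unit bin; a uniform sample gives roughly equal counts
bins <- table(cut(u, breaks = seq(30, 80, by = 10), include.lowest = TRUE))
print(bins)
```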
🔹 5. Date
Date = seq(as.Date("2025-01-01"), by = "7 days", length.out = n)
This creates a sequence of dates:
- Starting from 2025-01-01
- Incrementing by 7 days (weekly)
- Total of n dates
2025-01-01, 2025-01-08, 2025-01-15, …
This is useful for time-based datasets such as weekly sales, observations, or experimental timelines.
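The date sequence can be inspected on its own. The sketch below also shows that by = "week" is an equivalent spelling of by = "7 days" in seq() for dates.

```r
n <- 15
dates <- seq(as.Date("2025-01-01"), by = "7 days", length.out = n)

print(head(dates, 3))  # first three dates
print(dates[n])        # last date: 14 steps of 7 days after the start
print(diff(dates))     # every gap is 7 days

# "week" is an equivalent step specification
weekly <- seq(as.Date("2025-01-01"), by = "week", length.out = n)
print(all(dates == weekly))
```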
End of data generation explanation — students can now understand how each variable was created.
# 1) GENERATE SAMPLE DATA (5 columns x 15 rows) - run in R
set.seed(42)
n <- 15
df <- data.frame(
  ID = sprintf("S%02d", 1:n),                                        # ID: character
  Group = sample(c("A","B","C"), n, replace = TRUE),                 # Group: categorical
  Measure1 = round(rnorm(n, mean=50, sd=10),1),                      # numeric
  Measure2 = round(runif(n, 30, 80),1),                              # numeric
  Date = seq(as.Date("2025-01-01"), by = "7 days", length.out = n)   # Date
)
# Introduce some NAs for cleaning examples
df$Measure1[c(3,9)] <- NA
df$Group[5] <- NA
# Write to CSV in working directory
write.csv(df, file = "sample_data_rstudy.csv", row.names = FALSE)
# Check file created
list.files(pattern = "sample_data_rstudy.csv")
Explanation of the R code above:
- set.seed() ensures reproducible random numbers (important for exercises).
- We build a data.frame with 5 columns and 15 rows: ID, Group, two numeric measures, and a date column.
- write.csv(..., row.names = FALSE) writes a CSV without R row numbers, which keeps the CSV clean and portable.
- We intentionally insert a few NAs to show cleaning steps later.
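To see why row.names = FALSE matters, the sketch below writes the same tiny data frame both ways and reads each file back. The two-row data frame and the temporary file paths are illustrative assumptions, not the tutorial's dataset.

```r
# Tiny stand-in data frame (hypothetical values)
tiny <- data.frame(ID = c("S01", "S02"), Score = c(51.2, NA))

with_rn    <- tempfile(fileext = ".csv")
without_rn <- tempfile(fileext = ".csv")

write.csv(tiny, with_rn)                        # default keeps row numbers
write.csv(tiny, without_rn, row.names = FALSE)  # clean version

# Reading back: the row numbers become an extra, unnamed column called "X"
cols_with    <- names(read.csv(with_rn))
cols_without <- names(read.csv(without_rn))
print(cols_with)
print(cols_without)

# Missing values are written as NA and come back as NA
print(readLines(without_rn))
back <- read.csv(without_rn)
```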
# 2) IMPORT CSV
# Use read.csv() which is a base-R function
data_in <- read.csv("sample_data_rstudy.csv", stringsAsFactors = FALSE)
# Quick checks
head(data_in)
str(data_in)
summary(data_in)
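Import notes:
- read.csv() imports CSVs. Setting stringsAsFactors = FALSE avoids automatic conversion of strings to factors (gives you control). It has been the default since R 4.0.0, so stating it explicitly mainly protects scripts that run on older versions.
- head() shows the first rows.
- str() reveals column types (character, numeric, etc.).
- summary() provides min, quartiles, mean, max, and the NA count for numeric columns. For character columns it reports only length, class, and mode, so convert a column to a factor if you want counts per category.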
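One thing worth checking right after import is that dates arrive as plain text. The sketch below uses a small hand-written CSV (an assumption, so the example is self-contained) and shows the colClasses argument of read.csv() as an alternative to converting columns afterwards.

```r
# Write a tiny CSV by hand so the example does not depend on earlier steps
csv_path <- tempfile(fileext = ".csv")
writeLines(c("ID,Group,Measure1,Date",
             "S01,A,51.2,2025-01-01",
             "S02,NA,NA,2025-01-08"), csv_path)

# Default import: the Date column is read as character
raw <- read.csv(csv_path, stringsAsFactors = FALSE)
print(sapply(raw, class))

# The text "NA" is treated as a missing value on import
print(colSums(is.na(raw)))

# Alternative: declare types at import time for selected columns
typed <- read.csv(csv_path, colClasses = c(Date = "Date", Group = "factor"))
print(sapply(typed, class))
```

Converting after import, as the cleaning code in this tutorial does, works just as well; declaring types up front simply fails earlier when a column does not match.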
# 3) CLEANING & PROCESSING
# Convert types
data_in$ID <- as.character(data_in$ID)
data_in$Group <- as.factor(data_in$Group) # treat as factor (category)
data_in$Date <- as.Date(data_in$Date) # convert date column
data_in$Measure1 <- as.numeric(data_in$Measure1) # ensure numeric
data_in$Measure2 <- as.numeric(data_in$Measure2)
# Detect missing values
colSums(is.na(data_in)) # shows count of NAs per column
# Simple strategies:
# a) Remove rows with NAs:
data_dropna <- na.omit(data_in)
# b) Impute missing numeric values with mean (example for Measure1)
mean_m1 <- mean(data_in$Measure1, na.rm = TRUE)
data_impute <- data_in
data_impute$Measure1[is.na(data_impute$Measure1)] <- round(mean_m1,1)
# c) Fill missing Group with "Unknown"
data_impute$Group <- as.character(data_impute$Group)
data_impute$Group[is.na(data_impute$Group) | data_impute$Group==""] <- "Unknown"
data_impute$Group <- as.factor(data_impute$Group)
# Derived column: average of measures and a categorical flag
data_impute$Avg <- round((data_impute$Measure1 + data_impute$Measure2)/2,1)
data_impute$HighAvg <- ifelse(data_impute$Avg >= 55, "High", "Low")
data_impute$HighAvg <- as.factor(data_impute$HighAvg)
# Final check
str(data_impute)
head(data_impute)
Cleaning explanation and tips:
- Always confirm column types with str(). Dates must be Date objects for time series plotting.
- Handle missing values deliberately: removal (na.omit()) is simple but may bias results; imputation (mean/median or domain-specific) preserves row count.
- Converting categories to factors (as.factor()) is useful for grouping, table counts, and plotting categories.
- Creating derived features (like Avg) is commonly needed before plotting or modeling.
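Two steps from the plan at the top, renaming a column and categorising scores, can be sketched with base R as follows. The four-row data frame, the new name Score, and the band cut-offs (50 and 60) are illustrative assumptions.

```r
# Small stand-in data frame (hypothetical values)
scores <- data.frame(ID = c("S01", "S02", "S03", "S04"),
                     Measure1 = c(42.0, 55.5, NA, 68.3),
                     stringsAsFactors = FALSE)

# Rename one column by matching its current name
names(scores)[names(scores) == "Measure1"] <- "Score"

# Median imputation: an alternative to the mean that is less affected by outliers
scores$Score[is.na(scores$Score)] <- median(scores$Score, na.rm = TRUE)

# Categorise scores into bands; right = FALSE makes intervals [low, high)
scores$Band <- cut(scores$Score,
                   breaks = c(-Inf, 50, 60, Inf),
                   labels = c("Low", "Medium", "High"),
                   right = FALSE)
print(scores)
```

cut() returns a factor, so the new column is ready for table() and boxplot() without further conversion.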
End of Topic 1 — data generation, CSV write/read, cleaning, and processing basics.
R Programming — Plotting with Base R (Topic 2)
Plotting is how you explore and present data. This section uses only base R plotting functions (no ggplot2) so students learn the fundamentals that always work in any R environment. We'll produce several common plots using the cleaned dataset created earlier: histogram, boxplot, scatterplot, barplot, line plot/time-series, pairs plot, and pie chart. Each example includes the code and explanation of why and when to use the plot.
Important base functions covered:
hist(), boxplot(), plot() (scatter and line), barplot(), pie(), and pairs(). We'll also show how to add titles, axis labels, legends, colors (base R default or simple palettes), and use par() to arrange multiple plots in one display.
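As a minimal sketch of par(), colors, and legend() working together, the example below draws two plots side by side. The six-row data frame and the color choices are assumptions made for illustration, and the plot is written to a temporary PDF so the code also runs in a non-interactive session.

```r
# Small stand-in data (hypothetical values)
demo <- data.frame(x = c(48, 52, 61, 45, 57, 50),
                   y = c(35, 62, 71, 44, 58, 39),
                   grp = factor(c("A", "B", "C", "A", "B", "C")))

out_file <- tempfile(fileext = ".pdf")
pdf(out_file, width = 8, height = 4)

old_par <- par(mfrow = c(1, 2))  # 1 row, 2 columns of plots

hist(demo$x, main = "Histogram of x", xlab = "x", col = "grey80")

# One color per group, looked up by group label
cols <- c(A = "steelblue", B = "darkorange", C = "forestgreen")
plot(demo$x, demo$y, pch = 19, col = cols[as.character(demo$grp)],
     main = "y vs x by group", xlab = "x", ylab = "y")
legend("topleft", legend = names(cols), col = cols, pch = 19, title = "Group")

par(old_par)  # restore the previous layout
dev.off()
```

In an interactive session you can drop the pdf() and dev.off() lines and the plots appear in the plot pane instead.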
# Use the cleaned 'data_impute' from Topic 1
# 1) HISTOGRAM of Avg
hist(data_impute$Avg,
     main = "Histogram of Average Score",
     xlab = "Average",
     ylab = "Frequency",
     breaks = 8)
# 2) BOXPLOT of Measure1 by Group
# notch = FALSE: with so few rows per group, notches can extend past the box and R warns
boxplot(Measure1 ~ Group, data = data_impute,
        main = "Measure1 by Group",
        xlab = "Group", ylab = "Measure1",
        notch = FALSE)
# 3) SCATTER PLOT Measure1 vs Measure2 with regression line
plot(data_impute$Measure1, data_impute$Measure2,
     main = "Measure1 vs Measure2",
     xlab = "Measure1", ylab = "Measure2", pch = 19)
# Add linear fit
fit <- lm(Measure2 ~ Measure1, data = data_impute)
abline(fit, lwd = 2)
# 4) BARPLOT: counts per Group
grp_tab <- table(data_impute$Group)
barplot(grp_tab, main = "Count by Group", ylab = "Count", xlab = "Group")
# 5) LINE PLOT: Avg over Date (time-series)
# Order by Date first
ord <- order(data_impute$Date)
plot(data_impute$Date[ord], data_impute$Avg[ord],
     type = "o", main = "Avg over Time", xlab = "Date", ylab = "Avg")
# 6) PAIRS: quick multi-plot to inspect relationships
pairs(data_impute[, c("Measure1","Measure2","Avg")], main = "Pairs plot (Numeric)")
# 7) PIE CHART: proportion of HighAvg
pie(table(data_impute$HighAvg), main = "Proportion High vs Low Avg")
Plot explanations + examples:
- Histogram (`hist()`): Good for checking distribution shape (normal, skewed, multimodal). Use `breaks` to control binning; a single number is treated as a suggested bin count, so R may choose a nearby "pretty" value.
- Boxplot (`boxplot()`): Shows median, quartiles, and outliers. Use formula syntax like `y ~ x` to plot numeric by group.
- Scatter plot + regression (`plot()` + `lm()` + `abline()`): Visualise relationships between two numeric variables and add a fitted line to judge correlation.
- Barplot (`barplot()`): For categorical counts (converted by `table()`), e.g., number of samples in each group.
- Line/time plot (`plot(..., type="o")`): Plot a numeric variable over time; ensure your Date column is of class `Date` and data are ordered by date.
- Pairs plot (`pairs()`): Quick matrix of scatterplots for several numeric variables — great for exploratory data analysis.
- Pie chart (`pie()`): Use sparingly — shows proportions. For accessibility prefer barplot or a table.
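As one possible alternative to a pie chart, the same proportions can be shown as a table or a horizontal barplot. The five flag values below are illustrative assumptions, and the plot goes to a temporary PDF so the sketch runs anywhere.

```r
# Stand-in for a High/Low flag column (hypothetical values)
flags <- factor(c("High", "Low", "Low", "High", "Low"))

# Proportions as a table: often the most accessible presentation
props <- prop.table(table(flags))
print(round(props, 2))

# The same proportions as a horizontal barplot
out_file <- tempfile(fileext = ".pdf")
pdf(out_file)
barplot(props, horiz = TRUE, xlim = c(0, 1),
        main = "Proportion High vs Low", xlab = "Proportion")
dev.off()
```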
Tip: In scripts destined for reproducible reports, save plots to files using base functions like png("plot.png", width=800, height=600); ...; dev.off(). For interactive use, run plotting commands in the console or RStudio plot pane.
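The open-draw-close pattern from the tip looks like this in full. The fallback to pdf() is an added assumption for machines where the PNG device is unavailable (some headless servers); it is not part of the original tip.

```r
plot_file <- tempfile(fileext = ".png")

# Try to open a PNG device; fall back to PDF if PNG is not supported here
opened <- tryCatch({
  png(plot_file, width = 800, height = 600)
  TRUE
}, error = function(e) FALSE)

if (!opened) {
  plot_file <- tempfile(fileext = ".pdf")
  pdf(plot_file)
}

# Everything drawn between opening the device and dev.off() goes to the file
plot(1:10, (1:10)^2, type = "o",
     main = "Saved plot", xlab = "x", ylab = "x squared")

dev.off()  # closes the device and finishes writing the file
print(file.exists(plot_file))
```

Forgetting dev.off() is the usual reason a saved plot file turns out empty or unreadable.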
End of Topic 2 — Base R plotting essentials and examples.
R Code Summary & Helpful Quick Reference
Short cheat-sheet of commands used above (copy/paste friendly). These are base R and work without additional packages.
# Quick reference (base R)
read.csv("file.csv", stringsAsFactors = FALSE)
write.csv(df, "file.csv", row.names = FALSE)
str(df); head(df); summary(df)
is.na(df); colSums(is.na(df))
na.omit(df)
as.numeric(x); as.factor(x); as.Date(x)
hist(x); boxplot(y ~ group, data = df)
plot(x,y); abline(lm(y~x, data=df))
barplot(table(df$group)); pie(table(df$group))
pairs(df[c("num1","num2")])
png("file.png", width=800, height=600); plot(...); dev.off()
Final notes for students:
- Practice by changing the synthetic data generator (means, sd, groups) and observe how plots change.
- Document every cleaning step — keep raw CSV safe and create a cleaned version you use for analysis.
- Use base R plotting for fast exploration; later you can learn advanced visualizations (ggplot2) after mastering fundamentals.
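For the first practice suggestion, one convenient approach is to wrap the generator in a function so parameters are easy to vary. The function name make_sample and its arguments are hypothetical; the body mirrors the generation code from Topic 1.

```r
# Hypothetical helper: same generator as Topic 1, with adjustable parameters
make_sample <- function(n = 15, mean1 = 50, sd1 = 10,
                        groups = c("A", "B", "C"), seed = 42) {
  set.seed(seed)
  data.frame(
    ID = sprintf("S%02d", seq_len(n)),
    Group = sample(groups, n, replace = TRUE),
    Measure1 = round(rnorm(n, mean = mean1, sd = sd1), 1),
    Measure2 = round(runif(n, 30, 80), 1),
    Date = seq(as.Date("2025-01-01"), by = "7 days", length.out = n)
  )
}

base_df <- make_sample()                       # original settings
wide_df <- make_sample(mean1 = 70, sd1 = 20)   # higher centre, more spread

# Compare how the change shows up in the summaries
print(summary(base_df$Measure1))
print(summary(wide_df$Measure1))
```

Passing either data frame to hist() or boxplot() then shows how the shift in mean and spread changes the plots.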
© 2023 Udgam Welfare Foundation. All rights reserved.

