Data Import, Plotting, and Cleaning in R Programming

R Programming — Data Import, Cleaning & Processing

Clear explanations, examples, and ready-to-run R code (CSV generation → import → clean → process).
R data import and cleaning are core skills. Below we generate sample data (15 rows, 5 columns), save as CSV, show how to import it, then demonstrate common cleaning steps: checking types, handling missing values, renaming, converting factors/numerics, and creating derived columns. Each step is explained with base R functions students will use in real tasks.

The example dataset will simulate a small experiment or sales record: an ID column, a categorical group, two numeric measurements, and a date. After import we’ll:
  • inspect structure with str() and head(),
  • detect missing values with is.na(), then treat them by simple imputation or removal,
  • coerce types (as.numeric(), as.factor(), as.Date()),
  • rename columns, and
  • create derived variables using arithmetic or conditional logic (e.g., categorise scores).
These operations are critical because plotting and analysis need clean, correctly-typed data. Example use-cases: exam scores, sensor readings, or simple sales logs.

Detailed Explanation of Data Generation Code

Below is a clear breakdown of how each column of the dataset was generated in R. These explanations help students understand why each function is used and what type of data it creates.
🔹 1. ID = sprintf("S%02d", 1:n)

The sprintf() function formats text and numbers. The pattern "S%02d" means:
  • Start every ID with the letter S
  • %02d = format each number with at least 2 digits (padded with a leading 0 if needed)
Examples produced: S01, S02, S03.

This makes IDs readable and neatly aligned, which is helpful for data management.
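The padding behaviour can be checked with a minimal sketch. The size n = 12 and the extra value 100 are chosen only for illustration and are not part of the tutorial dataset.

```r
# Zero-padded IDs: %02d pads to at least two digits
n <- 12                               # illustrative size, not the tutorial's 15
ids <- sprintf("S%02d", 1:n)
print(ids[c(1, 9, 10, 12)])           # "S01" "S09" "S10" "S12"

# The width is a minimum, not a maximum: longer numbers are not truncated
print(sprintf("S%02d", 100))          # "S100"

# Padded IDs sort in numeric order as text; unpadded ones do not
print(sort(c("S10", "S2", "S1")))     # "S1" "S10" "S2"
print(sort(c("S10", "S02", "S01")))   # "S01" "S02" "S10"
```

If a dataset could grow past 99 rows, a wider pattern such as "S%03d" would keep the IDs aligned.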
🔹 2. Group = sample(c("A","B","C"), n, replace = TRUE)

This line randomly assigns each row to one of three categories: A, B, or C.
  • sample() picks random values from a vector.
  • replace = TRUE allows the same category to appear multiple times.
This is commonly used to simulate group labels in real-world datasets.
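A short sketch of how replace = TRUE matters here. The seed and the prob weights are arbitrary choices made for this illustration.

```r
set.seed(1)                                   # arbitrary seed, for repeatability only
groups <- sample(c("A", "B", "C"), 15, replace = TRUE)
print(table(groups))                          # counts per label; usually unequal

# Without replacement, asking for 15 draws from 3 values is an error
res <- tryCatch(sample(c("A", "B", "C"), 15),
                error = function(e) "error")
print(res)                                    # "error"

# Optional: prob= makes some labels more likely (weights here are made up)
skewed <- sample(c("A", "B", "C"), 15, replace = TRUE,
                 prob = c(0.6, 0.3, 0.1))
print(table(skewed))
```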
🔹 3. Measure1 = round(rnorm(n, mean=50, sd=10), 1)

The rnorm() function generates normally distributed random numbers.
  • mean = 50 → center of distribution
  • sd = 10 → spread/variation
  • round(...,1) → round values to 1 decimal place
Example values: 43.2, 55.7, 49.9.

This is ideal for simulating measurement data such as exam scores or sensor readings.
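To see what mean and sd control, it can help to draw a larger sample than the tutorial's 15 rows. The sample size of 1000 below is an assumption made only so the summary statistics settle near their targets.

```r
set.seed(42)
x <- round(rnorm(1000, mean = 50, sd = 10), 1)   # 1000 draws, for illustration

print(mean(x))                 # close to 50
print(sd(x))                   # close to 10

# Roughly 95% of normal values fall within two standard deviations of the mean
print(mean(x > 30 & x < 70))
```

With only 15 draws, as in the tutorial dataset, the sample mean and standard deviation can sit noticeably further from 50 and 10.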
🔹 4. Measure2 = round(runif(n, 30, 80), 1)

The runif() function generates values from a uniform distribution between 30 and 80.

Example values: 31.4, 72.8, 58.3.

This is often used when values should be equally likely across a range, such as temperature or random test scores.
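A quick sketch of the "equally likely" property, again using an illustrative sample of 1000 values rather than the tutorial's 15.

```r
set.seed(42)
u <- round(runif(1000, min = 30, max = 80), 1)

print(range(u))                # stays within 30 and 80

# Equal-width bins should hold roughly similar counts
bins <- cut(u, breaks = seq(30, 80, by = 10), include.lowest = TRUE)
print(table(bins))
```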
🔹 5. Date = seq(as.Date("2025-01-01"), by = "7 days", length.out = n)

This creates a sequence of dates:
  • Starting from 2025-01-01
  • Incrementing by 7 days (weekly)
  • Total of n dates
Example sequence:
2025-01-01, 2025-01-08, 2025-01-15, …

This is useful for time-based datasets such as weekly sales, observations, or experimental timelines.
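The date sequence can be inspected on its own. This sketch uses length.out = 5 instead of n to keep the output short, and adds a by = "month" variant that is not used in the tutorial.

```r
dates <- seq(as.Date("2025-01-01"), by = "7 days", length.out = 5)
print(dates)          # "2025-01-01" "2025-01-08" "2025-01-15" "2025-01-22" "2025-01-29"
print(class(dates))   # "Date"
print(diff(dates))    # every gap is 7 days

# by = "month" steps by calendar month instead of a fixed number of days
monthly <- seq(as.Date("2025-01-01"), by = "month", length.out = 3)
print(monthly)        # "2025-01-01" "2025-02-01" "2025-03-01"
```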
    # 1) GENERATE SAMPLE DATA (5 columns x 15 rows) - run in R
    set.seed(42)
    n <- 15
    df <- data.frame(
      ID       = sprintf("S%02d", 1:n),                                      # ID: character
      Group    = sample(c("A","B","C"), n, replace = TRUE),                  # Group: categorical
      Measure1 = round(rnorm(n, mean=50, sd=10),1),                          # numeric
      Measure2 = round(runif(n, 30, 80),1),                                  # numeric
      Date     = seq(as.Date("2025-01-01"), by = "7 days", length.out = n)   # Date
    )

    # Introduce some NAs for cleaning examples
    df$Measure1[c(3,9)] <- NA
    df$Group[5] <- NA

    # Write to CSV in working directory
    write.csv(df, file = "sample_data_rstudy.csv", row.names = FALSE)

    # Check file created
    list.files(pattern = "sample_data_rstudy.csv")
Explanation of the R code above:
  • set.seed() ensures reproducible random numbers (important for exercises).
  • We build a data.frame with 5 columns and 15 rows: ID, Group, two numeric measures, and a date column.
  • write.csv(..., row.names = FALSE) writes a CSV without R row numbers — that makes the CSV clean and portable.
  • We intentionally insert a few NAs to show cleaning steps later.
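The effect of row.names = FALSE is easiest to see by writing the same small data frame both ways. This is a sketch using temporary files and a two-row data frame invented for the purpose.

```r
df_small <- data.frame(ID = c("S01", "S02"), Measure1 = c(43.2, NA))

f_with    <- tempfile(fileext = ".csv")
f_without <- tempfile(fileext = ".csv")
write.csv(df_small, f_with)                         # default: row names are written
write.csv(df_small, f_without, row.names = FALSE)

print(readLines(f_with))      # each line starts with an extra row-name field
print(readLines(f_without))   # only the real columns; missing values appear as NA

# Re-importing the first file produces an unwanted extra column named "X"
print(names(read.csv(f_with)))      # "X" "ID" "Measure1"
print(names(read.csv(f_without)))   # "ID" "Measure1"
```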
    # 2) IMPORT CSV
    # Use read.csv() which is a base-R function
    data_in <- read.csv("sample_data_rstudy.csv", stringsAsFactors = FALSE)

    # Quick checks
    head(data_in)
    str(data_in)
    summary(data_in)
Import notes:
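  • read.csv() imports CSVs. Setting stringsAsFactors = FALSE stops strings being converted to factors automatically, so you decide which columns become factors. Since R 4.0.0 this is already the default; stating it explicitly keeps the script consistent on older versions.
  • head() shows the first rows. str() reveals column types (character, numeric, etc.). summary() reports min, quartiles, median, mean, max, and the NA count for numeric columns; for character columns it reports only length and class, so convert them to factors to see counts per category.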
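The following sketch shows what arrives after import, using a three-row inline CSV invented for the example (read through the text argument so no file is needed). The na.strings setting is an optional extra that also treats empty text fields as missing.

```r
csv_text <- "ID,Group,Measure1,Date
S01,A,43.2,2025-01-01
S02,,55.7,2025-01-08
S03,B,NA,2025-01-15"

d <- read.csv(text = csv_text, stringsAsFactors = FALSE,
              na.strings = c("NA", ""))

print(sapply(d, class))           # Date arrives as character, not Date
print(summary(d$Group))           # character: length, class, mode only
print(summary(factor(d$Group)))   # factor: counts per level plus NA's
print(colSums(is.na(d)))          # one NA in Group, one in Measure1
```

This is why the cleaning step converts Date with as.Date() and Group with as.factor() before any plotting.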
Next we clean types and missing values so plotting and numeric summaries behave correctly.
    # 3) CLEANING & PROCESSING
    # Convert types
    data_in$ID <- as.character(data_in$ID)
    data_in$Group <- as.factor(data_in$Group)          # treat as factor (category)
    data_in$Date <- as.Date(data_in$Date)              # convert date column
    data_in$Measure1 <- as.numeric(data_in$Measure1)   # ensure numeric
    data_in$Measure2 <- as.numeric(data_in$Measure2)

    # Detect missing values
    colSums(is.na(data_in))   # shows count of NAs per column

    # Simple strategies:
    # a) Remove rows with NAs:
    data_dropna <- na.omit(data_in)

    # b) Impute missing numeric values with mean (example for Measure1)
    mean_m1 <- mean(data_in$Measure1, na.rm = TRUE)
    data_impute <- data_in
    data_impute$Measure1[is.na(data_impute$Measure1)] <- round(mean_m1,1)

    # c) Fill missing Group with "Unknown"
    data_impute$Group <- as.character(data_impute$Group)
    data_impute$Group[is.na(data_impute$Group) | data_impute$Group==""] <- "Unknown"
    data_impute$Group <- as.factor(data_impute$Group)

    # Derived column: average of measures and a categorical flag
    data_impute$Avg <- round((data_impute$Measure1 + data_impute$Measure2)/2,1)
    data_impute$HighAvg <- ifelse(data_impute$Avg >= 55, "High", "Low")
    data_impute$HighAvg <- as.factor(data_impute$HighAvg)

    # Final check
    str(data_impute)
    head(data_impute)
Cleaning explanation and tips:
  • Always confirm column types with str(). Dates must be Date objects for time series plotting.
  • Handle missing values deliberately: removal (na.omit()) is simple but may bias results; imputation (mean/median or domain-specific) preserves row count, although filling with a single value makes the column look less variable than it really is.
  • Converting categories to factors (as.factor()) is useful for grouping, table counts, and plotting categories.
  • Creating derived features (like Avg) is commonly needed before plotting or modeling.
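As a sketch of the trade-offs in the tips above, the example below compares mean and median imputation on a small made-up vector with one large value, then uses cut() as a multi-band alternative to ifelse(). The values and break points are assumptions chosen for illustration.

```r
x <- c(43.2, 55.7, NA, 61.0, 38.4, NA, 95.0)   # made-up values, one large outlier

x_mean   <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)
x_median <- ifelse(is.na(x), median(x, na.rm = TRUE), x)

# The outlier pulls the mean above the median
print(c(mean = mean(x, na.rm = TRUE), median = median(x, na.rm = TRUE)))

# Filling gaps with a constant shrinks the spread of the column
print(c(observed = sd(x, na.rm = TRUE), mean_filled = sd(x_mean)))

# cut() generalises ifelse() to more than two bands (break points are arbitrary)
band <- cut(x_median, breaks = c(-Inf, 45, 60, Inf),
            labels = c("Low", "Mid", "High"))
print(table(band))                              # Low 2, Mid 3, High 2
```

Median imputation is often the safer default when a column contains outliers, since the fill value is not dragged toward them.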

R Programming — Plotting with Base R (Topic 2)

Plotting is how you explore and present data. This section uses only base R plotting functions (no ggplot2) so students learn the fundamentals that always work in any R environment. We'll produce several common plots using the cleaned dataset created earlier: histogram, boxplot, scatterplot, barplot, line plot/time-series, pairs plot, and pie chart. Each example includes the code and explanation of why and when to use the plot.

Important base functions covered: hist(), boxplot(), plot() (scatter and line), barplot(), pie(), and pairs(). Titles and axis labels are set with the main, xlab, and ylab arguments. Legends (legend()), colors (the col argument, with base R defaults or simple palettes), and multi-plot layouts (par(mfrow = ...)) build on the same functions.
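Here is a minimal sketch of colors, a legend, and a two-panel layout with par(). It uses a small simulated stand-in data frame d rather than the tutorial's data_impute, and the three color names are arbitrary choices.

```r
set.seed(42)
d <- data.frame(                                  # stand-in data; values are simulated
  Group    = factor(rep(c("A", "B", "C"), each = 5)),
  Measure1 = round(rnorm(15, mean = 50, sd = 10), 1),
  Measure2 = round(runif(15, 30, 80), 1)
)
cols <- c("steelblue", "darkorange", "forestgreen")   # one color per group level

old_par <- par(mfrow = c(1, 2))   # 1 row, 2 columns; remember the old settings

# Left panel: scatter plot colored by group, with a legend
plot(d$Measure1, d$Measure2, col = cols[as.integer(d$Group)], pch = 19,
     main = "Measure1 vs Measure2", xlab = "Measure1", ylab = "Measure2")
legend("topright", legend = levels(d$Group), col = cols, pch = 19, title = "Group")

# Right panel: boxplot using the same colors
boxplot(Measure1 ~ Group, data = d, col = cols,
        main = "Measure1 by Group", xlab = "Group", ylab = "Measure1")

par(old_par)                      # restore the single-plot layout
```

Saving the value returned by par() and restoring it afterwards keeps later plots from being squeezed into the two-panel layout by accident.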
    # Use the cleaned 'data_impute' from Topic 1

    # 1) HISTOGRAM of Avg
    hist(data_impute$Avg, main = "Histogram of Average Score",
         xlab = "Average", ylab = "Frequency", breaks = 8)

    # 2) BOXPLOT of Measure1 by Group
    # notch = FALSE: notches are unreliable with only a few points per group
    # and trigger a warning on a dataset this small
    boxplot(Measure1 ~ Group, data = data_impute, main = "Measure1 by Group",
            xlab = "Group", ylab = "Measure1", notch = FALSE)

    # 3) SCATTER PLOT Measure1 vs Measure2 with regression line
    plot(data_impute$Measure1, data_impute$Measure2, main = "Measure1 vs Measure2",
         xlab = "Measure1", ylab = "Measure2", pch = 19)
    # Add linear fit
    fit <- lm(Measure2 ~ Measure1, data = data_impute)
    abline(fit, lwd = 2)

    # 4) BARPLOT: counts per Group
    grp_tab <- table(data_impute$Group)
    barplot(grp_tab, main = "Count by Group", ylab = "Count", xlab = "Group")

    # 5) LINE PLOT: Avg over Date (time-series)
    # Order by Date first
    ord <- order(data_impute$Date)
    plot(data_impute$Date[ord], data_impute$Avg[ord], type = "o",
         main = "Avg over Time", xlab = "Date", ylab = "Avg")

    # 6) PAIRS: quick multi-plot to inspect relationships
    pairs(data_impute[, c("Measure1","Measure2","Avg")], main = "Pairs plot (Numeric)")

    # 7) PIE CHART: proportion of HighAvg
    pie(table(data_impute$HighAvg), main = "Proportion High vs Low Avg")
Plot explanations + examples:
  • Histogram (`hist()`): Good for checking distribution shape (normal, skewed, multimodal). Use `breaks` to influence binning: a single number is treated as a suggested number of bins, while a vector sets the exact bin edges.
  • Boxplot (`boxplot()`): Shows median, quartiles, and outliers. Use formula syntax like `y ~ x` to plot numeric by group.
  • Scatter plot + regression (`plot()` + `lm()` + `abline()`): Visualise the relationship between two numeric variables and add a fitted line to judge the direction and rough strength of a linear trend. For a numeric measure of correlation, use `cor()`.
  • Barplot (`barplot()`): For categorical counts (converted by `table()`), e.g., number of samples in each group.
  • Line/time plot (`plot(..., type="o")`): Plot a numeric variable over time; ensure your Date column is of class `Date` and data are ordered by date.
  • Pairs plot (`pairs()`): Quick matrix of scatterplots for several numeric variables — great for exploratory data analysis.
  • Pie chart (`pie()`): Use sparingly. It shows proportions, but angles are harder to compare than bar heights, so prefer a barplot or a table when readers need to compare categories.
Example interpretation: if the boxplot shows Group B has higher median `Measure1`, you might inspect Group B rows for experimental differences or verify if a confounder exists.
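A numeric check can back up what the boxplot suggests. The sketch below uses a nine-row data frame with made-up values in which Group B happens to be higher; it is not output from the tutorial dataset.

```r
d <- data.frame(                                   # made-up values for illustration
  Group    = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
  Measure1 = c(41.0, 48.5, 52.3, 58.1, 62.4, 66.0, 45.2, 50.0, 55.9)
)

med <- tapply(d$Measure1, d$Group, median)   # median of Measure1 per group
print(med)                                   # A 48.5, B 62.4, C 50.0
print(table(d$Group))                        # group sizes: small groups deserve caution

# Pull out the rows of the group that stands out for a closer look
print(d[d$Group == names(which.max(med)), ])
```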
Tip: In scripts destined for reproducible reports, save plots to files using base functions like png("plot.png", width=800, height=600); ...; dev.off(). For interactive use, run plotting commands in the console or RStudio plot pane.

R Code Summary & Helpful Quick Reference

Short cheat-sheet of commands used above (copy/paste friendly). These are base R and work without additional packages.
    # Quick reference (base R)
    read.csv("file.csv", stringsAsFactors = FALSE)
    write.csv(df, "file.csv", row.names = FALSE)
    str(df); head(df); summary(df)
    is.na(df); colSums(is.na(df))
    na.omit(df)
    as.numeric(x); as.factor(x); as.Date(x)
    hist(x); boxplot(y ~ group, data = df)
    plot(x,y); abline(lm(y~x, data=df))
    barplot(table(df$group)); pie(table(df$group))
    pairs(df[c("num1","num2")])
    png("file.png", width=800, height=600); plot(...); dev.off()
Final notes for students:
  1. Practice by changing the synthetic data generator (means, sd, groups) and observe how plots change.
  2. Document every cleaning step — keep raw CSV safe and create a cleaned version you use for analysis.
  3. Use base R plotting for fast exploration; once the fundamentals are solid, move on to more advanced visualization packages such as ggplot2.
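The second note, keeping the raw CSV safe, can be sketched as follows. The file names and the three-row data frame are hypothetical, and temporary files stand in for a real project folder.

```r
raw_file   <- file.path(tempdir(), "raw_example.csv")     # hypothetical file names
clean_file <- file.path(tempdir(), "clean_example.csv")

# Pretend this is the raw data as delivered
raw <- data.frame(ID = c("S01", "S02", "S03"), Measure1 = c(43.2, NA, 55.7))
write.csv(raw, raw_file, row.names = FALSE)

# Clean a copy in memory and write it to a different file
clean <- read.csv(raw_file)
clean$Measure1[is.na(clean$Measure1)] <- round(mean(clean$Measure1, na.rm = TRUE), 1)
write.csv(clean, clean_file, row.names = FALSE)

# The raw file still has its missing value; the cleaned file does not
print(sum(is.na(read.csv(raw_file)$Measure1)))     # 1
print(sum(is.na(read.csv(clean_file)$Measure1)))   # 0
```

Writing the cleaned data to a new file means every cleaning decision can be re-run or revised from the untouched original.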
