🎯 Topic: How to Filter Data in R Programming (Base R)
Overview
Filtering is the process of selecting rows from a dataset that meet one or more criteria. In base R, the usual tools are logical indexing (df[condition, ]), subset(), which() for row indices, pattern matching with grepl(), membership testing with %in%, and missing-value handling with complete.cases() or is.na(). With these you can select rows on a single condition (e.g., Age >= 15), combine conditions with & and | (element-wise AND/OR), negate a condition with !, match any of several values (e.g., Gender %in% c('M','F')), and extract rows that match a text pattern (e.g., names that start with 'A'). All of these functions ship with R itself, so no external packages are required, which makes them a good fit for environments where additional libraries are unavailable. Below you'll find a 30-row dataset called class_data, explained step by step, followed by concrete filter examples that use only that dataset. Each code chunk includes a brief explanation so students understand what the code does, why it works, and how to interpret the output.
Dataset: class_data (6 columns, 30 rows)
The sample dataset class_data simulates a small classroom survey with columns: ID, Name, Age, Gender, math_score, status (Pass/Fail). Use only this dataset for the examples below.
R code: Create class_data (base R only)
# Create a reproducible 30-row dataset
set.seed(2025)
ID <- 1:30
Name <- paste0("Stu", sprintf("%02d", ID))
Age <- sample(12:18, 30, replace = TRUE)
Gender <- sample(c("M","F"), 30, replace = TRUE)
math_score <- pmin(pmax(round(rnorm(30, mean = 70, sd = 12)), 0), 100)
status <- ifelse(math_score >= 60, "Pass", "Fail")
class_data <- data.frame(ID, Name, Age, Gender, math_score, status, stringsAsFactors = FALSE)
# Quick check
dim(class_data) # should be 30 6
head(class_data, 6)
Explanation of the dataset
class_data has 30 rows. ID is an integer identifier; Name is a short label; Age is integer (12–18); Gender is 'M' or 'F'; math_score is numeric (0–100); status is 'Pass' or 'Fail' derived from math_score. All filtering examples below use only class_data.
Example 1 — Simple logical filter (rows where math_score ≥ 80)
# Rows where math_score >= 80
high_scorers <- class_data[class_data$math_score >= 80, ]
print(high_scorers)
Explanation: Subsetting with df[rows, cols] uses a logical vector for rows. The condition class_data$math_score >= 80 returns TRUE for rows that meet the criterion; those rows are selected. The result is a data frame of students scoring 80 or above.
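To see why this works, it can help to look at the logical vector on its own. The sketch below rebuilds class_data with the same code as above so that it runs standalone; mask is only an illustrative variable name.

```r
# Recreate class_data exactly as defined earlier so this block is standalone
set.seed(2025)
ID <- 1:30
Name <- paste0("Stu", sprintf("%02d", ID))
Age <- sample(12:18, 30, replace = TRUE)
Gender <- sample(c("M", "F"), 30, replace = TRUE)
math_score <- pmin(pmax(round(rnorm(30, mean = 70, sd = 12)), 0), 100)
status <- ifelse(math_score >= 60, "Pass", "Fail")
class_data <- data.frame(ID, Name, Age, Gender, math_score, status,
                         stringsAsFactors = FALSE)

# The condition alone is a logical vector with one element per row
mask <- class_data$math_score >= 80
length(mask)   # 30, one TRUE/FALSE per student
sum(mask)      # number of rows that will be kept (TRUE counts as 1)

# Indexing with the mask keeps exactly the TRUE rows
high_scorers <- class_data[mask, ]
nrow(high_scorers) == sum(mask)   # TRUE
```

Checking sum(mask) before filtering is a quick way to confirm that a condition selects roughly the number of rows you expect.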
Example 2 — Multiple conditions (Age ≥ 15 AND status == "Pass")
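# Age >= 15 and Pass
# WRONG: && expects single TRUE/FALSE values, not whole columns
# older_pass <- class_data[class_data$Age >= 15 && class_data$status == "Pass", ]
# Correct version using &
older_pass <- class_data[class_data$Age >= 15 & class_data$status == "Pass", ]
print(older_pass)
Explanation: Use the element-wise logical operators & and | (not && or ||) when filtering. && and || are meant for single TRUE/FALSE values, as in if() conditions; since R 4.3.0 they raise an error when given a vector longer than one, and older versions used only the first element. The & version selects rows where both conditions are TRUE.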
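The difference between the operators is easiest to see on short vectors. The following sketch uses three hand-written, hypothetical students rather than class_data so that every result can be checked by eye; the variable names are illustrative.

```r
# Three hypothetical students, written out by hand
age    <- c(14, 16, 17)
status <- c("Pass", "Fail", "Pass")

both     <- age >= 15 & status == "Pass"   # FALSE FALSE  TRUE
either   <- age >= 15 | status == "Pass"   #  TRUE  TRUE  TRUE
not_pass <- !(status == "Pass")            # FALSE  TRUE FALSE

# && is for single values. On vectors it is an error in R >= 4.3.0;
# tryCatch() keeps the script running whichever R version is used.
scalar_try <- tryCatch(
  age >= 15 && status == "Pass",
  error   = function(e) "error: && needs length-one operands",
  warning = function(w) "warning: only the first element was used"
)
print(scalar_try)
```

What print(scalar_try) shows depends on the installed R version, which is itself a good reason to keep && out of filters.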
Example 3 — Use subset() for readable filters
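# Using subset() to get female students who failed
female_failed <- subset(class_data, Gender == "F" & status == "Fail")
print(female_failed)
Explanation: subset() is convenient and readable: subset(df, condition, select = c(...)). When the condition contains no missing values it returns the same rows as df[df$... , ] with less typing. Unlike bracket indexing, it drops rows where the condition evaluates to NA. R's own documentation recommends it mainly for interactive use, because its non-standard evaluation of column names can behave unexpectedly inside functions.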
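A short sketch of two subset() details: the select argument, and how missing values are treated compared with bracket indexing. It rebuilds class_data so it runs on its own; tmp, by_bracket and by_subset are illustrative names, and the NA values are injected only for the demonstration.

```r
# Recreate class_data exactly as defined earlier so this block is standalone
set.seed(2025)
ID <- 1:30
Name <- paste0("Stu", sprintf("%02d", ID))
Age <- sample(12:18, 30, replace = TRUE)
Gender <- sample(c("M", "F"), 30, replace = TRUE)
math_score <- pmin(pmax(round(rnorm(30, mean = 70, sd = 12)), 0), 100)
status <- ifelse(math_score >= 60, "Pass", "Fail")
class_data <- data.frame(ID, Name, Age, Gender, math_score, status,
                         stringsAsFactors = FALSE)

# select = keeps only the named columns (unquoted names are allowed here)
female_failed <- subset(class_data, Gender == "F" & status == "Fail",
                        select = c(ID, Name, math_score))

# NA handling: work on a copy with two missing scores
tmp <- class_data
tmp$math_score[c(2, 5)] <- NA

by_bracket <- tmp[tmp$math_score >= 60, ]     # NA condition gives all-NA rows
by_subset  <- subset(tmp, math_score >= 60)   # NA condition drops the row

nrow(by_bracket) - nrow(by_subset)   # 2, the two all-NA rows
```

If you prefer bracket indexing but want the subset() behaviour, wrapping the condition in which() gives the same rows.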
Example 4 — Use which() to get row indices first
# Which rows have Age == 14?
rows_age14 <- which(class_data$Age == 14)
class_data[rows_age14, ]
Explanation: which() returns integer indices of TRUE elements, useful when you need row numbers (for reporting or further indexing).
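Two properties of which() are worth seeing in code: it skips NA, and an empty result behaves badly with negative indexing. This is a sketch on a rebuilt class_data; the age 99 is a deliberately impossible value chosen for illustration.

```r
# Recreate class_data exactly as defined earlier so this block is standalone
set.seed(2025)
ID <- 1:30
Name <- paste0("Stu", sprintf("%02d", ID))
Age <- sample(12:18, 30, replace = TRUE)
Gender <- sample(c("M", "F"), 30, replace = TRUE)
math_score <- pmin(pmax(round(rnorm(30, mean = 70, sd = 12)), 0), 100)
status <- ifelse(math_score >= 60, "Pass", "Fail")
class_data <- data.frame(ID, Name, Age, Gender, math_score, status,
                         stringsAsFactors = FALSE)

rows_age14 <- which(class_data$Age == 14)   # integer positions of TRUE

# which() ignores NA, so row 3 can never be returned here
tmp <- class_data
tmp$Age[3] <- NA
rows_tmp <- which(tmp$Age == 14)

# Pitfall: negative indexing with an empty which() result
none <- which(class_data$Age == 99)         # integer(0): nobody is 99
nrow(class_data[-none, ])                   # 0 rows, not 30
nrow(class_data[class_data$Age != 99, ])    # 30 rows, as intended
```

For "everything except" filters, negating the logical condition is therefore safer than df[-which(...), ].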
Example 5 — Filter by multiple values using %in%
# Select students whose Age is 13, 15 or 17
sel_ages <- class_data[class_data$Age %in% c(13, 15, 17), ]
print(sel_ages)
Explanation: %in% checks membership in a vector—handy for selecting several possible values without long OR-chains.
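A common mistake is to write == where %in% is meant. The sketch below contrasts the two on a rebuilt class_data and also shows how to negate a membership test; wanted, in_mask and eq_mask are illustrative names.

```r
# Recreate class_data exactly as defined earlier so this block is standalone
set.seed(2025)
ID <- 1:30
Name <- paste0("Stu", sprintf("%02d", ID))
Age <- sample(12:18, 30, replace = TRUE)
Gender <- sample(c("M", "F"), 30, replace = TRUE)
math_score <- pmin(pmax(round(rnorm(30, mean = 70, sd = 12)), 0), 100)
status <- ifelse(math_score >= 60, "Pass", "Fail")
class_data <- data.frame(ID, Name, Age, Gender, math_score, status,
                         stringsAsFactors = FALSE)

wanted <- c(13, 15, 17)

# %in% asks, for each Age, "is this value anywhere in wanted?"
in_mask <- class_data$Age %in% wanted

# == recycles wanted along the column (13, 15, 17, 13, 15, ...) and
# compares position by position, which is rarely what is intended
eq_mask <- class_data$Age == wanted

sum(in_mask)   # rows matched by membership
sum(eq_mask)   # never more than sum(in_mask), usually fewer

# Negate the whole membership test to get the remaining rows
sel_ages   <- class_data[in_mask, ]
other_ages <- class_data[!in_mask, ]
```

Because 30 is a multiple of 3, the == version does not even produce a recycling warning here, so the error would pass silently.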
Example 6 — Pattern matching with grepl() (filter names starting with 'Stu0')
# Names starting with "Stu0" (the first nine students, Stu01 to Stu09)
mask <- grepl("^Stu0", class_data$Name)
class_data[mask, ]
Explanation: grepl() returns TRUE for pattern matches. This selects rows where the name begins with "Stu0".
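A few variations on the same idea, sketched on the Name column alone (built with the same expression used for class_data, so the results do not depend on the random seed). The variable names are illustrative.

```r
# Same Name values as in class_data: "Stu01" ... "Stu30"
Name <- paste0("Stu", sprintf("%02d", 1:30))

# ^ anchors the pattern at the start of the string
starts_0 <- grepl("^Stu0", Name)
sum(starts_0)                              # 9 (Stu01 to Stu09)

# startsWith() does the same job without regular-expression syntax
identical(starts_0, startsWith(Name, "Stu0"))   # TRUE

# $ anchors the pattern at the end of the string
Name[grepl("5$", Name)]                    # "Stu05" "Stu15" "Stu25"

# ignore.case = TRUE makes the match case-insensitive
any_case <- grepl("^stu0", Name, ignore.case = TRUE)

# fixed = TRUE treats the pattern as literal text: "." means a real dot,
# not "any character", so nothing matches here
regex_dot   <- grepl(".", Name)                 # all TRUE
literal_dot <- grepl(".", Name, fixed = TRUE)   # all FALSE
```

Use fixed = TRUE whenever the search text contains characters such as ., ( or + that have a special meaning in regular expressions.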
Example 7 — Remove rows with missing values using complete.cases()
# Introduce NA for demonstration (do not modify original)
tmp <- class_data
tmp$math_score[c(2, 5)] <- NA
# Keep only complete rows (no NA in any column)
complete_rows <- tmp[complete.cases(tmp), ]
print(complete_rows)
Explanation: complete.cases() returns TRUE for rows without any NA values. Use it to clean datasets before modeling or summarizing.
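When only one column matters, is.na() is the narrower tool. The sketch below compares the two approaches on a copy of class_data; the missing values in rows 2, 5 and 7 are injected purely for illustration.

```r
# Recreate class_data exactly as defined earlier so this block is standalone
set.seed(2025)
ID <- 1:30
Name <- paste0("Stu", sprintf("%02d", ID))
Age <- sample(12:18, 30, replace = TRUE)
Gender <- sample(c("M", "F"), 30, replace = TRUE)
math_score <- pmin(pmax(round(rnorm(30, mean = 70, sd = 12)), 0), 100)
status <- ifelse(math_score >= 60, "Pass", "Fail")
class_data <- data.frame(ID, Name, Age, Gender, math_score, status,
                         stringsAsFactors = FALSE)

tmp <- class_data
tmp$math_score[c(2, 5)] <- NA   # two missing scores
tmp$Gender[7] <- NA             # one missing gender

colSums(is.na(tmp))             # count of NA per column

# Only require a score: row 7 is kept even though Gender is missing
has_score <- tmp[!is.na(tmp$math_score), ]
nrow(has_score)                 # 28

# Require every column to be present
fully_complete <- tmp[complete.cases(tmp), ]
nrow(fully_complete)            # 27

# complete.cases() can also be limited to chosen columns
score_ok <- tmp[complete.cases(tmp[, c("ID", "math_score")]), ]
nrow(score_ok)                  # 28
```

Restricting the check to the columns an analysis actually uses avoids discarding rows unnecessarily.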
Example 8 — Select specific columns while filtering
# Filter Pass students and only show ID, Name, math_score
pass_summary <- class_data[class_data$status == "Pass", c("ID","Name","math_score")]
head(pass_summary)
Explanation: The second argument of [rows, cols] can select specific columns by name or numeric index—useful for concise output.
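One detail of column selection that often surprises: asking for a single column returns a plain vector unless drop = FALSE is given. A sketch on a rebuilt class_data, with illustrative variable names:

```r
# Recreate class_data exactly as defined earlier so this block is standalone
set.seed(2025)
ID <- 1:30
Name <- paste0("Stu", sprintf("%02d", ID))
Age <- sample(12:18, 30, replace = TRUE)
Gender <- sample(c("M", "F"), 30, replace = TRUE)
math_score <- pmin(pmax(round(rnorm(30, mean = 70, sd = 12)), 0), 100)
status <- ifelse(math_score >= 60, "Pass", "Fail")
class_data <- data.frame(ID, Name, Age, Gender, math_score, status,
                         stringsAsFactors = FALSE)

passed <- class_data$status == "Pass"

# Columns by position: 1 = ID, 2 = Name, 5 = math_score
by_pos <- class_data[passed, c(1, 2, 5)]

# A single column collapses to a vector by default
scores_vec <- class_data[passed, "math_score"]
is.data.frame(scores_vec)   # FALSE

# drop = FALSE keeps the one-column data frame
scores_df <- class_data[passed, "math_score", drop = FALSE]
is.data.frame(scores_df)    # TRUE
```

Selecting by name is usually more robust than selecting by position, since positions change when columns are added or reordered.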
Tips and common pitfalls
- Use & and | for element-wise logicals; avoid && and || when filtering.
- Check lengths of logical vectors; a recycled shorter vector can produce unexpected results.
- Prefer explicit checks before filtering (e.g., confirm column existence with "math_score" %in% names(class_data)).
- Use which() when you need row numbers for further processing or reports.
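The recycling pitfall from the tips above can be sketched as follows. filter_rows() is a hypothetical helper written for this illustration, not a base R function; it simply refuses masks of the wrong length.

```r
# Recreate class_data exactly as defined earlier so this block is standalone
set.seed(2025)
ID <- 1:30
Name <- paste0("Stu", sprintf("%02d", ID))
Age <- sample(12:18, 30, replace = TRUE)
Gender <- sample(c("M", "F"), 30, replace = TRUE)
math_score <- pmin(pmax(round(rnorm(30, mean = 70, sd = 12)), 0), 100)
status <- ifelse(math_score >= 60, "Pass", "Fail")
class_data <- data.frame(ID, Name, Age, Gender, math_score, status,
                         stringsAsFactors = FALSE)

# A length-2 mask is silently recycled: every other row is returned
every_other <- class_data[c(TRUE, FALSE), ]
nrow(every_other)   # 15

# Hypothetical helper: check the mask before using it
filter_rows <- function(df, mask) {
  stopifnot(is.logical(mask), length(mask) == nrow(df))
  df[which(mask), , drop = FALSE]   # which() also discards NA conditions
}

passed <- filter_rows(class_data, class_data$status == "Pass")

# A mask of the wrong length now stops with an error instead of recycling
result <- try(filter_rows(class_data, c(TRUE, FALSE)), silent = TRUE)
inherits(result, "try-error")   # TRUE
```

A guard like this is most useful when the mask is built in a different place from where it is applied, for example from another data frame.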
Practice Exercises (Self-assessment)
- Using class_data, produce a data frame of students aged 16 or older who passed. Show code and resulting rows.
- Find all male students with math_score between 65 and 85 (inclusive). Provide the R code and output.
- Use subset() to list students whose Name contains '05' (pattern matching). Show code and result.
- Demonstrate how to safely attempt to filter on a non-existent column (e.g., height) and print a user-friendly message instead of an error.
- Create a filtered summary that selects rows with no missing values and displays only ID, Name, math_score. Show code and a short explanation.
Answer Format (How to present answers)
## Exercise #n — Short title
# R code
...R code here...
# Output (printed):
...expected printed output (e.g., print(...), head(...))...
# Short explanation (2-4 sentences)
Explanation...
Example Solutions (Concise)
# Ex 1: Age >= 16 and Pass
subset(class_data, Age >= 16 & status == "Pass")
# Ex 2: Male with math_score between 65 and 85
class_data[class_data$Gender == "M" & class_data$math_score >= 65 & class_data$math_score <= 85, ]
# Ex 3: Name contains '05'
subset(class_data, grepl("05", Name))
# Ex 4: Safe filter on non-existent column
if ("height" %in% names(class_data)) {
  class_data[class_data$height > 150, ]
} else {
  message("Column 'height' not found. No filtering applied.")
}
# Ex 5: Keep complete rows and show columns
tmp <- class_data
tmp$math_score[c(2,5)] <- NA # simulate NA
tmp_clean <- tmp[complete.cases(tmp), c("ID","Name","math_score")]
head(tmp_clean)
Closing notes
Filtering is the core skill for preparing data subsets for analysis and visualization. Practice using logical indexing, subset(), which(), %in%, and grepl() on the class_data dataset until these patterns become second nature. Always check for missing values and verify column names before filtering.

