How to Filter Data in R – Use filter() in dplyr and Base R for Data Selection

🎯 Topic: How to Filter Data in R Programming (Base R)

Overview
Filtering is the process of selecting rows from a dataset that meet one or more criteria. In base R, the main tools are logical indexing (df[condition, ]), subset(), which() for row indices, grepl() for pattern matching, %in% for membership testing, and complete.cases() or is.na() for missing values. With these you can select rows on a single condition (e.g., Age >= 15), combine conditions with & and | (element-wise AND/OR), negate a condition with !, filter by several values at once (e.g., Age %in% c(13, 15, 17)), and extract rows that match a text pattern (e.g., names that start with 'Stu0'). Base R filtering needs no external packages, which makes it a good fit for environments where additional libraries are unavailable. Below is a 30-row dataset called class_data, explained step by step, followed by concrete filter examples that use only that dataset. Each code chunk comes with a brief explanation of what the code does, why it works, and how to interpret the output.

Dataset: class_data (6 columns, 30 rows)

The sample dataset class_data simulates a small classroom survey with columns: ID, Name, Age, Gender, math_score, status (Pass/Fail). Use only this dataset for the examples below.

R code: Create class_data (base R only)

# Create a reproducible 30-row dataset
set.seed(2025)
ID <- 1:30
Name <- paste0("Stu", sprintf("%02d", ID))
Age <- sample(12:18, 30, replace = TRUE)
Gender <- sample(c("M","F"), 30, replace = TRUE)
math_score <- pmin(pmax(round(rnorm(30, mean = 70, sd = 12)), 0), 100)
status <- ifelse(math_score >= 60, "Pass", "Fail")

class_data <- data.frame(ID, Name, Age, Gender, math_score, status, stringsAsFactors = FALSE)

# Quick check
dim(class_data)   # should be 30 6
head(class_data, 6)
  

Explanation of the dataset

class_data has 30 rows. ID is an integer identifier; Name is a short label; Age is integer (12–18); Gender is 'M' or 'F'; math_score is numeric (0–100); status is 'Pass' or 'Fail' derived from math_score. All filtering examples below use only class_data.

Example 1 — Simple logical filter (rows where math_score ≥ 80)

# Rows where math_score >= 80
high_scorers <- class_data[class_data$math_score >= 80, ]
print(high_scorers)
  

Explanation: Subsetting with df[rows, cols] uses a logical vector for rows. The condition class_data$math_score >= 80 returns TRUE for rows that meet the criterion; those rows are selected. The result is a data frame of students scoring 80 or above.
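It can help to look at the logical vector itself before using it as a row index. The sketch below rebuilds class_data with the same seed as above so that it runs on its own; the name mask is only an illustrative choice.

```r
# Rebuild class_data exactly as defined earlier so this block is self-contained
set.seed(2025)
ID <- 1:30
Name <- paste0("Stu", sprintf("%02d", ID))
Age <- sample(12:18, 30, replace = TRUE)
Gender <- sample(c("M", "F"), 30, replace = TRUE)
math_score <- pmin(pmax(round(rnorm(30, mean = 70, sd = 12)), 0), 100)
status <- ifelse(math_score >= 60, "Pass", "Fail")
class_data <- data.frame(ID, Name, Age, Gender, math_score, status, stringsAsFactors = FALSE)

# The condition is a logical vector with one TRUE/FALSE per row
mask <- class_data$math_score >= 80
head(mask)    # first few TRUE/FALSE values
length(mask)  # same as nrow(class_data)
sum(mask)     # how many rows will be kept (TRUE counts as 1)
mean(mask)    # share of rows that will be kept

# Using the stored mask gives the same rows as writing the condition inline
high_scorers <- class_data[mask, ]
```

Checking sum(mask) first is a cheap way to confirm that a filter keeps roughly the number of rows you expect.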

Example 2 — Multiple conditions (Age ≥ 15 AND status == "Pass")

# Age >= 15 and Pass
# WRONG: && is not vectorized, so the next line raises an error in R 4.3.0 and later
# older_pass <- class_data[class_data$Age >= 15 && class_data$status == "Pass", ]
# Correct version using &
older_pass <- class_data[class_data$Age >= 15 & class_data$status == "Pass", ]
print(older_pass)
  

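Explanation: Use the element-wise logical operators & and | (not && or ||) when filtering. && and || are meant for single TRUE/FALSE values, as in if() conditions; since R 4.3.0 they raise an error when given a longer vector, and earlier versions used only the first element. The & version selects rows where both conditions are TRUE.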
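The same pattern extends to OR (|) and negation (!). The following is a minimal sketch on class_data, rebuilt here with the same seed so the block runs on its own; the thresholds (Age <= 13, math_score >= 85) are arbitrary values picked for illustration.

```r
# Rebuild class_data exactly as defined earlier so this block is self-contained
set.seed(2025)
ID <- 1:30
Name <- paste0("Stu", sprintf("%02d", ID))
Age <- sample(12:18, 30, replace = TRUE)
Gender <- sample(c("M", "F"), 30, replace = TRUE)
math_score <- pmin(pmax(round(rnorm(30, mean = 70, sd = 12)), 0), 100)
status <- ifelse(math_score >= 60, "Pass", "Fail")
class_data <- data.frame(ID, Name, Age, Gender, math_score, status, stringsAsFactors = FALSE)

# OR: keep rows where at least one condition is TRUE
young_or_top <- class_data[class_data$Age <= 13 | class_data$math_score >= 85, ]

# NOT: keep rows where the condition is FALSE
not_pass <- class_data[!(class_data$status == "Pass"), ]

# Negating a combined condition: !(a & b) is the same as !a | !b
a <- class_data$Age >= 15
b <- class_data$status == "Pass"
identical(!(a & b), !a | !b)  # TRUE
```

Wrapping the negated condition in parentheses is optional here, but it makes the intent easier to read.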

Example 3 — Use subset() for readable filters

# Using subset() to get female students who failed
female_failed <- subset(class_data, Gender == "F" & status == "Fail")
print(female_failed)
  

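Explanation: subset() is convenient and readable: subset(df, condition, select = c(...)). Column names can be written bare inside the call, so it reads more cleanly than df[df$... , ] for simple tasks. One difference to keep in mind: subset() drops rows where the condition is NA, whereas logical indexing returns them as all-NA rows. The R documentation describes subset() as a convenience function intended mainly for interactive use.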
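The select argument and the way subset() treats missing values are easiest to see side by side with bracket indexing. This sketch rebuilds class_data with the same seed and, as an assumption for the demonstration, sets two scores to NA in a copy called tmp.

```r
# Rebuild class_data exactly as defined earlier so this block is self-contained
set.seed(2025)
ID <- 1:30
Name <- paste0("Stu", sprintf("%02d", ID))
Age <- sample(12:18, 30, replace = TRUE)
Gender <- sample(c("M", "F"), 30, replace = TRUE)
math_score <- pmin(pmax(round(rnorm(30, mean = 70, sd = 12)), 0), 100)
status <- ifelse(math_score >= 60, "Pass", "Fail")
class_data <- data.frame(ID, Name, Age, Gender, math_score, status, stringsAsFactors = FALSE)

# Work on a copy with two missing scores (rows 2 and 5)
tmp <- class_data
tmp$math_score[c(2, 5)] <- NA

# Bracket indexing: an NA in the condition yields a row filled with NA
by_bracket <- tmp[tmp$math_score >= 60, ]

# subset(): rows where the condition is NA are dropped; select picks columns
by_subset <- subset(tmp, math_score >= 60, select = c(ID, Name, math_score))

nrow(by_bracket) - nrow(by_subset)  # 2: the two all-NA rows
sum(is.na(by_bracket$ID))           # 2
```

If you prefer bracket indexing, adding !is.na(tmp$math_score) & to the condition gives the same rows as subset().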

Example 4 — Use which() to get row indices first

# Which rows have Age == 14?
rows_age14 <- which(class_data$Age == 14)
class_data[rows_age14, ]
  

Explanation: which() returns integer indices of TRUE elements, useful when you need row numbers (for reporting or further indexing).
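Row indices are handy for reporting, but they come with one well-known trap: removing rows with a negative which() result fails when nothing matches. The sketch below rebuilds class_data with the same seed; Age == 99 is a deliberately impossible value chosen so that the match is empty.

```r
# Rebuild class_data exactly as defined earlier so this block is self-contained
set.seed(2025)
ID <- 1:30
Name <- paste0("Stu", sprintf("%02d", ID))
Age <- sample(12:18, 30, replace = TRUE)
Gender <- sample(c("M", "F"), 30, replace = TRUE)
math_score <- pmin(pmax(round(rnorm(30, mean = 70, sd = 12)), 0), 100)
status <- ifelse(math_score >= 60, "Pass", "Fail")
class_data <- data.frame(ID, Name, Age, Gender, math_score, status, stringsAsFactors = FALSE)

# Use the indices to report which students matched
rows_age14 <- which(class_data$Age == 14)
class_data$Name[rows_age14]

# Pitfall: no student is 99, so which() returns integer(0)
none <- which(class_data$Age == 99)
wrong <- class_data[-none, ]                 # -integer(0) selects nothing: 0 rows
right <- class_data[class_data$Age != 99, ]  # logical negation keeps all 30 rows

nrow(wrong)  # 0
nrow(right)  # 30
```

For excluding rows, a negated logical condition is the safer choice; keep which() for cases where you need the row numbers themselves.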

Example 5 — Filter by multiple values using %in%

# Select students whose Age is 13, 15 or 17
sel_ages <- class_data[class_data$Age %in% c(13,15,17), ]
print(sel_ages)
  

Explanation: %in% checks membership in a vector—handy for selecting several possible values without long OR-chains.
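To make the link to OR-chains concrete, the sketch below compares %in% with the long form and shows how to select everything that is not in the set. It rebuilds class_data with the same seed; the vector name wanted is just an illustrative choice.

```r
# Rebuild class_data exactly as defined earlier so this block is self-contained
set.seed(2025)
ID <- 1:30
Name <- paste0("Stu", sprintf("%02d", ID))
Age <- sample(12:18, 30, replace = TRUE)
Gender <- sample(c("M", "F"), 30, replace = TRUE)
math_score <- pmin(pmax(round(rnorm(30, mean = 70, sd = 12)), 0), 100)
status <- ifelse(math_score >= 60, "Pass", "Fail")
class_data <- data.frame(ID, Name, Age, Gender, math_score, status, stringsAsFactors = FALSE)

wanted <- c(13, 15, 17)

# %in% gives the same mask as the equivalent OR-chain
long_form <- class_data$Age == 13 | class_data$Age == 15 | class_data$Age == 17
identical(class_data$Age %in% wanted, long_form)  # TRUE

# Negate the membership test to get every other age
sel_ages   <- class_data[class_data$Age %in% wanted, ]
other_ages <- class_data[!(class_data$Age %in% wanted), ]
nrow(sel_ages) + nrow(other_ages)  # 30

# Unlike ==, %in% never returns NA for a missing value
c(2, NA) %in% wanted  # FALSE FALSE
```

Because %in% returns FALSE rather than NA for missing values, it does not produce the all-NA rows that == can produce in bracket indexing.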

Example 6 — Pattern matching with grepl() (filter names starting with 'Stu0')

# Names starting with "Stu0" (Stu01 to Stu09, i.e., the first nine students)
mask <- grepl("^Stu0", class_data$Name)
class_data[mask, ]
  

Explanation: grepl() returns TRUE for pattern matches. This selects rows where the name begins with "Stu0".
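A few common variations on the pattern are sketched below: anchoring at the end of the string, ignoring case, matching literally, and using startsWith() as a regex-free alternative. The block rebuilds class_data with the same seed so that it runs on its own.

```r
# Rebuild class_data exactly as defined earlier so this block is self-contained
set.seed(2025)
ID <- 1:30
Name <- paste0("Stu", sprintf("%02d", ID))
Age <- sample(12:18, 30, replace = TRUE)
Gender <- sample(c("M", "F"), 30, replace = TRUE)
math_score <- pmin(pmax(round(rnorm(30, mean = 70, sd = 12)), 0), 100)
status <- ifelse(math_score >= 60, "Pass", "Fail")
class_data <- data.frame(ID, Name, Age, Gender, math_score, status, stringsAsFactors = FALSE)

# ^ anchors the pattern at the start of the string
starts_stu0 <- grepl("^Stu0", class_data$Name)
sum(starts_stu0)  # 9 (Stu01 to Stu09)

# startsWith() does the same prefix test without a regular expression
identical(starts_stu0, startsWith(class_data$Name, "Stu0"))  # TRUE

# $ anchors the pattern at the end of the string
ends_5 <- class_data[grepl("5$", class_data$Name), ]
ends_5$Name  # "Stu05" "Stu15" "Stu25"

# ignore.case = TRUE matches regardless of letter case
any_case <- grepl("^stu0", class_data$Name, ignore.case = TRUE)

# fixed = TRUE treats the pattern as literal text, so "." is a dot, not a wildcard
grepl(".", "Stu01")                # TRUE: "." matches any character
grepl(".", "Stu01", fixed = TRUE)  # FALSE: there is no literal dot
```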

Example 7 — Remove rows with missing values using complete.cases()

# Introduce NA for demonstration (work on a copy so class_data stays unchanged)
tmp <- class_data
tmp$math_score[c(2, 5)] <- NA

# Keep only complete rows (no NA in any column)
complete_rows <- tmp[complete.cases(tmp), ]
print(complete_rows)
  

Explanation: complete.cases() returns TRUE for rows without any NA values. Use it to clean datasets before modeling or summarizing.
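When only one column matters, is.na() is a more targeted alternative to complete.cases(). The sketch below rebuilds class_data with the same seed and, as an assumption for the demonstration, adds a missing Age in row 7 alongside the two missing scores.

```r
# Rebuild class_data exactly as defined earlier so this block is self-contained
set.seed(2025)
ID <- 1:30
Name <- paste0("Stu", sprintf("%02d", ID))
Age <- sample(12:18, 30, replace = TRUE)
Gender <- sample(c("M", "F"), 30, replace = TRUE)
math_score <- pmin(pmax(round(rnorm(30, mean = 70, sd = 12)), 0), 100)
status <- ifelse(math_score >= 60, "Pass", "Fail")
class_data <- data.frame(ID, Name, Age, Gender, math_score, status, stringsAsFactors = FALSE)

# Copy with missing values in two different columns
tmp <- class_data
tmp$math_score[c(2, 5)] <- NA
tmp$Age[7] <- NA

# Count missing values per column before deciding how to filter
colSums(is.na(tmp))

# complete.cases(): drops a row if ANY column is NA
all_complete <- tmp[complete.cases(tmp), ]
nrow(all_complete)  # 27

# is.na() on one column: drops only rows where math_score is missing
has_score <- tmp[!is.na(tmp$math_score), ]
nrow(has_score)  # 28

# complete.cases() on selected columns: same idea for several columns at once
score_cols <- tmp[complete.cases(tmp[, c("ID", "math_score")]), ]
nrow(score_cols)  # 28
```

The row with the missing Age survives the last two filters, which is usually what you want when Age plays no role in the analysis.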

Example 8 — Select specific columns while filtering

# Filter Pass students and only show ID, Name, math_score
pass_summary <- class_data[class_data$status == "Pass", c("ID","Name","math_score")]
head(pass_summary)
  

Explanation: The second argument of [rows, cols] can select specific columns by name or numeric index—useful for concise output.
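One detail of column selection is worth a short sketch: asking for a single column returns a plain vector unless drop = FALSE is set. The block rebuilds class_data with the same seed so that it runs on its own; the variable names are illustrative.

```r
# Rebuild class_data exactly as defined earlier so this block is self-contained
set.seed(2025)
ID <- 1:30
Name <- paste0("Stu", sprintf("%02d", ID))
Age <- sample(12:18, 30, replace = TRUE)
Gender <- sample(c("M", "F"), 30, replace = TRUE)
math_score <- pmin(pmax(round(rnorm(30, mean = 70, sd = 12)), 0), 100)
status <- ifelse(math_score >= 60, "Pass", "Fail")
class_data <- data.frame(ID, Name, Age, Gender, math_score, status, stringsAsFactors = FALSE)

pass <- class_data$status == "Pass"

# A single column comes back as a vector by default
as_vector <- class_data[pass, "math_score"]
is.data.frame(as_vector)  # FALSE

# drop = FALSE keeps the result as a one-column data frame
as_frame <- class_data[pass, "math_score", drop = FALSE]
is.data.frame(as_frame)   # TRUE

# Columns can also be chosen by position (1 = ID, 2 = Name, 5 = math_score)
by_position <- class_data[pass, c(1, 2, 5)]
names(by_position)  # "ID" "Name" "math_score"
```

Selecting by name is usually more robust than selecting by position, because positions change if columns are added or reordered.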

Tips and common pitfalls

  • Use & and | for element-wise logicals; avoid && and || when filtering.
  • Check lengths of logical vectors; a recycled shorter vector can produce unexpected results.
  • Prefer explicit checks before filtering (e.g., confirm column existence with "math_score" %in% names(class_data)).
  • Use which() when you need row numbers for further processing or reports.
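The recycling pitfall from the tips above is easy to reproduce with a deliberately short mask. This is a minimal sketch on class_data, rebuilt with the same seed; short_mask is a made-up example of a condition with the wrong length.

```r
# Rebuild class_data exactly as defined earlier so this block is self-contained
set.seed(2025)
ID <- 1:30
Name <- paste0("Stu", sprintf("%02d", ID))
Age <- sample(12:18, 30, replace = TRUE)
Gender <- sample(c("M", "F"), 30, replace = TRUE)
math_score <- pmin(pmax(round(rnorm(30, mean = 70, sd = 12)), 0), 100)
status <- ifelse(math_score >= 60, "Pass", "Fail")
class_data <- data.frame(ID, Name, Age, Gender, math_score, status, stringsAsFactors = FALSE)

# A mask of length 2 is silently recycled across all 30 rows
short_mask <- c(TRUE, FALSE)
every_other <- class_data[short_mask, ]
nrow(every_other)  # 15: every second row, with no warning
every_other$ID     # 1 3 5 ... 29

# A cheap guard: confirm the mask has one element per row before filtering
cond <- class_data$math_score >= 60
stopifnot(length(cond) == nrow(class_data))
passed <- class_data[cond, ]
```

A length check like the stopifnot() call above turns a silent recycling mistake into an immediate, visible error.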

Practice Exercises (Self-assessment)

  1. Using class_data, produce a data frame of students aged 16 or older who passed. Show code and resulting rows.
  2. Find all male students with math_score between 65 and 85 (inclusive). Provide the R code and output.
  3. Use subset() to list students whose Name contains '05' (pattern matching). Show code and result.
  4. Demonstrate how to safely attempt to filter on a non-existent column (e.g., height) and print a user-friendly message instead of an error.
  5. Create a filtered summary that selects rows with no missing values and displays only ID, Name, math_score. Show code and a short explanation.

Answer Format (How to present answers)

## Exercise #n — Short title
# R code
...R code here...

# Output (printed):
...expected printed output (e.g., print(...), head(...))...

# Short explanation (2-4 sentences)
Explanation...
  

Example Solutions (Concise)

# Ex 1: Age >= 16 and Pass
subset(class_data, Age >= 16 & status == "Pass")

# Ex 2: Male with math_score between 65 and 85
class_data[class_data$Gender == "M" & class_data$math_score >= 65 & class_data$math_score <= 85, ]

# Ex 3: Name contains '05'
subset(class_data, grepl("05", Name))

# Ex 4: Safe filter on non-existent column
if("height" %in% names(class_data)){
  class_data[class_data$height > 150, ]
} else {
  message("Column 'height' not found. No filtering applied.")
}

# Ex 5: Keep complete rows and show columns
tmp <- class_data
tmp$math_score[c(2,5)] <- NA  # simulate NA
tmp_clean <- tmp[complete.cases(tmp), c("ID","Name","math_score")]
head(tmp_clean)
  

Closing notes

Filtering is a core skill for preparing data subsets for analysis and visualization. Practice using logical indexing, subset(), which(), %in%, and grepl() on the class_data dataset until these patterns become second nature. Always check for missing values and verify column names before filtering.
