Exploring Built-in Datasets in R: Learn mtcars, iris & airquality with Summary, Structure & Manipulation

R Built-in Datasets Study Guide

R provides several built-in datasets that help learners explore and practice data analysis easily without the need to import external data. These datasets, such as mtcars, iris, and airquality, are excellent for understanding R’s data structures, descriptive statistics, and manipulation techniques.

Each dataset serves a unique learning purpose. For example, mtcars includes automobile data like mileage, horsepower, and weight; iris contains flower measurements for different species; and airquality offers daily air measurements in New York. Students can explore them using R’s built-in functions like summary(), str(), head(), and nrow().

Below are detailed explanations, examples, and R code snippets to help understand and manipulate each dataset effectively.
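To see which datasets ship with R, the datasets package can be queried directly. The following is a minimal sketch, assuming a standard R installation; the variable name builtin_items is only illustrative.

```r
# List the names of all datasets in the 'datasets' package
builtin_items <- data(package = "datasets")$results[, "Item"]

# Check that the three datasets used in this guide are available
c("mtcars", "iris", "airquality") %in% builtin_items

# Open the documentation page for a dataset (variables, units, source)
# ?mtcars
```

The help page for each dataset describes its variables and units, which is worth reading before any analysis.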

🚗 mtcars Dataset

The mtcars dataset contains information about fuel consumption and other automobile design aspects for 32 car models. It has 11 columns, including variables such as mpg (miles per gallon), hp (horsepower), and wt (weight in 1000 lbs).

Let’s start by loading and exploring the dataset:

# Load the dataset
data(mtcars)

# Display the first few rows
head(mtcars)

# Get the structure of the dataset
str(mtcars)

# Summary statistics for each column
summary(mtcars)

Explanation:

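data(mtcars) copies the dataset into your workspace. The call is optional, because datasets from the datasets package are available by name as soon as R starts.
head() displays the first six rows so you can see the data layout.
str() shows the dimensions and the type of each variable; in mtcars all 11 columns are numeric.
summary() reports the minimum, quartiles, median, mean, and maximum of each numeric column.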
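The size and column types can also be checked individually. This is a small sketch using only base R functions; col_classes is an illustrative name.

```r
# Dimensions: number of rows (cars) and columns (variables)
dim(mtcars)
nrow(mtcars)
ncol(mtcars)

# Class of every column, returned as a named character vector
col_classes <- sapply(mtcars, class)
col_classes

# Row names hold the car models
head(rownames(mtcars), 3)
```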

The wt column in mtcars stores each car's weight in units of 1000 lbs, so a value of 2.62 means 2,620 lbs. To make the weights easier to compare with metric figures, we can convert them to kilograms (kg).

Here’s how to perform the conversion correctly, save it in a new dataset called data_mtcars, and replace the original wt column:

# 1️⃣ Make a copy of the mtcars dataset
data_mtcars <- mtcars

# 2️⃣ Create a new column 'weight_kg' (wt is in 1000 lbs)
data_mtcars$weight_kg <- data_mtcars$wt * 1000 * 0.45359237

# 3️⃣ Replace the old 'wt' column with the new one (in kilograms)
data_mtcars$wt <- data_mtcars$weight_kg

# 4️⃣ Remove the extra 'weight_kg' column to avoid duplication
data_mtcars$weight_kg <- NULL

# 5️⃣ Rename 'wt' to 'weight_kg' for clarity
colnames(data_mtcars)[colnames(data_mtcars) == "wt"] <- "weight_kg"

# 6️⃣ View the updated dataset
head(data_mtcars)

Step-by-Step Explanation:

Step 1: Make a copy of the dataset using data_mtcars <- mtcars so the original data remains safe.
Step 2: Convert the weight from units of 1000 lbs to kilograms using the formula wt * 1000 * 0.45359237 (since 1 lb = 0.45359237 kg).
Step 3: Replace the original wt column values with the newly calculated weight in kilograms.
Step 4: Remove the temporary weight_kg column to keep the dataset tidy.
Step 5: Rename wt to weight_kg to reflect the new unit of measurement.
Step 6: Display the updated dataset using head() to verify the changes.

After executing the above steps, data_mtcars has the same 32 rows and 11 columns as mtcars, except that the wt column has been replaced by weight_kg, which holds the weights in kilograms instead of 1000 lbs.
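The same result can be reached without the temporary column. The sketch below is one possible shortcut, not the only way to do it; the names lb_to_kg and data_mtcars_direct are illustrative, and datasets::mtcars is used so the example starts from an unmodified copy.

```r
lb_to_kg <- 0.45359237                    # kilograms per pound

data_mtcars_direct <- datasets::mtcars    # work on a fresh copy

# Overwrite wt (in 1000 lbs) with the value in kilograms
data_mtcars_direct$wt <- data_mtcars_direct$wt * 1000 * lb_to_kg

# Rename the column so the name reflects the new unit
names(data_mtcars_direct)[names(data_mtcars_direct) == "wt"] <- "weight_kg"

# Sanity check: the new column should equal the original wt converted to kg
all.equal(data_mtcars_direct$weight_kg, datasets::mtcars$wt * 1000 * lb_to_kg)
```

Overwriting and renaming in place keeps the column in its original position, which matches the outcome of the six-step version.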

🌸 iris Dataset

The iris dataset is one of the most famous datasets in statistics and machine learning. It contains 150 observations of three species of iris flowers — setosa, versicolor, and virginica — with four numeric features: Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width.

Let's start by exploring and filtering the dataset:

# Load the dataset
data(iris)

# View structure and summary
str(iris)
summary(iris)

# Filter data for a specific species (Setosa)
iris_setosa <- subset(iris, Species == "setosa")

# View filtered data
head(iris_setosa)

Explanation:

data(iris) loads the dataset into the R environment.
str() displays the data structure (types and dimensions).
summary() gives quick descriptive statistics for each variable.
subset() filters records matching a specific condition — here, only the setosa species.
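One detail that often surprises learners is that a filtered factor column keeps its unused levels. The sketch below shows the equivalent bracket-indexing filter and how droplevels() tidies the result; the object names are illustrative.

```r
# Same filter written with logical indexing
iris_setosa_idx <- iris[iris$Species == "setosa", ]

# The filtered data still carries all three factor levels
table(iris_setosa_idx$Species)

# droplevels() removes the levels that no longer occur
iris_setosa_clean <- droplevels(iris_setosa_idx)
levels(iris_setosa_clean$Species)
```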

🔹 Exploring iris Further: Group-wise Averages using `aggregate()`

We can calculate average measurements (mean) for each species to understand how they differ in size:

# Calculate mean values for each feature grouped by species
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)

# View results
species_means

Explanation:

aggregate() groups data by the Species column.
The . ~ Species formula means "apply the function to all other columns grouped by Species."
FUN = mean computes the average value for each numeric variable.
The result shows how sepal and petal sizes vary across species.
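For a single variable, tapply() gives the same group means in a more compact form. This is a brief sketch for comparison; petal_means and agg_means are illustrative names.

```r
# Mean petal length per species with tapply()
petal_means <- tapply(iris$Petal.Length, iris$Species, mean)
petal_means

# Same numbers from aggregate(), for comparison
agg_means <- aggregate(Petal.Length ~ Species, data = iris, FUN = mean)
all.equal(unname(as.vector(petal_means)), agg_means$Petal.Length)
```

aggregate() returns a data frame, which is convenient for further manipulation, while tapply() returns a named array that is quick to read at the console.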

🔹 Sorting Data by Petal Length using `order()`

We can sort the dataset by petal length to see the smallest and largest flowers:

# Sort the iris dataset by Petal.Length (ascending order)
iris_sorted <- iris[order(iris$Petal.Length), ]

# Display the first 6 and last 6 rows
head(iris_sorted)
tail(iris_sorted)

Explanation:

order() does not sort the values itself; it returns the row positions that would put the column in ascending order (the default).
iris[order(iris$Petal.Length), ] reorders all rows based on Petal.Length.
head() and tail() show the smallest and largest petal lengths respectively.
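It can help to see what order() returns on a tiny vector before applying it to a data frame. The sketch below uses a made-up vector x, then sorts iris by two keys; iris_by_species is an illustrative name.

```r
x <- c(30, 10, 20)

order(x)                          # positions of the sorted values: 2 3 1
x[order(x)]                       # values in ascending order: 10 20 30
x[order(x, decreasing = TRUE)]    # values in descending order: 30 20 10

# Sort iris by Species, then by Petal.Length descending within each species
iris_by_species <- iris[order(iris$Species, -iris$Petal.Length), ]
head(iris_by_species, 3)
```

Negating a numeric key is a common way to reverse the order of one column while leaving the others ascending.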

🔹 Bonus: Subsetting Multiple Conditions

We can also filter the data based on multiple conditions — for example, all flowers of species versicolor with Petal.Length > 4 cm:

# Filter flowers that are Versicolor and have Petal.Length > 4 cm
iris_versicolor_long <- subset(iris, Species == "versicolor" & Petal.Length > 4)

# View result
head(iris_versicolor_long)

Explanation:

Multiple conditions can be combined using logical operators (& for AND, | for OR).
This helps in extracting targeted subsets of data for deeper analysis.
It's an essential skill when cleaning or preparing data for modeling.
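The other logical operators work the same way inside subset(). The following sketch shows OR, %in%, and a negated condition; the thresholds 1.2 and 6.5 are arbitrary values chosen for illustration.

```r
# OR: flowers with very short or very long petals
iris_extreme <- subset(iris, Petal.Length < 1.2 | Petal.Length > 6.5)

# %in% matches against several values at once
iris_two_species <- subset(iris, Species %in% c("versicolor", "virginica"))

# NOT: everything except setosa (selects the same rows as the %in% filter)
iris_not_setosa <- subset(iris, Species != "setosa")
all(rownames(iris_two_species) == rownames(iris_not_setosa))
```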

Overall, the iris dataset is excellent for practicing concepts like filtering, grouping, sorting, and conditional subsetting — fundamental steps in any data analysis or machine learning workflow in R.

🌤️ airquality Dataset

The airquality dataset contains daily air quality measurements in New York from May to September 1973. It includes variables such as:

Ozone – Mean ozone in parts per billion (ppb)
Solar.R – Solar radiation in Langleys
Wind – Average wind speed (mph)
Temp – Maximum daily temperature (°F)
Month – Month of observation (5 = May, ..., 9 = September)
Day – Day of the month

This dataset is useful for learning data cleaning, transformation, and basic statistical analysis in R.

🔹 Step 1: Load and Inspect the Dataset

# Load the dataset
data(airquality)

# Display first few rows
head(airquality)

# View structure and summary
str(airquality)
summary(airquality)

Explanation:

data(airquality) loads the dataset from R's built-in datasets package.
head() shows the first few rows, helping you see variable names and formats.
str() reveals data types (numeric, integer, etc.) and number of observations.
summary() provides the minimum, quartiles, median, mean, and maximum of each column, along with a count of missing values (NA's) where any exist.
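A per-column count of missing values is a useful complement to summary(). This sketch reads the untouched data through datasets::airquality so it is unaffected by any later changes to a workspace copy; aq_raw and na_counts are illustrative names.

```r
aq_raw <- datasets::airquality   # untouched copy from the datasets package

dim(aq_raw)                      # rows (days) and columns (variables)

# Missing values per column
na_counts <- colSums(is.na(aq_raw))
na_counts

# Rows with no missing value in any column
sum(complete.cases(aq_raw))
```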

🔹 Step 2: Handling Missing Values

Two columns contain missing values (NAs): Ozone has 37 and Solar.R has 7. We can detect and handle them using:

# Count missing values in Ozone
sum(is.na(airquality$Ozone))

# Replace missing Ozone values with mean
airquality$Ozone[is.na(airquality$Ozone)] <- mean(airquality$Ozone, na.rm = TRUE)

# View updated dataset
summary(airquality)

Explanation:

is.na() identifies missing values (returns TRUE/FALSE).
sum() counts total missing entries in the Ozone column.
mean(..., na.rm = TRUE) calculates the average while ignoring NAs.
We then replace the missing Ozone values with that mean so the Ozone column is complete. Solar.R is not changed by this step and still contains NAs.
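Because imputed values are indistinguishable from measured ones afterwards, it can be worth recording which rows were filled in. The sketch below works on a copy and adds a flag column; aq, ozone_missing, ozone_mean, and Ozone_imputed are illustrative names, not part of the dataset.

```r
aq <- datasets::airquality            # work on a copy of the original data

ozone_missing <- is.na(aq$Ozone)      # TRUE where Ozone was not measured
ozone_mean <- mean(aq$Ozone, na.rm = TRUE)

aq$Ozone_imputed <- ozone_missing     # remember which rows get filled in
aq$Ozone[ozone_missing] <- ozone_mean

# Number of imputed rows, and confirmation that no NA remains in Ozone
sum(aq$Ozone_imputed)
anyNA(aq$Ozone)
```

Filling with the mean leaves the column mean unchanged but reduces its spread, so the flag makes it possible to exclude imputed rows later if needed.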

🔹 Step 3: Create New Derived Columns

We can add new columns to make data analysis more meaningful — for example, converting temperature to Celsius and categorizing air quality levels.

# Convert temperature from Fahrenheit to Celsius
airquality$Temp_C <- round((airquality$Temp - 32) * 5/9, 1)

# Create air quality categories based on Ozone levels
airquality$Ozone_Level <- ifelse(airquality$Ozone > 100, "High",
                           ifelse(airquality$Ozone > 50, "Moderate", "Low"))

# View a few rows
head(airquality)

Explanation:

(Temp - 32) * 5/9 converts Fahrenheit to Celsius.
round(..., 1) rounds the result to one decimal place for readability.
ifelse() creates a new categorical column — "Low," "Moderate," or "High" ozone levels.
New derived columns help students connect data cleaning with real-world interpretation.
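When there are more than two categories, cut() is a common alternative to nested ifelse() calls. The sketch below uses a short made-up vector that includes the boundary values 50 and 100, and checks that both approaches agree; the names ozone, level_ifelse, and level_cut are illustrative.

```r
ozone <- c(10, 50, 51, 100, 101, 168)   # illustrative values, including the boundaries

# Nested ifelse(), using the same thresholds as the step above
level_ifelse <- ifelse(ozone > 100, "High",
                       ifelse(ozone > 50, "Moderate", "Low"))

# cut() with right-closed intervals: (-Inf, 50], (50, 100], (100, Inf)
level_cut <- cut(ozone,
                 breaks = c(-Inf, 50, 100, Inf),
                 labels = c("Low", "Moderate", "High"))

# Both approaches assign the same category to every value
all(as.character(level_cut) == level_ifelse)

# A factor keeps the categories in a meaningful order when tabulated
table(level_cut)
```

cut() returns a factor whose levels follow the order of the breaks, so tables and plots list Low, Moderate, High rather than sorting alphabetically.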

🔹 Step 4: Monthly Air Quality Averages

We can group data by Month to find how air quality changes over time using aggregate():

# Compute monthly average Ozone and Temperature
monthly_avg <- aggregate(cbind(Ozone, Temp_C) ~ Month, data = airquality, FUN = mean)

# View result
monthly_avg

Explanation:

aggregate() groups data by a variable — here Month.
cbind(Ozone, Temp_C) combines multiple numeric columns for group averaging.
FUN = mean applies the mean function to each group.
The output helps identify which months had higher ozone and temperature levels.
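It is worth knowing how the formula interface of aggregate() treats missing values: by default, rows with an NA in any variable named in the formula are dropped before the function is applied. The sketch below illustrates this on the unmodified data and compares the result with tapply(); the object names are illustrative.

```r
aq_raw <- datasets::airquality

# Formula interface: rows with NA in Ozone are dropped before averaging
ozone_by_month <- aggregate(Ozone ~ Month, data = aq_raw, FUN = mean)

# Equivalent calculation with tapply() and an explicit na.rm = TRUE
ozone_tapply <- tapply(aq_raw$Ozone, aq_raw$Month, mean, na.rm = TRUE)

all.equal(ozone_by_month$Ozone, unname(as.vector(ozone_tapply)))
```

When several columns are combined with cbind(), a row that is missing in any one of them is dropped for all of them, so the averages of the other columns are computed on fewer rows than you might expect.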

🔹 Step 5: Filtering Specific Conditions

We can filter records where temperature is high and wind speed is low — conditions that may indicate poor air quality:

# Filter for hot days with low wind
poor_air_days <- subset(airquality, Temp_C > 30 & Wind < 8)

# View filtered records
head(poor_air_days)

Explanation:

subset() extracts rows meeting specific logical conditions.
Temp_C > 30 filters hot days.
Wind < 8 filters calm wind conditions (less air movement).
This subset helps analyze when ozone concentrations might be unusually high.
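To check whether hot, calm days really coincide with higher ozone, the same condition can be used as a grouping flag. This is a rough sketch on the unmodified data, using only measured ozone values; temp_c and hot_calm are illustrative names.

```r
aq_raw <- datasets::airquality
temp_c <- (aq_raw$Temp - 32) * 5 / 9

# Logical flag for hot, calm days (same thresholds as the filter above)
hot_calm <- temp_c > 30 & aq_raw$Wind < 8

# Number of days in each group
table(hot_calm)

# Mean measured ozone on hot, calm days versus all other days
tapply(aq_raw$Ozone, hot_calm, mean, na.rm = TRUE)
```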

Overall, the airquality dataset helps students practice data cleaning, transformation, group analysis, and conditional filtering — all crucial skills for data analytics in R.

🧠 Practice Exercise for Self-Assessment

Load all three datasets (mtcars, iris, and airquality) into your R environment.
Use summary() and str() to explore each dataset's structure and key statistics.
In mtcars, add a new column for weight in kilograms.
In iris, create a subset containing only the species "virginica."
In airquality, replace missing Ozone values with the column's mean.

✅ Answer & Explanation

# Load datasets
data(mtcars)
data(iris)
data(airquality)

# Explore structure and summary
str(mtcars); summary(mtcars)
str(iris); summary(iris)
str(airquality); summary(airquality)

# Add weight in kilograms (mtcars)
mtcars$weight_kg <- mtcars$wt * 1000 * 0.45359237

# Create subset for virginica species (iris)
iris_virginica <- subset(iris, Species == "virginica")

# Replace missing Ozone values (airquality)
airquality$Ozone[is.na(airquality$Ozone)] <- mean(airquality$Ozone, na.rm = TRUE)

This exercise helps students practice dataset exploration, data cleaning, and manipulation—all essential for real-world data analysis in R.
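One way to self-check the answers is to state the expected properties with stopifnot(). The sketch below recreates the three answers on fresh copies so it can be run at any point; the names cars, virginica, and aq are illustrative.

```r
# Recreate the three answers on fresh copies, then verify them
cars <- datasets::mtcars
cars$weight_kg <- cars$wt * 1000 * 0.45359237

virginica <- subset(datasets::iris, Species == "virginica")

aq <- datasets::airquality
aq$Ozone[is.na(aq$Ozone)] <- mean(aq$Ozone, na.rm = TRUE)

# stopifnot() is silent when every condition holds and errors otherwise
stopifnot(
  ncol(cars) == 12,                        # 11 original columns plus weight_kg
  all(virginica$Species == "virginica"),
  nrow(virginica) == 50,
  !anyNA(aq$Ozone)
)
```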