Student survey (Complete)
Introduction
In this code-along, we’ll work with a small but fairly messy survey dataset on favorite foods and some other information about school-aged children.
Packages
We will use the tidyverse for our analysis and janitor for cleaning up variable names.
library(tidyverse)
library(janitor)
Data
The data are synthetic, which lets us make a few important points quickly.
Analysis
- Read the data in and inspect it.
students_raw <- read_csv("https://data-science-with-r.github.io/data/students-raw.csv")
Rows: 6 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Full Name, favourite.food, mealPlan, AGE
dbl (1): Student ID
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
students_raw
# A tibble: 6 × 5
  `Student ID` `Full Name`      favourite.food     mealPlan            AGE
         <dbl> <chr>            <chr>              <chr>               <chr>
1            1 Sunil Huffmann   Strawberry yoghurt Lunch only          4
2            2 Barclay Lynn     French fries       Lunch only          5
3            3 Jayendra Lyne    N/A                Breakfast and lunch 7
4            4 Leon Rossini     Anchovies          Lunch only          <NA>
5            5 Chidiegwu Dunkel Pizza              Breakfast and lunch five
6            6 Güvenç Attila    Ice cream          Lunch only          6
- Fix the variable names.
students_raw |>
  janitor::clean_names()
# A tibble: 6 × 5
  student_id full_name        favourite_food     meal_plan           age
       <dbl> <chr>            <chr>              <chr>               <chr>
1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4
2          2 Barclay Lynn     French fries       Lunch only          5
3          3 Jayendra Lyne    N/A                Breakfast and lunch 7
4          4 Leon Rossini     Anchovies          Lunch only          <NA>
5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five
6          6 Güvenç Attila    Ice cream          Lunch only          6
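As a minimal sketch of the renaming step, the same conversion can be tried on a plain character vector with janitor::make_clean_names(). The vector below is simply the raw column names copied from the output above; the object names raw_names and clean are illustrative only.

```r
library(janitor)

# Column names copied from the raw data shown above
raw_names <- c("Student ID", "Full Name", "favourite.food", "mealPlan", "AGE")

# make_clean_names() converts a character vector to snake_case names,
# the same style that clean_names() gives a data frame's columns
clean <- make_clean_names(raw_names)
clean
```

Spaces, dots, camelCase, and all-caps names all end up in one consistent snake_case style, which makes the columns easier to type without backticks.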
- Handle NAs.
read_csv("https://data-science-with-r.github.io/data/students-raw.csv", na = c("", "N/A")) |>
  janitor::clean_names()
Rows: 6 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Full Name, favourite.food, mealPlan, AGE
dbl (1): Student ID
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 6 × 5
  student_id full_name        favourite_food     meal_plan           age
       <dbl> <chr>            <chr>              <chr>               <chr>
1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4
2          2 Barclay Lynn     French fries       Lunch only          5
3          3 Jayendra Lyne    <NA>               Breakfast and lunch 7
4          4 Leon Rossini     Anchovies          Lunch only          <NA>
5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five
6          6 Güvenç Attila    Ice cream          Lunch only          6
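The effect of the na argument can be seen in isolation on a tiny inline CSV. This is only a sketch: the data in toy_csv are made up for illustration, and wrapping the string in I() (to mark it as literal data) assumes readr 2.0 or later.

```r
library(readr)

# Hypothetical three-row CSV: one real value, one "N/A" string, one empty field
toy_csv <- I("id,food\n1,Pizza\n2,N/A\n3,\n")

# Default missing-value strings are "" and "NA", so "N/A" is kept as text
default_na <- read_csv(toy_csv, show_col_types = FALSE)

# Adding "N/A" to the na argument turns it into a real missing value
custom_na <- read_csv(toy_csv, na = c("", "N/A"), show_col_types = FALSE)

is.na(default_na$food)
is.na(custom_na$food)
```

Note that "" stays in the na vector: supplying na replaces the defaults rather than adding to them, so dropping "" would stop empty fields from being read as missing.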
- Inspect variable types and apply fixes where appropriate.
read_csv("https://data-science-with-r.github.io/data/students-raw.csv", na = c("", "N/A")) |>
  janitor::clean_names() |>
  mutate(
    age = if_else(age == "five", "5", age),
    age = as.numeric(age)
  )
Rows: 6 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Full Name, favourite.food, mealPlan, AGE
dbl (1): Student ID
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 6 × 5
  student_id full_name        favourite_food     meal_plan             age
       <dbl> <chr>            <chr>              <chr>               <dbl>
1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4
2          2 Barclay Lynn     French fries       Lunch only              5
3          3 Jayendra Lyne    <NA>               Breakfast and lunch     7
4          4 Leon Rossini     Anchovies          Lunch only             NA
5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5
6          6 Güvenç Attila    Ice cream          Lunch only              6
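To see why the recoding has to happen before the conversion, here is a small sketch on a standalone vector. The values in ages are copied from the age column above; the names direct and recoded are illustrative.

```r
library(dplyr)

# Values copied from the age column above (still character at this point)
ages <- c("4", "5", "7", NA, "five", "6")

# Converting directly cannot parse "five": it becomes NA, and R emits a
# coercion warning (suppressed here to keep the output quiet)
direct <- suppressWarnings(as.numeric(ages))

# Recoding "five" to "5" first preserves the value; the genuinely
# missing entry stays NA
recoded <- as.numeric(if_else(ages == "five", "5", ages))

direct
recoded
```

Skipping the recode would silently turn a known age into a missing one, which would change any summary computed on the column later.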
- Inspect variable classes and apply fixes where appropriate. Save the resulting data frame as students.
students <- read_csv("https://data-science-with-r.github.io/data/students-raw.csv", na = c("", "N/A")) |>
  janitor::clean_names() |>
  mutate(
    age = if_else(age == "five", "5", age),
    age = as.numeric(age),
    meal_plan = as.factor(meal_plan),
    meal_plan = fct_relevel(meal_plan, "Lunch only", "Breakfast and lunch")
  )
Rows: 6 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Full Name, favourite.food, mealPlan, AGE
dbl (1): Student ID
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
students |>
  group_by(meal_plan) |>
  summarize(mean_age = mean(age, na.rm = TRUE))
# A tibble: 2 × 2
  meal_plan           mean_age
  <fct>                  <dbl>
1 Lunch only                 5
2 Breakfast and lunch        6
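The row order in the summary above follows the factor's level order. As a rough sketch of what fct_relevel() changes, the snippet below compares the levels before and after releveling; the meal_plan vector is copied from the data, and the other object names are illustrative.

```r
library(forcats)

# Values copied from the meal_plan column above
meal_plan <- c("Lunch only", "Lunch only", "Breakfast and lunch",
               "Lunch only", "Breakfast and lunch", "Lunch only")

# as.factor() orders the levels alphabetically
default_levels <- levels(as.factor(meal_plan))

# fct_relevel() moves the named levels to the front, in the order given
releveled <- levels(
  fct_relevel(as.factor(meal_plan), "Lunch only", "Breakfast and lunch")
)

default_levels
releveled
```

Without the releveling step, "Breakfast and lunch" would come first in grouped summaries and plot axes.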
- Write out the students object to a CSV file in the data folder of your working directory.
write_csv(students, file = "data/students.csv")
- Read in the newly created students.csv and inspect the variable types and classes. Do you observe anything unexpected?
read_csv("data/students.csv")
Rows: 6 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): full_name, favourite_food, meal_plan
dbl (2): student_id, age
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 6 × 5
  student_id full_name        favourite_food     meal_plan             age
       <dbl> <chr>            <chr>              <chr>               <dbl>
1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4
2          2 Barclay Lynn     French fries       Lunch only              5
3          3 Jayendra Lyne    <NA>               Breakfast and lunch     7
4          4 Leon Rossini     Anchovies          Lunch only             NA
5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5
6          6 Güvenç Attila    Ice cream          Lunch only              6
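A CSV file stores only text, so the factor comes back as a character column. One possible way to restore it on import, sketched here on a small made-up data frame written to a temporary file, is to declare the column type with col_factor(). The objects toy, path, plain, and restored are hypothetical and not part of the exercise.

```r
library(readr)
library(tibble)

# Hypothetical toy data with a factor column in a chosen level order
toy <- tibble(
  id = 1:3,
  meal_plan = factor(
    c("Lunch only", "Breakfast and lunch", "Lunch only"),
    levels = c("Lunch only", "Breakfast and lunch")
  )
)

# Write to a temporary CSV so the sketch does not touch the data folder
path <- tempfile(fileext = ".csv")
write_csv(toy, path)

# Read back with guessed types: meal_plan is a plain character column
plain <- read_csv(path, show_col_types = FALSE)

# Read back while declaring the factor and its level order explicitly
restored <- read_csv(
  path,
  col_types = cols(
    id = col_integer(),
    meal_plan = col_factor(levels = c("Lunch only", "Breakfast and lunch"))
  )
)

class(plain$meal_plan)
levels(restored$meal_plan)
```

The level order has to be restated in code because the file itself carries no record of it.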
- Write out the students object to an RDS file in the data folder of your working directory.
write_rds(students, file = "data/students.rds")
- Read in the newly created students.rds and inspect the variable types and classes. How is this result different from the CSV file you read in?
read_rds("data/students.rds")
# A tibble: 6 × 5
  student_id full_name        favourite_food     meal_plan             age
       <dbl> <chr>            <chr>              <fct>               <dbl>
1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4
2          2 Barclay Lynn     French fries       Lunch only              5
3          3 Jayendra Lyne    <NA>               Breakfast and lunch     7
4          4 Leon Rossini     Anchovies          Lunch only             NA
5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5
6          6 Güvenç Attila    Ice cream          Lunch only              6
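The RDS result keeps meal_plan as a factor because the format stores the R object itself rather than a text rendering of it. A minimal round-trip sketch, using a made-up data frame and a temporary file (toy, path, and round_trip are illustrative names), shows the class and level order surviving:

```r
library(readr)
library(tibble)

# Hypothetical toy data with a factor column in a chosen level order
toy <- tibble(
  id = 1:3,
  meal_plan = factor(
    c("Lunch only", "Breakfast and lunch", "Lunch only"),
    levels = c("Lunch only", "Breakfast and lunch")
  )
)

# Write to a temporary RDS file and read it straight back
path <- tempfile(fileext = ".rds")
write_rds(toy, path)
round_trip <- read_rds(path)

# The factor class and its level order are preserved
class(round_trip$meal_plan)
levels(round_trip$meal_plan)
```

The trade-off is that RDS files are specific to R, whereas a CSV can be opened by almost any tool.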