Nobel Prize winners and sales

Packages

We will use the following two packages in this application exercise.

tidyverse: For data import, wrangling, and visualization.
readxl: For importing data from MS Excel.

Let’s first load the packages.

library(tidyverse)
library(readxl)

Reading CSV files

For this programming exercise, we are going to work with a data on Nobel Prize winners from the years 1901 through 2018, called nobel.csv.

Load the dataset with read_csv() and save it as a data frame (a tibble) called nobel. How many rows are there in nobel and what does each row represent? How many columns are there in nobel?

Now, read the data in again, but this time add the argument show_col_types to the read_csv() function and set it to `FALSE. This will suppress the information on column specifications (what type each column is).

Variable types and classes

We are going to use the nobel dataset to explore how to change types of variables and what can go wrong along the way.

Using glimpse(), identify what type of variable category is in the nobel data frame. Additionally, identify the values this variable takes.

Solution

category is a character variable. This variable takes on the following values: Physics, Chemistry, Peace, Medicine, Literature, and Economics.

glimpse(nobel)

nobel |>
  distinct(category)

Suppose we want to convert the the levels of category into numeric values.

Run the following code. What happens?

It makes sense that R can’t just represent character strings as numbers. Instead of converting a character variable to a numeric, we might consider converting it to a factor variable instead – how R represents categorical data.

Convert category to factor and save it in the nobel data frame. Then, report its levels.

Reading Excel files

For the remainder of this programming exercise, we are going read in data from an Excel file and then clean it up.

First, read the data in with read_excel from the Excel file called "sales.xlsx" and print the data to screen. Is this a tidy data frame?

See the help file for read_excel() here. Notice that, in the current sales dataset, we want to skip 3 rows (the first row acts as a header for the column names). We can do this using the skip argument. At the same time, we can specify names for each column by giving the argument col_names a character vector.

Read in the file called "sales.xlsx", experimenting with skip and col_names arguments to skip 3 lines and name the columns id and n.

Putting it all together

Manipulate the sales data such such that it looks like the following using functions you have learned from the tidyverse to transform data.

Hint 3

If a row has the text "Brand", leave the value of id, otherwise name it NA.

sales |>
  mutate(
    is_brand_name = str_detect(id, "Brand"),
    brand = if_else(is_brand_name, id, NA)
  ) |>
  ___
  ___
  ___

Hint 4

Fill in empty rows in the brand column.

sales |>
  mutate(
    is_brand_name = str_detect(id, "Brand"),
    brand = if_else(is_brand_name, id, NA)
  ) |>
  fill(brand) |>
  ___
  ___

Hint 5

Filter out any rows with NA in the is_brand_name column.

sales |>
  mutate(
    is_brand_name = str_detect(id, "Brand"),
    brand = if_else(is_brand_name, id, NA)
  ) |>
  fill(brand) |>
  filter(!is_brand_name) |>
  ___

Solution

Select the columns brand, id, and n.

sales |>
  mutate(
    is_brand_name = str_detect(id, "Brand"),
    brand = if_else(is_brand_name, id, NA)
  ) |>
  fill(brand) |>
  filter(!is_brand_name) |>
  select(brand, id, n)

sales

Takeaway

We should write code to manipulate data instead of opening Excel and editing the data in there because it leaves a transparent trail of any modifications done while manipulating the data. Making changes directly to an Excel file to reshape and clean data is neither reproducible nor reusable.