Mario games + data visualization

Data visualization and transformation
Data Science with R

Important

Please refer to “Meet the toolkit: Programming exercises” for information and instructions on how to interact with the programming exercise below.

Getting started

Please run the following code by clicking the green arrow just above the code chunk. When reading in the data, nothing will appear after you click the button. However, clicking the arrow ensures that your data are read in and can be used for the following programming exercise.

In this mini analysis we work with data from the openintro package in R. These are eBay auction data for the game Mario Kart for the Nintendo Wii, collected in October 2009. A key to these data can be found below:

variable name   description
id              Auction ID assigned by eBay
duration        Auction length, in days
n_bids          Number of bids
cond            Game condition, either new or used
start_pr        Start price of the auction
ship_pr         Shipping price
total_pr        Total price, which equals the auction price plus the shipping price
ship_sp         Shipping speed or method
seller_rate     The seller’s rating on eBay: the number of positive ratings minus the number of negative ratings for the seller
stock_photo     Whether the auction feature photo was a stock photo or not, either yes or no
wheels          Number of Wii wheels included in the auction. These are steering wheel attachments to make it seem as though you are actually driving in the game.
title           The title of the auction

Packages

We’ll use tidyverse for the majority of the analysis and scales for pretty plot labels later on. ggridges allows us to make ridge plots, and gridExtra allows us to arrange plots next to each other. These are ready for you to use in this programming exercise!
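Outside of this exercise environment, the packages would need to be loaded before use. Below is a minimal sketch. The last line is an assumption: it takes the data from openintro's mariokart data set and stores it under the name mario used throughout this exercise. The read-in step for this exercise may load the data differently, so column types may not exactly match the glimpse() output shown later.

```r
library(tidyverse)  # ggplot2, dplyr, and related packages
library(scales)     # helpers for plot labels
library(ggridges)   # geom_density_ridges()
library(gridExtra)  # grid.arrange() for side-by-side plots

# Assumption: the exercise data match openintro::mariokart
mario <- openintro::mariokart

dim(mario)  # number of rows and columns
```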

Get to know the data

We can use glimpse() to get an overview (or “glimpse”) of the data. Write the following code below to accomplish this task.

With your output, confirm that:

  • There are 143 rows

  • There are 12 variables (columns) in the dataset

Warning

If you receive the error Error: object ‘mario’ not found, go back and read in your data above.

glimpse(mario)
Rows: 143
Columns: 12
$ id          <dbl> 150377422259, 260483376854, 320432342985, 280405224677, 17…
$ duration    <dbl> 3, 7, 3, 3, 1, 3, 1, 1, 3, 7, 1, 1, 1, 1, 7, 7, 3, 3, 1, 7…
$ n_bids      <dbl> 20, 13, 16, 18, 20, 19, 13, 15, 29, 8, 15, 15, 13, 16, 6, …
$ cond        <chr> "new", "used", "new", "new", "new", "new", "used", "new", …
$ start_pr    <dbl> 0.99, 0.99, 0.99, 0.99, 0.01, 0.99, 0.01, 1.00, 0.99, 19.9…
$ ship_pr     <dbl> 4.00, 3.99, 3.50, 0.00, 0.00, 4.00, 0.00, 2.99, 4.00, 4.00…
$ total_pr    <dbl> 51.55, 37.04, 45.50, 44.00, 71.00, 45.00, 37.02, 53.99, 47…
$ ship_sp     <chr> "standard", "firstClass", "firstClass", "standard", "media…
$ seller_rate <dbl> 1580, 365, 998, 7, 820, 270144, 7284, 4858, 27, 201, 4858,…
$ stock_photo <chr> "yes", "yes", "no", "yes", "yes", "yes", "yes", "yes", "ye…
$ wheels      <dbl> 1, 1, 1, 1, 2, 0, 0, 2, 1, 1, 2, 2, 2, 2, 1, 0, 1, 1, 2, 2…
$ title       <chr> "~~ Wii MARIO KART &amp; WHEEL ~ NINTENDO Wii ~ BRAND NEW …

Variables of interest

The variables we’ll focus on are the following:

  • total_pr: total price of game sold plus shipping in USD
  • ship_sp: Shipping speed or method
    • firstClass
    • media
    • other
    • parcel
    • priority
    • standard
    • ups3Day
    • upsGround
  • cond: Binary variable representing the condition of the video game
    • new
    • used

Visualizing categorical data with ggplot2

First, let’s explore the variable cond. Specifically, let’s investigate how many new games were sold versus how many used games were sold by creating a bar plot. Add the correct geom_*() to the code below to make a bar plot of cond.

  ggplot(
    mario,
    aes(x = cond)
  ) +
    geom_bar()

Next, let’s fill in the bars by the shipping method each game was shipped with (ship_sp).

  ggplot(
    mario,
    aes(x = cond, fill = ship_sp)
  ) +
    geom_bar()

The code above uses fill to color the segments of the bar plot by another categorical variable. Below, we change fill to color. What happens? Why?

  ggplot(
    mario,
    aes(x = cond, color = ship_sp)
  ) +
    geom_bar()




fill sets the color of a geom’s interior, while color sets the color of its outline. Scatterplots are an exception: the default points are not treated as shapes with a separate interior, so they are colored by color rather than fill.
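The scatterplot exception can be seen in a small sketch. The data frame toy is hypothetical and is not part of the mario data. The first plot colors default points with color; the second uses shape 21, a circle that has both an outline and an interior, so fill takes effect.

```r
library(ggplot2)

# Hypothetical toy data, not taken from the mario data set
toy <- data.frame(
  x   = c(1, 2, 3, 4),
  y   = c(2, 4, 3, 5),
  grp = c("a", "a", "b", "b")
)

# Default points have no separate interior, so color does the work
p_color <- ggplot(toy, aes(x = x, y = y, color = grp)) +
  geom_point(size = 4)

# Shape 21 has an outline and an interior, so fill applies to the interior
p_fill <- ggplot(toy, aes(x = x, y = y, fill = grp)) +
  geom_point(shape = 21, size = 4, color = "black")

p_color
p_fill
```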

Count vs Proportion

Up to this point, our bar plot has counted the number of observations for each game condition, with each bar segmented by the count of each shipping method. It may be easier to compare shipping methods across conditions if we look at the proportion of each shipping method within each condition. This can be achieved by using position = "fill" in the geom_*() call. Alter the code below so that it includes position = "fill", and comment on the relationship between condition and shipping method.




  ggplot(
    mario,
    aes(x = cond, fill = ship_sp)
  ) +
    geom_bar(position = "fill")

It appears that new Mario games were mainly shipped using upsGround, while used Mario games were shipped using standard shipping.
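The proportions drawn by position = "fill" can also be computed as a table, which makes it easier to check a visual impression against numbers. This is a sketch rather than part of the original exercise. The name ship_props is ours, and the second line is an assumption that falls back on openintro's copy of the data if mario has not been read in.

```r
library(dplyr)

# Assumption: if mario is not loaded yet, use openintro's copy of the data
if (!exists("mario")) mario <- openintro::mariokart

ship_props <- mario |>
  count(cond, ship_sp) |>        # observations per condition and shipping method
  group_by(cond) |>
  mutate(prop = n / sum(n)) |>   # share of each shipping method within a condition
  ungroup() |>
  arrange(cond, desc(prop))

ship_props
```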

Relationships between numerical and categorical variables

Up to this point, we have been visualizing the relationship between categorical variables. What if we wanted to look at the relationships between different types of variables?

Boxplots

One way we can investigate the relationship between different types of variables is to create a boxplot. Below, we are going to create a boxplot using geom_boxplot() between the variables cond and total_pr. What information can we gather from the boxplots?




  ggplot(
    mario,
    aes(x = cond, y = total_pr)
  ) +
    geom_boxplot()

We can infer that the median total price is higher for new games than for used games. There appears to be one outlier in the new condition, and four outliers in the used condition. The IQR of the new condition is slightly larger than that of the used condition.
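The quantities read off a boxplot can also be computed directly. The sketch below reports the median, quartiles, and IQR of total_pr for each condition, and counts the points lying more than 1.5 × IQR beyond the quartiles, which is the usual rule for flagging boxplot outliers. The name box_stats is ours, and the fallback to openintro's data is an assumption.

```r
library(dplyr)

# Assumption: if mario is not loaded yet, use openintro's copy of the data
if (!exists("mario")) mario <- openintro::mariokart

box_stats <- mario |>
  group_by(cond) |>
  summarize(
    median_pr  = median(total_pr),
    q1         = quantile(total_pr, 0.25, names = FALSE),
    q3         = quantile(total_pr, 0.75, names = FALSE),
    iqr_pr     = IQR(total_pr),
    # points beyond 1.5 * IQR from the quartiles
    n_outliers = sum(total_pr < q1 - 1.5 * iqr_pr | total_pr > q3 + 1.5 * iqr_pr)
  )

box_stats
```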

Violin plot

A violin plot is a lot like a box plot, but also shows us information about the density of the quantitative variable. Below, we have a violin plot that again shows the relationship between the condition (cond) of the Mario game, and the total price (total_pr) of the game (cost + shipping). Describe the relationship below.

  ggplot(
    mario,
    aes(x = cond, y = total_pr)
  ) +
    geom_violin()




It appears that there is a higher density of new games at a higher price than used games. Used games appear to have two potential outliers higher than any new game.
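To compare the boxplot and the violin plot directly, the two can be placed next to each other with gridExtra. This is a sketch only. The object names p_box and p_violin are ours, and the fallback to openintro's data is an assumption.

```r
library(ggplot2)
library(gridExtra)

# Assumption: if mario is not loaded yet, use openintro's copy of the data
if (!exists("mario")) mario <- openintro::mariokart

p_box <- ggplot(mario, aes(x = cond, y = total_pr)) +
  geom_boxplot()

p_violin <- ggplot(mario, aes(x = cond, y = total_pr)) +
  geom_violin()

# Draw the two plots side by side in one row
grid.arrange(p_box, p_violin, ncol = 2)
```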

Ridge plots

Ridge plots, similar to violin plots, show the distribution of a numeric variable across the levels of a categorical variable. In order to make this plot, we will use geom_density_ridges(). Add this geom to the code below to make the ridge plots. Within this geom, set alpha equal to 0.5.

  ggplot(
    mario,
    aes(x = total_pr, y = cond, fill = cond, color = cond)
  ) +
    geom_density_ridges(alpha = 0.5)
Picking joint bandwidth of 2.68
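The scales package can make the price axis easier to read. The sketch below repeats the ridge plot and formats the x-axis labels as dollar amounts with label_dollar(). The axis titles and the object name p_ridge are our own choices, and the fallback to openintro's data is an assumption.

```r
library(ggplot2)
library(ggridges)
library(scales)

# Assumption: if mario is not loaded yet, use openintro's copy of the data
if (!exists("mario")) mario <- openintro::mariokart

p_ridge <- ggplot(
  mario,
  aes(x = total_pr, y = cond, fill = cond, color = cond)
) +
  geom_density_ridges(alpha = 0.5) +
  scale_x_continuous(labels = label_dollar()) +  # show axis labels as dollar amounts
  labs(x = "Total price (USD)", y = "Condition")

p_ridge
```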

Working with dplyr

dplyr is a grammar of data manipulation that helps you work with data. We are going to explore the following dplyr verbs on the mario data set:

  • select()
  • filter()
  • group_by() and summarize()
  • mutate()

In this demonstration, we are going to explore whether the mean total price of a used game differs depending on whether the game was sold with or without a stock photo. Let’s assume that we are only interested in used games.

select()

First, let’s use select() to keep only the columns of the data set we are interested in: total_pr, cond, and stock_photo.

mario |>
  select(total_pr, cond, stock_photo)

Note that you could also reference the columns by position, or drop the other columns using either names or positions. In both cases the columns come back in their original order: cond, total_pr, stock_photo.

mario |>
  select(4,7,10)
# A tibble: 143 × 3
   cond  total_pr stock_photo
   <chr>    <dbl> <chr>      
 1 new       51.6 yes        
 2 used      37.0 yes        
 3 new       45.5 no         
 4 new       44   yes        
 5 new       71   yes        
 6 new       45   yes        
 7 used      37.0 yes        
 8 new       54.0 yes        
 9 used      47   yes        
10 used      50   no         
# ℹ 133 more rows

mario |>
  select(-c(1:3,5:6,8:9,11:12))
# A tibble: 143 × 3
   cond  total_pr stock_photo
   <chr>    <dbl> <chr>      
 1 new       51.6 yes        
 2 used      37.0 yes        
 3 new       45.5 no         
 4 new       44   yes        
 5 new       71   yes        
 6 new       45   yes        
 7 used      37.0 yes        
 8 new       54.0 yes        
 9 used      47   yes        
10 used      50   no         
# ℹ 133 more rows
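Dropping columns by name works the same way as dropping them by position, and is usually easier to read. This is a sketch: the column names come from the data key above, the object name by_name is ours, and the fallback to openintro's data is an assumption.

```r
library(dplyr)

# Assumption: if mario is not loaded yet, use openintro's copy of the data
if (!exists("mario")) mario <- openintro::mariokart

# Drop every column except cond, total_pr, and stock_photo, by name
by_name <- mario |>
  select(-c(id, duration, n_bids, start_pr, ship_pr,
            ship_sp, seller_rate, wheels, title))

names(by_name)
```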

filter()

The function filter() acts on the rows of the data set, and subsets the data set based on a condition. Let’s add on to our code and use the filter() function to subset the data to only look at used games.

mario |>
  select(total_pr, cond, stock_photo) |>
  filter(cond == "used")
# A tibble: 84 × 3
   total_pr cond  stock_photo
      <dbl> <chr> <chr>      
 1     37.0 used  yes        
 2     37.0 used  yes        
 3     47   used  yes        
 4     50   used  no         
 5     43.3 used  yes        
 6     46   used  yes        
 7    327.  used  no         
 8     31   used  yes        
 9     46.5 used  yes        
10     34.5 used  yes        
# ℹ 74 more rows
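filter() also accepts several conditions at once; conditions separated by commas must all be true for a row to be kept. As a sketch, the code below keeps only used games that were listed with a stock photo. The object name used_stock is ours, and the fallback to openintro's data is an assumption.

```r
library(dplyr)

# Assumption: if mario is not loaded yet, use openintro's copy of the data
if (!exists("mario")) mario <- openintro::mariokart

used_stock <- mario |>
  select(total_pr, cond, stock_photo) |>
  filter(cond == "used", stock_photo == "yes")  # both conditions must hold

used_stock
```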

group_by() and summarize()

Now that we have a subset of our data set that is relevant to our question of interest, we can calculate the mean total price using group_by() and summarize(). Note that group_by() groups our data frame by the specified variable so that summarize() calculates summary statistics (like the mean) at the group level instead of for the entire data frame. Report which mean is higher. Is this the result you expected? Why or why not?




mario |>
  select(total_pr, cond, stock_photo) |>
  filter(cond == "used") |>
  group_by(stock_photo) |>
  summarize(mean_pr = mean(total_pr))
# A tibble: 2 × 2
  stock_photo mean_pr
  <chr>         <dbl>
1 no             54.8
2 yes            42.0

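The mean total price is higher for used games listed without a stock photo (54.8 USD) than for used games listed with one (42.0 USD). This is not what I expected; I would have expected a stock photo to increase the price. The difference may be due to other variables in the data set that our calculation does not account for, or to a few unusually expensive listings: the filtered data above include a used game without a stock photo with a total price of about 327 USD, which pulls the mean for that group upward.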
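One way to check whether a few expensive listings drive the difference in means is to report the group size, median, and maximum alongside the mean, since the median is much less sensitive to extreme values. This is a sketch rather than part of the original exercise. The name photo_summary is ours, and the fallback to openintro's data is an assumption.

```r
library(dplyr)

# Assumption: if mario is not loaded yet, use openintro's copy of the data
if (!exists("mario")) mario <- openintro::mariokart

photo_summary <- mario |>
  filter(cond == "used") |>
  group_by(stock_photo) |>
  summarize(
    n         = n(),               # number of used games in each group
    mean_pr   = mean(total_pr),
    median_pr = median(total_pr),  # less affected by extreme prices
    max_pr    = max(total_pr)
  )

photo_summary
```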

Extension: mutate()

The total price reported in our data set is in US dollars (USD). At the time of writing this exercise, one US dollar is worth 1.37 Canadian dollars (CAD). Suppose we wanted to answer the same question as above, but in CAD instead of USD. We can use mutate() to create a new cad_total_pr column before taking the mean of cad_total_pr by stock photo. Recreate your table above, but in CAD.

mario |>
  select(total_pr, cond, stock_photo) |>
  mutate(cad_total_pr = total_pr * 1.37) |>
  filter(cond == "used") |>
  group_by(stock_photo) |>
  summarize(mean_pr_cad = mean(cad_total_pr))
# A tibble: 2 × 2
  stock_photo mean_pr_cad
  <chr>             <dbl>
1 no                 75.0
2 yes                57.5
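Because the conversion multiplies every price by the same constant, converting before or after taking the mean gives the same answer. The sketch below checks this. The names usd_to_cad, convert_first, and convert_last are ours, and the fallback to openintro's data is an assumption.

```r
library(dplyr)

# Assumption: if mario is not loaded yet, use openintro's copy of the data
if (!exists("mario")) mario <- openintro::mariokart

usd_to_cad <- 1.37  # exchange rate quoted in this exercise

# Convert each price to CAD, then average
convert_first <- mario |>
  filter(cond == "used") |>
  mutate(cad_total_pr = total_pr * usd_to_cad) |>
  group_by(stock_photo) |>
  summarize(mean_pr_cad = mean(cad_total_pr))

# Average in USD, then convert the means to CAD
convert_last <- mario |>
  filter(cond == "used") |>
  group_by(stock_photo) |>
  summarize(mean_pr = mean(total_pr)) |>
  mutate(mean_pr_cad = mean_pr * usd_to_cad)

all.equal(convert_first$mean_pr_cad, convert_last$mean_pr_cad)
```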