Flights and pipes (Complete)

Data visualization and transformation
Data Science with R

Introduction

Have you ever had a flight delayed? How common is that? Which airlines are delayed more than others?

Packages

We will use the tidyverse packages for data wrangling and visualization and the nycflights13 package for the data.

library(tidyverse)
library(nycflights13)

Data

The dataset we will explore is called flights. Let’s glimpse() at it.

glimpse(flights)

Rows: 336,776
Columns: 19
$ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
$ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
$ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
$ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
$ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
$ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
$ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
$ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
$ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
$ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
$ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
$ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
$ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
$ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
$ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…

Exploring flights

Exercise 1

Fill in the blanks:

The flights data frame has 336776 rows. Each row represents a flight out of one of three NYC airports in 2013.

Exercise 2

What are the names of the variables in flights.

names(flights)

 [1] "year"           "month"          "day"            "dep_time"      
 [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
 [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
[13] "origin"         "dest"           "air_time"       "distance"      
[17] "hour"           "minute"         "time_hour"

Exercise 3 - `select()`

Make a data frame that only contains the variables dep_delay and arr_delay.

flights |>
  select(dep_delay, arr_delay)

# A tibble: 336,776 × 2
   dep_delay arr_delay
       <dbl>     <dbl>
 1         2        11
 2         4        20
 3         2        33
 4        -1       -18
 5        -6       -25
 6        -4        12
 7        -5        19
 8        -3       -14
 9        -3        -8
10        -2         8
# ℹ 336,766 more rows

Make a data frame that keeps every variable except dep_delay.

flights |>
  select(!dep_delay)

# A tibble: 336,776 × 18
    year month   day dep_time sched_dep_time arr_time sched_arr_time arr_delay
   <int> <int> <int>    <int>          <int>    <int>          <int>     <dbl>
 1  2013     1     1      517            515      830            819        11
 2  2013     1     1      533            529      850            830        20
 3  2013     1     1      542            540      923            850        33
 4  2013     1     1      544            545     1004           1022       -18
 5  2013     1     1      554            600      812            837       -25
 6  2013     1     1      554            558      740            728        12
 7  2013     1     1      555            600      913            854        19
 8  2013     1     1      557            600      709            723       -14
 9  2013     1     1      557            600      838            846        -8
10  2013     1     1      558            600      753            745         8
# ℹ 336,766 more rows
# ℹ 10 more variables: carrier <chr>, flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>

Make a data frame that includes all variables between year through dep_delay (inclusive). These are all variables that provide information about the departure of each flight.

flights |>
  select(year:dep_delay)

# A tibble: 336,776 × 6
    year month   day dep_time sched_dep_time dep_delay
   <int> <int> <int>    <int>          <int>     <dbl>
 1  2013     1     1      517            515         2
 2  2013     1     1      533            529         4
 3  2013     1     1      542            540         2
 4  2013     1     1      544            545        -1
 5  2013     1     1      554            600        -6
 6  2013     1     1      554            558        -4
 7  2013     1     1      555            600        -5
 8  2013     1     1      557            600        -3
 9  2013     1     1      557            600        -3
10  2013     1     1      558            600        -2
# ℹ 336,766 more rows

Use the select helper contains() to make a data frame that includes the variables associated with the arrival, i.e., contains the string "arr_" in the name.

flights |>
  select(contains("arr_"))

# A tibble: 336,776 × 3
   arr_time sched_arr_time arr_delay
      <int>          <int>     <dbl>
 1      830            819        11
 2      850            830        20
 3      923            850        33
 4     1004           1022       -18
 5      812            837       -25
 6      740            728        12
 7      913            854        19
 8      709            723       -14
 9      838            846        -8
10      753            745         8
# ℹ 336,766 more rows

Exercise 4 - `slice()`

Display the first five rows of the flights data frame.

flights |>
  slice(1:5)

# A tibble: 5 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
4  2013     1     1      544            545        -1     1004           1022
5  2013     1     1      554            600        -6      812            837
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Display the last two rows of the flights data frame.

flights |>
  slice_tail(n = 2)

# A tibble: 2 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     9    30       NA           1159        NA       NA           1344
2  2013     9    30       NA            840        NA       NA           1020
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Exercise 5 - `arrange()`

Let’s arrange the data by departure delay, so the flights with the shortest departure delays will be at the top of the data frame.

flights |>
  arrange(dep_delay)

# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013    12     7     2040           2123       -43       40           2352
 2  2013     2     3     2022           2055       -33     2240           2338
 3  2013    11    10     1408           1440       -32     1549           1559
 4  2013     1    11     1900           1930       -30     2233           2243
 5  2013     1    29     1703           1730       -27     1947           1957
 6  2013     8     9      729            755       -26     1002            955
 7  2013    10    23     1907           1932       -25     2143           2143
 8  2013     3    30     2030           2055       -25     2213           2250
 9  2013     3     2     1431           1455       -24     1601           1631
10  2013     5     5      934            958       -24     1225           1309
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

What does it mean for the dep_delay to have a negative value?

Flight took off 43 minutes earlier than it was scheduled.

Arrange the data by descending departure delay, so the flights with the longest departure delays will be at the top.

flights |>
  arrange(desc(dep_delay))

# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     9      641            900      1301     1242           1530
 2  2013     6    15     1432           1935      1137     1607           2120
 3  2013     1    10     1121           1635      1126     1239           1810
 4  2013     9    20     1139           1845      1014     1457           2210
 5  2013     7    22      845           1600      1005     1044           1815
 6  2013     4    10     1100           1900       960     1342           2211
 7  2013     3    17     2321            810       911      135           1020
 8  2013     6    27      959           1900       899     1236           2226
 9  2013     7    22     2257            759       898      121           1026
10  2013    12     5      756           1700       896     1058           2020
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Create a data frame that only includes the plane tail number (tailnum), carrier (carrier), and departure delay for the flight with the longest departure delay. What is the plane tail number (tailnum) for this flight?

flights |>
  arrange(desc(dep_delay)) |>
  slice_head(n = 1) |>
  select(tailnum, carrier, dep_delay)

# A tibble: 1 × 3
  tailnum carrier dep_delay
  <chr>   <chr>       <dbl>
1 N384HA  HA           1301

Exercise 6 - `filter()`

Filter for all rows where the destination airport is RDU.

flights |>
  filter(dest == "RDU")

# A tibble: 8,163 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      800            810       -10      949            955
 2  2013     1     1      832            840        -8     1006           1030
 3  2013     1     1      851            851         0     1032           1036
 4  2013     1     1      917            920        -3     1052           1108
 5  2013     1     1     1024           1030        -6     1204           1215
 6  2013     1     1     1127           1129        -2     1303           1309
 7  2013     1     1     1157           1205        -8     1342           1345
 8  2013     1     1     1240           1235         5     1415           1415
 9  2013     1     1     1317           1325        -8     1454           1505
10  2013     1     1     1449           1450        -1     1651           1640
# ℹ 8,153 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Filter for all rows where the destination airport is RDU and the arrival delay is less than 0.

flights |>
  filter(dest == "RDU" & arr_delay < 0)

# A tibble: 4,232 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      800            810       -10      949            955
 2  2013     1     1      832            840        -8     1006           1030
 3  2013     1     1      851            851         0     1032           1036
 4  2013     1     1      917            920        -3     1052           1108
 5  2013     1     1     1024           1030        -6     1204           1215
 6  2013     1     1     1127           1129        -2     1303           1309
 7  2013     1     1     1157           1205        -8     1342           1345
 8  2013     1     1     1317           1325        -8     1454           1505
 9  2013     1     1     1505           1510        -5     1654           1655
10  2013     1     1     1800           1800         0     1945           1951
# ℹ 4,222 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Describe what the code is doing in words.

Start with the flights data frame, and then filter for rows where destination is RDU or GSO and arrival delay or departure delay is less than 0.

flights |>
  filter(
    dest %in% c("RDU", "GSO"),
    arr_delay < 0 | dep_delay < 0
  )

# A tibble: 6,203 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      800            810       -10      949            955
 2  2013     1     1      832            840        -8     1006           1030
 3  2013     1     1      851            851         0     1032           1036
 4  2013     1     1      917            920        -3     1052           1108
 5  2013     1     1     1024           1030        -6     1204           1215
 6  2013     1     1     1127           1129        -2     1303           1309
 7  2013     1     1     1157           1205        -8     1342           1345
 8  2013     1     1     1317           1325        -8     1454           1505
 9  2013     1     1     1449           1450        -1     1651           1640
10  2013     1     1     1505           1510        -5     1654           1655
# ℹ 6,193 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Hint: Logical operators in R:

operator	definition
`<`	is less than?
`<=`	is less than or equal to?
`>`	is greater than?
`>=`	is greater than or equal to?
`==`	is exactly equal to?
`!=`	is not equal to?
`x & y`	is x AND y?
`x \| y`	is x OR y?
`is.na(x)`	is x NA?
`!is.na(x)`	is x not NA?
`x %in% y`	is x in y?
`!(x %in% y)`	is x not in y?
`!x`	is not x? (only makes sense if `x` is `TRUE` or `FALSE`)

Exercise 7 - `count()`

Create a frequency table of the destination locations for flights from New York.

flights |>
  count(dest, sort = TRUE)

# A tibble: 105 × 2
   dest      n
   <chr> <int>
 1 ORD   17283
 2 ATL   17215
 3 LAX   16174
 4 BOS   15508
 5 MCO   14082
 6 CLT   14064
 7 SFO   13331
 8 FLL   12055
 9 MIA   11728
10 DCA    9705
# ℹ 95 more rows

In which month was there the fewest number of flights? How many flights were there in that month?

flights |>
  count(month) |>
  arrange(n)

# A tibble: 12 × 2
   month     n
   <int> <int>
 1     2 24951
 2     1 27004
 3    11 27268
 4     9 27574
 5    12 28135
 6     6 28243
 7     4 28330
 8     5 28796
 9     3 28834
10    10 28889
11     8 29327
12     7 29425

On which date (month + day) was there the largest number of flights? How many flights were there on that day?

flights |>
  count(month, day, sort = TRUE)

# A tibble: 365 × 3
   month   day     n
   <int> <int> <int>
 1    11    27  1014
 2     7    11  1006
 3     7     8  1004
 4     7    10  1004
 5    12     2  1004
 6     7    18  1003
 7     7    25  1003
 8     7    12  1002
 9     7     9  1001
10     7    17  1001
# ℹ 355 more rows

Exercise 8 - `mutate()`

Convert air_time (minutes in the air) to hours and then create a new variable, mph, the miles per hour of the flight.

flights |>
  mutate(
    air_time_hr = air_time / 60,
    mph = distance / air_time_hr
  )

# A tibble: 336,776 × 21
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 13 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>, air_time_hr <dbl>, mph <dbl>

First, count the number of flights each month, and then calculate the proportion of flights in each month. What proportion of flights take place in July?

flights |>
  count(month) |>
  mutate(prop = n / sum(n)) |>
  filter(month == 7)

# A tibble: 1 × 3
  month     n   prop
  <int> <int>  <dbl>
1     7 29425 0.0874

Exercise 9 - `summarize()`

Find mean arrival delay for all flights.

flights |>
  summarize(mean_arr_delay = mean(arr_delay, na.rm = TRUE))

# A tibble: 1 × 1
  mean_arr_delay
           <dbl>
1           6.90

Exercise 10 - `group_by()`

Find mean arrival delay for for each month.

flights |>
  group_by(month) |>
  summarize(mean_arr_delay = mean(arr_delay, na.rm = TRUE))

# A tibble: 12 × 2
   month mean_arr_delay
   <int>          <dbl>
 1     1          6.13 
 2     2          5.61 
 3     3          5.81 
 4     4         11.2  
 5     5          3.52 
 6     6         16.5  
 7     7         16.7  
 8     8          6.04 
 9     9         -4.02 
10    10         -0.167
11    11          0.461
12    12         14.9

What is the median departure delay for each airports around NYC (origin)? Which airport has the shortest median departure delay?

flights |>
  group_by(origin) |>
  summarize(median_dep_delay = median(dep_delay, na.rm = TRUE))

# A tibble: 3 × 2
  origin median_dep_delay
  <chr>             <dbl>
1 EWR                  -1
2 JFK                  -1
3 LGA                  -3

Introduction

Packages

Data

Exploring flights

Exercise 1

Exercise 2

Exercise 3 - select()

Exercise 4 - slice()

Exercise 5 - arrange()

Exercise 6 - filter()

Exercise 7 - count()

Exercise 8 - mutate()

Exercise 9 - summarize()

Exercise 10 - group_by()