Joining data



Data tidying and importing

Data Science with R

Setup

Task, goal, and tools


have multiple data frames

want to join them together

using tools from the tidyverse

library(tidyverse)

Data: Women in science

Information on 10 women in science who changed the world:

name
Ada Lovelace
Marie Curie
Janaki Ammal
Chien-Shiung Wu
Katherine Johnson
Rosalind Franklin
Vera Rubin
Gladys West
Flossie Wong-Staal
Jennifer Doudna

Input 1: professions

professions
# A tibble: 10 × 2
   name               profession                        
   <chr>              <chr>                             
 1 Ada Lovelace       Mathematician                     
 2 Marie Curie        Physicist and Chemist             
 3 Janaki Ammal       Botanist                          
 4 Chien-Shiung Wu    Physicist                         
 5 Katherine Johnson  Mathematician                     
 6 Rosalind Franklin  Chemist                           
 7 Vera Rubin         Astronomer                        
 8 Gladys West        Mathematician                     
 9 Flossie Wong-Staal Virologist and Molecular Biologist
10 Jennifer Doudna    Biochemist                        

Input 2: dates

dates
# A tibble: 8 × 3
  name               birth_year death_year
  <chr>                   <dbl>      <dbl>
1 Janaki Ammal             1897       1984
2 Chien-Shiung Wu          1912       1997
3 Katherine Johnson        1918       2020
4 Rosalind Franklin        1920       1958
5 Vera Rubin               1928       2016
6 Gladys West              1930         NA
7 Flossie Wong-Staal       1947       2020
8 Jennifer Doudna          1964         NA

Input 3: works

works
# A tibble: 9 × 2
  name               known_for                              
  <chr>              <chr>                                  
1 Ada Lovelace       first computer algorithm               
2 Marie Curie        theory of radioactivity,  discovery of…
3 Janaki Ammal       hybrid species, biodiversity protection
4 Chien-Shiung Wu    confim and refine theory of radioactiv…
5 Katherine Johnson  calculations of orbital mechanics crit…
6 Vera Rubin         existence of dark matter               
7 Gladys West        mathematical modeling of the shape of …
8 Flossie Wong-Staal first scientist to clone HIV and creat…
9 Jennifer Doudna    one of the primary developers of CRISP…

Desired output

# A tibble: 10 × 5
   name           profession birth_year death_year known_for
   <chr>          <chr>           <dbl>      <dbl> <chr>    
 1 Ada Lovelace   Mathemati…         NA         NA first co…
 2 Marie Curie    Physicist…         NA         NA theory o…
 3 Janaki Ammal   Botanist         1897       1984 hybrid s…
 4 Chien-Shiung … Physicist        1912       1997 confim a…
 5 Katherine Joh… Mathemati…       1918       2020 calculat…
 6 Rosalind Fran… Chemist          1920       1958 <NA>     
 7 Vera Rubin     Astronomer       1928       2016 existenc…
 8 Gladys West    Mathemati…       1930         NA mathemat…
 9 Flossie Wong-… Virologis…       1947       2020 first sc…
10 Jennifer Doud… Biochemist       1964         NA one of t…

Inputs, reminder

names(professions)
[1] "name"       "profession"
nrow(professions)
[1] 10
names(dates)
[1] "name"       "birth_year" "death_year"
nrow(dates)
[1] 8
names(works)
[1] "name"      "known_for"
nrow(works)
[1] 9

Joining data frames

Joining data frames

something_join(x, y)
  • left_join(): all rows from x
  • right_join(): all rows from y
  • full_join(): all rows from both x and y
  • semi_join(): all rows from x where there are matching values in y, keeping just columns from x
  • inner_join(): all rows from x where there are matching values in y, return all combination of multiple matches in the case of multiple matches
  • anti_join(): return all rows from x where there are not matching values in y, never duplicate rows of x

Setup

For the next few slides…

x <- tibble(
  id = c(1, 2, 3),
  value_x = c("x1", "x2", "x3")
  )

x
# A tibble: 3 × 2
     id value_x
  <dbl> <chr>  
1     1 x1     
2     2 x2     
3     3 x3     
y <- tibble(
  id = c(1, 2, 4),
  value_y = c("y1", "y2", "y4")
  )

y
# A tibble: 3 × 2
     id value_y
  <dbl> <chr>  
1     1 y1     
2     2 y2     
3     4 y4     

left_join()

left_join(x, y)
Joining with `by = join_by(id)`
# A tibble: 3 × 3
     id value_x value_y
  <dbl> <chr>   <chr>  
1     1 x1      y1     
2     2 x2      y2     
3     3 x3      <NA>   

left_join()

professions |>
  left_join(dates)
Joining with `by = join_by(name)`
# A tibble: 10 × 4
   name               profession       birth_year death_year
   <chr>              <chr>                 <dbl>      <dbl>
 1 Ada Lovelace       Mathematician            NA         NA
 2 Marie Curie        Physicist and C…         NA         NA
 3 Janaki Ammal       Botanist               1897       1984
 4 Chien-Shiung Wu    Physicist              1912       1997
 5 Katherine Johnson  Mathematician          1918       2020
 6 Rosalind Franklin  Chemist                1920       1958
 7 Vera Rubin         Astronomer             1928       2016
 8 Gladys West        Mathematician          1930         NA
 9 Flossie Wong-Staal Virologist and …       1947       2020
10 Jennifer Doudna    Biochemist             1964         NA

right_join()

right_join(x, y)
Joining with `by = join_by(id)`
# A tibble: 3 × 3
     id value_x value_y
  <dbl> <chr>   <chr>  
1     1 x1      y1     
2     2 x2      y2     
3     4 <NA>    y4     

left_join()

professions |>
  right_join(dates)
Joining with `by = join_by(name)`
# A tibble: 8 × 4
  name               profession        birth_year death_year
  <chr>              <chr>                  <dbl>      <dbl>
1 Janaki Ammal       Botanist                1897       1984
2 Chien-Shiung Wu    Physicist               1912       1997
3 Katherine Johnson  Mathematician           1918       2020
4 Rosalind Franklin  Chemist                 1920       1958
5 Vera Rubin         Astronomer              1928       2016
6 Gladys West        Mathematician           1930         NA
7 Flossie Wong-Staal Virologist and M…       1947       2020
8 Jennifer Doudna    Biochemist              1964         NA

full_join()

full_join(x, y)
Joining with `by = join_by(id)`
# A tibble: 4 × 3
     id value_x value_y
  <dbl> <chr>   <chr>  
1     1 x1      y1     
2     2 x2      y2     
3     3 x3      <NA>   
4     4 <NA>    y4     

left_join()

professions |>
  full_join(dates)
Joining with `by = join_by(name)`
# A tibble: 10 × 4
   name               profession       birth_year death_year
   <chr>              <chr>                 <dbl>      <dbl>
 1 Ada Lovelace       Mathematician            NA         NA
 2 Marie Curie        Physicist and C…         NA         NA
 3 Janaki Ammal       Botanist               1897       1984
 4 Chien-Shiung Wu    Physicist              1912       1997
 5 Katherine Johnson  Mathematician          1918       2020
 6 Rosalind Franklin  Chemist                1920       1958
 7 Vera Rubin         Astronomer             1928       2016
 8 Gladys West        Mathematician          1930         NA
 9 Flossie Wong-Staal Virologist and …       1947       2020
10 Jennifer Doudna    Biochemist             1964         NA

inner_join()

inner_join(x, y)
Joining with `by = join_by(id)`
# A tibble: 2 × 3
     id value_x value_y
  <dbl> <chr>   <chr>  
1     1 x1      y1     
2     2 x2      y2     

inner_join()

professions |>
  inner_join(dates)
Joining with `by = join_by(name)`
# A tibble: 8 × 4
  name               profession        birth_year death_year
  <chr>              <chr>                  <dbl>      <dbl>
1 Janaki Ammal       Botanist                1897       1984
2 Chien-Shiung Wu    Physicist               1912       1997
3 Katherine Johnson  Mathematician           1918       2020
4 Rosalind Franklin  Chemist                 1920       1958
5 Vera Rubin         Astronomer              1928       2016
6 Gladys West        Mathematician           1930         NA
7 Flossie Wong-Staal Virologist and M…       1947       2020
8 Jennifer Doudna    Biochemist              1964         NA

semi_join()

semi_join(x, y)
Joining with `by = join_by(id)`
# A tibble: 2 × 2
     id value_x
  <dbl> <chr>  
1     1 x1     
2     2 x2     

semi_join()

professions |>
  semi_join(dates)
Joining with `by = join_by(name)`
# A tibble: 8 × 2
  name               profession                        
  <chr>              <chr>                             
1 Janaki Ammal       Botanist                          
2 Chien-Shiung Wu    Physicist                         
3 Katherine Johnson  Mathematician                     
4 Rosalind Franklin  Chemist                           
5 Vera Rubin         Astronomer                        
6 Gladys West        Mathematician                     
7 Flossie Wong-Staal Virologist and Molecular Biologist
8 Jennifer Doudna    Biochemist                        

anti_join()

anti_join(x, y)
Joining with `by = join_by(id)`
# A tibble: 1 × 2
     id value_x
  <dbl> <chr>  
1     3 x3     

anti_join()

professions |>
  anti_join(dates)
Joining with `by = join_by(name)`
# A tibble: 2 × 2
  name         profession           
  <chr>        <chr>                
1 Ada Lovelace Mathematician        
2 Marie Curie  Physicist and Chemist

Putting it altogether

professions |>
  left_join(dates) |>
  left_join(works)
Joining with `by = join_by(name)`
Joining with `by = join_by(name)`
# A tibble: 10 × 5
   name           profession birth_year death_year known_for
   <chr>          <chr>           <dbl>      <dbl> <chr>    
 1 Ada Lovelace   Mathemati…         NA         NA first co…
 2 Marie Curie    Physicist…         NA         NA theory o…
 3 Janaki Ammal   Botanist         1897       1984 hybrid s…
 4 Chien-Shiung … Physicist        1912       1997 confim a…
 5 Katherine Joh… Mathemati…       1918       2020 calculat…
 6 Rosalind Fran… Chemist          1920       1958 <NA>     
 7 Vera Rubin     Astronomer       1928       2016 existenc…
 8 Gladys West    Mathemati…       1930         NA mathemat…
 9 Flossie Wong-… Virologis…       1947       2020 first sc…
10 Jennifer Doud… Biochemist       1964         NA one of t…