Data types



Data tidying and importing

Data Science with R

Why should you care about data types?

Example: Cat lovers

A survey asked respondents their name and number of cats. The instructions said to enter the number of cats as a numerical value.

library(tidyverse) # provides read_csv(), summarize(), mutate(), glimpse()

cat_lovers <- read_csv("data/cat-lovers.csv")
cat_lovers
# A tibble: 60 × 2
   name           number_of_cats
   <chr>          <chr>         
 1 Bernice Warren 0             
 2 Woodrow Stone  0             
 3 Willie Bass    1             
 4 Tyrone Estrada 3             
 5 Alex Daniels   3             
 6 Jane Bates     2             
 7 Latoya Simpson 1             
 8 Darin Woods    1             
 9 Agnes Cobb     0             
10 Tabitha Grant  0             
# ℹ 50 more rows

Oh why won’t you work?!

cat_lovers |>
  summarize(mean_cats = mean(number_of_cats))
Warning: There was 1 warning in `summarize()`.
ℹ In argument: `mean_cats = mean(number_of_cats)`.
Caused by warning in `mean.default()`:
! argument is not numeric or logical: returning NA
# A tibble: 1 × 1
  mean_cats
      <dbl>
1        NA

Let’s read the docs!

?mean

Documentation for the mean function.

Oh why won’t you still work??!!

cat_lovers |>
  summarize(mean_cats = mean(number_of_cats, na.rm = TRUE))
Warning: There was 1 warning in `summarize()`.
ℹ In argument: `mean_cats = mean(number_of_cats, na.rm =
  TRUE)`.
Caused by warning in `mean.default()`:
! argument is not numeric or logical: returning NA
# A tibble: 1 × 1
  mean_cats
      <dbl>
1        NA

Take a breath and look at your data

What is the type of the number_of_cats variable?

glimpse(cat_lovers)
Rows: 60
Columns: 2
$ name           <chr> "Bernice Warren", "Woodrow Stone", …
$ number_of_cats <chr> "0", "0", "1", "3", "3", "2", "1", …

Let’s take another look

name                 number_of_cats
Bernice Warren       0
Woodrow Stone        0
Willie Bass          1
Tyrone Estrada       3
Alex Daniels         3
Jane Bates           2
Latoya Simpson       1
Darin Woods          1
Agnes Cobb           0
Tabitha Grant        0
Perry Cross          0
Wanda Silva          0
Alicia Sims          1
Emily Logan          3
Woodrow Elliott      3
Brent Copeland       2
Pedro Carlson        1
Patsy Luna           1
Brett Robbins        0
Oliver George        0

Let’s take another look

name                 number_of_cats
Calvin Perry         1
Lora Gutierrez       1
Charlotte Sparks     0
Earl Mack            0
Leslie Wade          4
Santiago Barker      0
Jose Bell            0
Lynda Smith          0
Bradford Marshall    0
Irving Miller        0
Caroline Simpson     0
Frances Welch        0
Melba Jenkins        0
Veronica Morales     0
Juanita Cunningham   0
Maurice Howard       0
Teri Pierce          0
Phil Franklin        0
Jan Zimmerman        0
Leslie Price         0

Let’s take another look

name                 number_of_cats
Bessie Patterson     0
Ethel Wolfe          0
Naomi Wright         1
Sadie Frank          3
Lonnie Cannon        3
Tony Garcia          2
Darla Newton         1
Ginger Clark         1.5 - honestly I think one of my cats is half human
Lionel Campbell      0
Florence Klein       0
Harriet Leonard      1
Terrence Harrington  0
Travis Garner        1
Doug Bass            three
Pat Norris           1
Dawn Young           1
Shari Alvarez        1
Tamara Robinson      0
Megan Morgan         0
Kara Obrien          2

Sometimes you might need to babysit your respondents

cat_lovers |>
  mutate(
    number_of_cats = case_when(
      name == "Ginger Clark" ~ 2,
      name == "Doug Bass"    ~ 3,
      .default = as.numeric(number_of_cats)
      )
    ) |>
  summarize(mean_cats = mean(number_of_cats))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `number_of_cats = case_when(...)`.
Caused by warning in `vec_case_when()`:
! NAs introduced by coercion
# A tibble: 1 × 1
  mean_cats
      <dbl>
1     0.833

You always need to respect data types

cat_lovers |>
  mutate(
    number_of_cats = case_when(
      name == "Ginger Clark" ~ "2",
      name == "Doug Bass"    ~ "3",
      .default = number_of_cats
      ),
    number_of_cats = as.numeric(number_of_cats)
    ) |>
  summarize(mean_cats = mean(number_of_cats))
# A tibble: 1 × 1
  mean_cats
      <dbl>
1     0.833

Moral of the story

  • If your data does not behave how you expect it to, type coercion upon reading in the data might be the reason.
  • Go in and investigate your data, apply the fix, save your data, live happily ever after.
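
One way to "go in and investigate" is to list the entries that fail to convert to a number. The following is a minimal sketch in base R; the `responses` vector is a hypothetical stand-in for the survey column, and the object names are illustrative rather than part of the original analysis.

```r
# Hypothetical responses, mimicking the kinds of entries seen in the survey
responses <- c("0", "1", "three",
               "1.5 - honestly I think one of my cats is half human", "2")

# as.numeric() returns NA (with a warning) for strings it cannot parse;
# the warning is suppressed here because the NAs are inspected on purpose
parsed <- suppressWarnings(as.numeric(responses))

# Entries that were present in the raw data but failed to parse
problem_entries <- responses[is.na(parsed) & !is.na(responses)]
problem_entries
```

Printing `problem_entries` shows exactly which raw values need a manual fix before the column can be treated as numeric.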

Data types

Data types in R

  • logical
  • double
  • integer
  • character
  • and some more, but we won’t be focusing on those

Logical & character

logical - boolean values TRUE and FALSE


typeof(TRUE)
[1] "logical"

character - character strings



typeof("hello")
[1] "character"

Double & integer

double - floating point numerical values (default numerical type)


typeof(1.335)
[1] "double"
typeof(7)
[1] "double"

integer - integer numerical values (indicated with an L)


typeof(7L)
[1] "integer"
typeof(1:3)
[1] "integer"
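
The distinction between the two numeric types also shows up in arithmetic. As a small illustrative sketch (the values are arbitrary), the result type depends on the operands and the operator:

```r
# Arithmetic on two integers stays integer
int_sum <- typeof(7L + 1L)     # "integer"

# Mixing an integer with a double gives a double
mixed_sum <- typeof(7L + 1)    # "double"

# Regular division always returns a double, even for two integers
division <- typeof(7L / 2L)    # "double"

# Integer division of two integers returns an integer
int_division <- typeof(7L %/% 2L) # "integer"

# Both integers and doubles count as numeric
both_numeric <- is.numeric(7L) && is.numeric(7) # TRUE

c(int_sum, mixed_sum, division, int_division)
```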

Concatenation

Vectors can be constructed using the c() function.

c(1, 2, 3)
[1] 1 2 3
c("Hello", "World!")
[1] "Hello"  "World!"
c(c("hi", "hello"), c("bye", "jello"))
[1] "hi"    "hello" "bye"   "jello"

Converting between types

with intention…

x <- 1:3
x
[1] 1 2 3
typeof(x)
[1] "integer"
y <- as.character(x)
y
[1] "1" "2" "3"
typeof(y)
[1] "character"

Converting between types

with intention…

x <- c(TRUE, FALSE)
x
[1]  TRUE FALSE
typeof(x)
[1] "logical"
y <- as.numeric(x)
y
[1] 1 0
typeof(y)
[1] "double"

Converting between types

without intention…

c(1, "Hello")
[1] "1"     "Hello"

R will happily convert between various types without complaint when different types of data are concatenated in a vector, and that’s not always a great thing!

c(FALSE, 3L)
[1] 0 3
c(1.2, 3L)
[1] 1.2 3.0
c(2L, "two")
[1] "2"   "two"
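
The examples above follow a consistent ordering: when types are mixed in c(), the result takes the most flexible type present, going from logical to integer to double to character. A short sketch with arbitrary values:

```r
# Each added element pushes the vector to a more flexible type
step1 <- typeof(c(TRUE, 1L))            # "integer"
step2 <- typeof(c(TRUE, 1L, 2.5))       # "double"
step3 <- typeof(c(TRUE, 1L, 2.5, "a"))  # "character"

# With a character element present, every element becomes a string
mixed <- c(TRUE, 1L, 2.5, "a")
mixed
# [1] "TRUE" "1"    "2.5"  "a"
```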

Explicit vs. implicit coercion

Let’s give formal names to what we’ve seen so far:

  • Explicit coercion is when you call a function like as.logical(), as.numeric(), as.integer(), as.double(), or as.character()

  • Implicit coercion happens when you use a vector in a specific context that expects a certain type of vector
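
To make the two definitions concrete, here is a small sketch with a made-up logical vector. Functions like sum() and mean() expect numbers, so logical values are implicitly coerced to 1 and 0; the explicit version spells out the same conversion.

```r
x <- c(TRUE, FALSE, TRUE, TRUE)

# Implicit coercion: sum() and mean() treat TRUE as 1 and FALSE as 0
n_true <- sum(x)      # 3
prop_true <- mean(x)  # 0.75

# Explicit coercion: the same conversion, requested on purpose
n_true_explicit <- sum(as.integer(x)) # 3

# Implicit coercion in a comparison: 1 is converted to "1" first
same <- 1 == "1"      # TRUE
```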

Special values

  • NA: Not available
  • NaN: Not a number
0 / 0
[1] NaN
  • Inf: Positive infinity
pi / 0
[1] Inf
  • -Inf: Negative infinity
-1 * (pi / 0)
[1] -Inf
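
Base R provides helper functions for detecting each of these values. A brief sketch on an arbitrary vector; note that is.na() is TRUE for both NA and NaN, while is.nan() is TRUE only for NaN:

```r
x <- c(1, NA, NaN, Inf, -Inf)

missing_flags  <- is.na(x)       # FALSE  TRUE  TRUE FALSE FALSE
nan_flags      <- is.nan(x)      # FALSE FALSE  TRUE FALSE FALSE
finite_flags   <- is.finite(x)   #  TRUE FALSE FALSE FALSE FALSE
infinite_flags <- is.infinite(x) # FALSE FALSE FALSE  TRUE  TRUE
```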

NAs are special ❄️s

x <- c(1, 2, 3, 4, NA)
mean(x)
[1] NA
mean(x, na.rm = TRUE)
[1] 2.5
summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   1.00    1.75    2.50    2.50    3.25    4.00       1 

NAs are logical

R uses NA to represent missing values in its data structures.

typeof(NA)
[1] "logical"
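
A bare NA is logical, but R also has typed missing values, and NA is coerced to match the rest of a vector. A short sketch using the built-in constants NA_integer_, NA_real_, and NA_character_:

```r
# Typed missing values
na_int <- typeof(NA_integer_)   # "integer"
na_dbl <- typeof(NA_real_)      # "double"
na_chr <- typeof(NA_character_) # "character"

# A bare NA takes on the type of the vector it is placed in
in_double    <- typeof(c(1.5, NA)) # "double"
in_character <- typeof(c("a", NA)) # "character"
```

Because logical is the least flexible type, a logical NA can be placed in any vector without changing that vector's type.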

Mental model for NAs

  • Unlike NaN, NAs are genuinely unknown values
  • But that doesn’t mean they can’t function in a logical way
  • Let’s think about why NAs are logical…

Why do the following give different answers?

# TRUE or NA
TRUE | NA
[1] TRUE
# FALSE or NA
FALSE | NA
[1] NA

Mental model for NAs

NA is unknown, so it could be TRUE or FALSE

  • TRUE | NA gives TRUE, because the answer is always TRUE whether the unknown NA is actually TRUE or FALSE
TRUE | TRUE  # if NA was TRUE
[1] TRUE
TRUE | FALSE # if NA was FALSE
[1] TRUE
  • FALSE | NA gives NA, because the answer changes depending on whether the unknown NA is actually TRUE or FALSE
FALSE | TRUE  # if NA was TRUE
[1] TRUE
FALSE | FALSE # if NA was FALSE
[1] FALSE
  • This may not make sense for mathematical operations, but it does make sense in the context of missing data
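
The same reasoning applies to the "and" operator and to any() and all(). A small sketch: the result is known only when the unknown value cannot change the answer.

```r
# AND: FALSE decides the answer; TRUE leaves it unknown
and_false <- FALSE & NA # FALSE
and_true  <- TRUE & NA  # NA

# any(): one TRUE is enough, regardless of the unknown value
any_with_true  <- any(c(TRUE, NA))  # TRUE
any_with_false <- any(c(FALSE, NA)) # NA

# all(): one FALSE is enough, regardless of the unknown value
all_with_false <- all(c(FALSE, NA)) # FALSE
all_with_true  <- all(c(TRUE, NA))  # NA
```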