Linear regression with multiple predictors



Modeling and inference

Data Science with R

Packages

  • DAAG for data
  • tidyverse for data wrangling and visualization
  • tidymodels for modeling
library(DAAG)
library(tidyverse)
library(tidymodels)

Data: Book weight and volume

The allbacks data frame gives measurements on the volume and weight of 15 books, some of which are paperback and some of which are hardback.

  • volume - cubic centimetres

  • area - square centimetres (0 for paperback books)

  • weight - grams

  • cover - hb (hardback) or pb (paperback)

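The printout below can be reproduced along these lines (displaying the data frame with as_tibble() is our assumption about how the slides rendered it):

as_tibble(allbacks)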
# A tibble: 15 × 4
   volume  area weight cover
    <dbl> <dbl>  <dbl> <fct>
 1    885   382    800 hb   
 2   1016   468    950 hb   
 3   1125   387   1050 hb   
 4    239   371    350 hb   
 5    701   371    750 hb   
 6    641   367    600 hb   
 7   1228   396   1075 hb   
 8    412     0    250 pb   
 9    953     0    700 pb   
10    929     0    650 pb   
11   1492     0    975 pb   
12    419     0    350 pb   
13   1010     0    950 pb   
14    595     0    425 pb   
15   1034     0    725 pb   

Book weight vs. volume

allbacks_1_fit <- linear_reg() |>
  fit(weight ~ volume, data = allbacks)

tidy(allbacks_1_fit)
# A tibble: 2 × 5
  term        estimate std.error statistic    p.value
  <chr>          <dbl>     <dbl>     <dbl>      <dbl>
1 (Intercept)  108.      88.4         1.22 0.245     
2 volume         0.709    0.0975      7.27 0.00000626
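
As a quick sketch of how the fitted model can be used for prediction (the 1000 cubic centimetre volume is an illustrative value, not from the slides):
# Predicted weight, in grams, of a hypothetical 1000 cubic centimetre book
new_book <- tibble(volume = 1000)
predict(allbacks_1_fit, new_data = new_book)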

Book weight vs. volume and cover

allbacks_2_fit <- linear_reg() |>
  fit(weight ~ volume + cover, data = allbacks)

tidy(allbacks_2_fit)
# A tibble: 3 × 5
  term        estimate std.error statistic      p.value
  <chr>          <dbl>     <dbl>     <dbl>        <dbl>
1 (Intercept)  198.      59.2         3.34 0.00584     
2 volume         0.718    0.0615     11.7  0.0000000660
3 coverpb     -184.      40.5        -4.55 0.000672    
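
Note that cover enters the model as an indicator variable: with R's default treatment contrasts, coverpb is 1 for paperback books and 0 for hardbacks, so hardback is the reference level. One way to inspect this encoding (the model.matrix() call is illustrative, not from the slides):
# The coverpb column is 1 for paperbacks and 0 for hardbacks
model.matrix(weight ~ volume + cover, data = allbacks) |>
  head()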

Interpretation of estimates

tidy(allbacks_2_fit)
# A tibble: 3 × 5
  term        estimate std.error statistic      p.value
  <chr>          <dbl>     <dbl>     <dbl>        <dbl>
1 (Intercept)  198.      59.2         3.34 0.00584     
2 volume         0.718    0.0615     11.7  0.0000000660
3 coverpb     -184.      40.5        -4.55 0.000672    
  • Slope - volume: Holding cover type constant, for each additional cubic centimetre of volume, the model predicts book weight to be higher, on average, by 0.718 grams.

  • Slope - cover: Holding volume constant, the model predicts that paperback books weigh, on average, 184 grams less than hardback books (see the prediction sketch below).

  • Intercept: The model predicts that a hardback book with 0 volume would weigh 198 grams, on average. (This doesn't make sense in context.)
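
A small prediction sketch makes the cover slope concrete: two hypothetical books with the same volume (1000 cubic centimetres is an illustrative value) but different cover types have predicted weights that differ by about 184 grams.
# Predicted weights for two 1000 cc books that differ only in cover type
new_books <- tibble(
  volume = 1000,
  cover  = factor(c("hb", "pb"), levels = c("hb", "pb"))
)
predict(allbacks_2_fit, new_data = new_books)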

\(R^2\)

\(R^2\) is the proportion of variability in the outcome that is explained by the regression model.

  • Model 1: weight ~ volume
glance(allbacks_1_fit)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic    p.value    df
      <dbl>         <dbl> <dbl>     <dbl>      <dbl> <dbl>
1     0.803         0.787  124.      52.9 0.00000626     1
# ℹ 6 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>,
#   deviance <dbl>, df.residual <int>, nobs <int>
  • Model 2: weight ~ volume + cover
glance(allbacks_2_fit)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic     p.value    df
      <dbl>         <dbl> <dbl>     <dbl>       <dbl> <dbl>
1     0.927         0.915  78.2      76.7 0.000000145     2
# ℹ 6 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>,
#   deviance <dbl>, df.residual <int>, nobs <int>
  • \(R^2\) increases (or at least never decreases) when any predictor is added to the model, even one unrelated to the outcome; see the sketch below.
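
A quick way to see this in action is to refit Model 1 with a column of pure noise added as a second predictor; the noise variable and seed below are illustrative and not part of the original slides.
# R-squared for weight ~ volume + noise is at least as high as for weight ~ volume,
# even though noise carries no information about weight
set.seed(1234)
allbacks_noise <- allbacks |>
  mutate(noise = rnorm(n()))

linear_reg() |>
  fit(weight ~ volume + noise, data = allbacks_noise) |>
  glance() |>
  select(r.squared, adj.r.squared)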

Adjusted \(R^2\)

Adjusted \(R^2\) applies a penalty to \(R^2\) for each additional predictor in the model, and is therefore a more suitable measure for comparing models with different numbers of predictors.
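
For a model with \(n\) observations and \(k\) predictors,

\[
R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}
\]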

  • Model 1: weight ~ volume
glance(allbacks_1_fit)$adj.r.squared
[1] 0.7874526
  • Model 2: weight ~ volume + cover
glance(allbacks_2_fit)$adj.r.squared
[1] 0.9153905
  • Adjusted \(R^2\) is higher for the model with volume and cover as predictors, and it is therefore the preferable model for predicting weight.
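
As a check, adjusted \(R^2\) for Model 2 can be recomputed by hand from the quantities glance() returns, since df.residual equals \(n - k - 1\) (this verification sketch is ours, not from the slides):
g2 <- glance(allbacks_2_fit)
1 - (1 - g2$r.squared) * (g2$nobs - 1) / g2$df.residual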

Model 1 - visualized

glance(allbacks_1_fit) |>
  select(r.squared, adj.r.squared)
# A tibble: 1 × 2
  r.squared adj.r.squared
      <dbl>         <dbl>
1     0.803         0.787
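
The slide presumably also displayed the fitted model; a minimal sketch of such a plot (the exact styling is an assumption) is:
# Weight vs. volume with the Model 1 fitted line overlaid
ggplot(allbacks, aes(x = volume, y = weight)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Volume (cubic centimetres)", y = "Weight (grams)")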

Model 2 - visualized

glance(allbacks_2_fit) |>
  select(r.squared, adj.r.squared)
# A tibble: 1 × 2
  r.squared adj.r.squared
      <dbl>         <dbl>
1     0.927         0.915
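
For Model 2, one way to draw the additive fit is to overlay the model's parallel fitted lines for hardbacks and paperbacks; again a sketch, with the styling assumed.
# Weight vs. volume, coloured by cover, with Model 2's parallel fitted lines
allbacks |>
  bind_cols(predict(allbacks_2_fit, new_data = allbacks)) |>
  ggplot(aes(x = volume, y = weight, color = cover)) +
  geom_point() +
  geom_line(aes(y = .pred)) +
  labs(x = "Volume (cubic centimetres)", y = "Weight (grams)")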

Takeaways

  • When interpreting the slope coefficient of a predictor in a multiple regression model, we state that the other predictors are held constant while that predictor increases.

  • Adjusted \(R^2\) is useful when comparing models with different numbers of predictors; it helps balance model complexity with explanatory power.