Predicting NBA salaries with multiple linear regression
Modeling Inference
Data Science with R
Getting Started
Programming exercises are designed to give you an opportunity to practice what you learn in the videos and readings. These exercises feature interactive code cells that let you write, edit, and run R code without leaving your browser.
When the ▶️ Run Code button turns to a solid color (with no flashing bubble indicating that the document is still loading), you can interact with the code cells!
Packages
We’ll use the tidyverse and tidymodels for this programming exercise. These are already installed for you to use!
Motivation
The National Basketball Association (NBA) is a professional sports league that consists of 30 teams across the United States and Canada. In the early 2000s, NBA teams started to use advanced analytics to gain a competitive edge over their opponents. Teams use analytics to make data-driven decisions on game strategy and business operations. In this programming exercise, we are going to work with NBA data to better understand the salaries NBA players make.
Data
This dataset consists of player per-game statistics for the NBA’s 2022-23 season with player salary data. We are going to use just a subset of these variables in this programming exercise. The variables we will use can be seen below. The data key for the entire dataset can be revealed below.
Variable | Description |
---|---|
Salary | Yearly salary an NBA player makes, in USD |
Position | Position on the court the NBA player plays: PG, SG, SF, PF, and C |
Age | Age of the player, rounded to the nearest year |
Data key for the entire dataset:
Variable | Description |
---|---|
Salary | Yearly salary an NBA player makes, in USD |
Position | Position on the court the NBA player plays: PG, SG, SF, PF, and C |
Age | Age of the player, rounded to the nearest year |
Team | Team of the NBA player |
GP | Total number of games played |
GS | Total number of games the player started |
MP | Average number of minutes played |
FG | The total number of shots made (2P + 3P) |
FGA | The total number of shots taken (2PA + 3PA) |
FG% | The percentage of all shots made (FG / FGA) |
3P | The total number of 3-point shots made |
3PA | The total number of 3-point shots taken |
3P% | The percentage of 3-point shots made (3P / 3PA) |
2P | Total number of 2-point shots made |
2PA | Total number of 2-point shots attempted |
eFG% | Effective field goal percentage ((FG + 0.5 * 3P) / FGA) |
FT | Total number of free throws made |
FTA | The total free throws attempted |
FT% | The percent of FTs made (FT / FTA) |
ORB | Average offensive rebounds per game |
DRB | Average defensive rebounds per game |
TRB | Total rebounds (ORB + DRB) per game |
AST | Average assists per game |
STL | Average steals per game played |
BLK | Average blocks per game played |
TOV | Average turnovers per game |
PF | Average fouls accrued per game played |
PTS | Average points scored per game played |
Now, let’s explore these data!
Exploratory data analysis
Before we fit a linear regression model, we are going to explore the data. Let’s start by exploring our response variable of interest, `Salary`. Specifically, we are interested in the relationship between `Salary`, `Age`, and `Position`. Below, calculate the mean and standard deviation of the quantitative variables for each position. In the same code, produce the count of each position.
nba |>
  group_by(Position) |>
  summarize(
    mean_salary = mean(Salary, na.rm = TRUE),
    sd_salary = sd(Salary, na.rm = TRUE),
    mean_age = mean(Age, na.rm = TRUE),
    sd_age = sd(Age, na.rm = TRUE),
    position_n = n()
  )
# A tibble: 5 × 6
Position mean_salary sd_salary mean_age sd_age position_n
<chr> <dbl> <dbl> <dbl> <dbl> <int>
1 C 7282722. 8983558. 26.3 4.48 91
2 PF 8885045. 10897789. 26.6 4.78 86
3 PG 11579573. 13889342. 26.2 4.35 77
4 SF 8132253. 11055604. 25.6 3.64 91
5 SG 6681301. 8308597. 24.7 3.85 115
Modeling
Visualizing the model
We are now going to plot our response variable `Salary` against `Age` using a scatterplot, coloring the points by `Position`. The plot can be seen below.
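A plot of this kind could be produced with ggplot2 roughly as follows. This is a minimal sketch, not the exercise's own plotting code: `nba_toy` is a hypothetical, simulated stand-in for the `nba` data so that the block runs on its own, and adding one fitted line per position with `geom_smooth()` is an assumption about what the figure shows.

```r
library(ggplot2)

# Hypothetical stand-in for the exercise's `nba` data; all values are simulated.
set.seed(1)
nba_toy <- data.frame(
  Age = sample(19:38, 60, replace = TRUE),
  Position = rep(c("C", "PF", "PG", "SF", "SG"), each = 12)
)
nba_toy$Salary <- pmax(1e6, 1e6 * (nba_toy$Age - 18) + rnorm(60, sd = 3e6))

# Scatterplot of Salary against Age, colored by Position
salary_plot <- ggplot(nba_toy, aes(x = Age, y = Salary, color = Position)) +
  geom_point() +
  # One least-squares line per position (assumed addition, see lead-in)
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Age (years)", y = "Salary (USD)")

salary_plot
```

With the real data, replacing `nba_toy` with `nba` should be the only change needed.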
From the videos, we learned about two different types of multiple linear regression models that we could fit:
Main effects models: The relationship between x and y does not change based on z
Interaction effects models: The relationship between x and y does change by z
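In symbols, the two model types could be written as follows. This is a sketch in notation that does not appear in the exercise: $b_0, \dots, b_3$ are assumed names for the fitted coefficients, and $z$ is taken to be a 0/1 indicator variable for simplicity.

```latex
\begin{align*}
\text{Main effects:} \quad & \hat{y} = b_0 + b_1 x + b_2 z \\
\text{Interaction effects:} \quad & \hat{y} = b_0 + b_1 x + b_2 z + b_3 (x \cdot z)
\end{align*}
```

Under the main effects model, the slope of $y$ on $x$ is $b_1$ for both values of $z$. Under the interaction effects model, the slope is $b_1$ when $z = 0$ and $b_1 + b_3$ when $z = 1$.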
(Thought exercise) Based on these definitions, do you think it would be appropriate to fit a main effects or an interaction effects model?
It is justifiable to fit an interaction effects model instead of a main effects model. Based on the scatterplot, we can see that the relationship between Salary and Age changes depending on the position the player plays.
Multiple linear regression
So that we can explore both concepts, regardless of your conclusion in the thought exercise, we are first going to fit a main effects model. We are doing this to:
Show that R will still fit the model, despite it not being the most appropriate
Practice fitting main effects models
Practice interpreting main effects model output
Main effects model
Fit the main effects model below. Name this model `sal_main_fit`, and pass it to the `tidy()` function. Next, interpret the estimate for `PositionPF` in the context of the problem.
sal_main_fit <- linear_reg() |>
  fit(Salary ~ Age + Position, data = nba)
tidy(sal_main_fit)
# A tibble: 6 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -20019456. 2996402. -6.68 6.96e-11
2 Age 1038670. 107316. 9.68 2.88e-20
3 PositionPF 1307286. 1449993. 0.902 3.68e- 1
4 PositionPG 4391276. 1492626. 2.94 3.43e- 3
5 PositionSF 1602852. 1431168. 1.12 2.63e- 1
6 PositionSG 1095288. 1363752. 0.803 4.22e- 1
Holding age constant, we estimate the mean salary of a PF to be 1,307,286 USD higher than the mean salary of a C.
(Thought exercise) What position do you not see a term for? Where is it?
We don’t see a specific term for `C`. That is because C is the baseline level, so it is absorbed into the `(Intercept)` term! We can interpret the intercept as: for a player of age 0, we estimate the mean salary of an NBA C to be -20,019,456 USD (an extrapolation far outside the observed ages, so it has no practical meaning). Each other position estimate is the estimated difference in `Salary` relative to C, after controlling for `Age`!
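To see how these estimates combine, the arithmetic below computes the predicted mean salary for a C and a PF of the same age. It is a small illustration using the estimates printed in the `tidy(sal_main_fit)` output; the age of 25 and the variable names are arbitrary choices made for this example.

```r
# Estimates copied from the tidy(sal_main_fit) output
intercept <- -20019456  # baseline position (C) at Age = 0
b_age     <- 1038670    # change in mean salary per additional year of age
b_pf      <- 1307286    # PF relative to C, holding Age constant

age <- 25  # hypothetical age chosen for illustration

pred_c  <- intercept + b_age * age         # predicted mean salary, C
pred_pf <- intercept + b_age * age + b_pf  # predicted mean salary, PF

pred_c
pred_pf
pred_pf - pred_c  # equals b_pf at any age in a main effects model
```

The gap between the two predictions is the `PositionPF` estimate no matter which age is plugged in, which is what "holding age constant" means here.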
We can change the baseline. The following code changes the baseline position to `PG`; notice how the estimates change based on which position is the baseline! Note: the `Age` estimate does not change, because in a main effects model the relationship between Salary and Age does not depend on Position.
nba_diff <- nba |>
  mutate(Position = fct_relevel(Position, "PG"))
sal_main_2_fit <- linear_reg() |>
  fit(Salary ~ Age + Position, data = nba_diff)
tidy(sal_main_2_fit)
# A tibble: 6 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -15628180. 3018134. -5.18 3.38e- 7
2 Age 1038670. 107316. 9.68 2.88e-20
3 PositionC -4391276. 1492626. -2.94 3.43e- 3
4 PositionPF -3083990. 1512886. -2.04 4.21e- 2
5 PositionSF -2788424. 1494146. -1.87 6.27e- 2
6 PositionSG -3295988. 1429037. -2.31 2.15e- 2
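The two main effects outputs describe the same fitted model, just expressed relative to different baselines. As a quick check, the arithmetic below recovers the PG-baseline estimates from the C-baseline ones; the variable names are chosen for this example only.

```r
# Estimates copied from the C-baseline output, tidy(sal_main_fit)
intercept_c <- -20019456
position_pg <- 4391276  # PG relative to C
position_pf <- 1307286  # PF relative to C

# With PG as the baseline, the intercept absorbs the PG shift
intercept_pg <- intercept_c + position_pg

# PF relative to PG is the difference of the two C-relative estimates
pf_vs_pg <- position_pf - position_pg

# C relative to PG is the PG-relative-to-C estimate with its sign flipped
c_vs_pg <- -position_pg

intercept_pg
pf_vs_pg
c_vs_pg
```

These values match the `(Intercept)`, `PositionPF`, and `PositionC` rows of the PG-baseline output.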
Interaction effects model
Now, it’s time to fit the interaction effects model between `Salary`, `Age`, and `Position`. Please do so below, and interpret the `Age:PositionPF` interaction term.
sal_int_fit <- linear_reg() |>
  fit(Salary ~ Age * Position, data = nba)
tidy(sal_int_fit)
# A tibble: 10 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -3607189. 5950994. -0.606 0.545
2 Age 414290. 223212. 1.86 0.0641
3 PositionPF -13289313. 8318463. -1.60 0.111
4 PositionPG -29720679. 8918015. -3.33 0.000931
5 PositionSF -26887800. 9258556. -2.90 0.00386
6 PositionSG -11312489. 8280299. -1.37 0.173
7 Age:PositionPF 556044. 310153. 1.79 0.0737
8 Age:PositionPG 1300074. 335283. 3.88 0.000121
9 Age:PositionSF 1096922. 353991. 3.10 0.00207
10 Age:PositionSG 461940. 321063. 1.44 0.151
For a Power Forward, we estimate that a one-year increase in age is associated with an increase in mean salary of approximately 970,334 USD.
Why 970,334? The `Age` coefficient, 414,290, is the estimated slope for the baseline position, Center. The term `Age:PositionPF` has a coefficient of 556,044, which is the estimated difference in slope between PF and Center. Thus, the slope for the PF position is estimated to be 556,044 larger than the baseline slope: 414,290 + 556,044 = 970,334.
Note: This is the mathematical representation of the plot created above!
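The same reasoning gives an estimated Age slope for every position. The short calculation below applies it to all five, using the estimates from the `tidy(sal_int_fit)` output; the vector names are chosen for this example.

```r
# Age slope for the baseline position (C), from tidy(sal_int_fit)
age_slope_c <- 414290

# Age:Position interaction estimates; the baseline's difference is 0
slope_diffs <- c(C = 0, PF = 556044, PG = 1300074, SF = 1096922, SG = 461940)

# Estimated change in mean salary per additional year of age, by position
position_slopes <- age_slope_c + slope_diffs
position_slopes
```

Each element is the slope of one position's fitted line, so PG has the steepest estimated Salary and Age relationship and C the shallowest.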
If we change the baseline position, the `Age` coefficient will also change, because the interaction term allows the relationship between `Age` and `Salary` to differ by `Position`. Let’s see this below.
sal_int_2_fit <- linear_reg() |>
  fit(Salary ~ Age * Position, data = nba_diff)
tidy(sal_int_2_fit)
# A tibble: 10 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -33327869. 6642038. -5.02 7.54e- 7
2 Age 1714364. 250181. 6.85 2.40e-11
3 PositionC 29720679. 8918015. 3.33 9.31e- 4
4 PositionPF 16431366. 8826050. 1.86 6.33e- 2
5 PositionSF 2832880. 9717161. 0.292 7.71e- 1
6 PositionSG 18408190. 8790091. 2.09 3.68e- 2
7 Age:PositionC -1300074. 335283. -3.88 1.21e- 4
8 Age:PositionPF -744030. 330094. -2.25 2.47e- 2
9 Age:PositionSF -203152. 371586. -0.547 5.85e- 1
10 Age:PositionSG -838134. 340365. -2.46 1.42e- 2
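Although the `Age` coefficient changes with the baseline, the slope implied for each position does not. The check below rebuilds the position-specific slopes from the PG-baseline output and compares two of them with the slopes implied by the C-baseline output; the variable names are chosen for this example.

```r
# PG-baseline output, tidy(sal_int_2_fit)
age_slope_pg <- 1714364
diffs_vs_pg  <- c(C = -1300074, PF = -744030, PG = 0, SF = -203152, SG = -838134)
slopes_from_pg <- age_slope_pg + diffs_vs_pg

# C-baseline output, tidy(sal_int_fit): Age and Age:PositionPF estimates
slope_c_from_c  <- 414290
slope_pf_from_c <- 414290 + 556044

slopes_from_pg
slopes_from_pg[["C"]] == slope_c_from_c    # same C slope from either baseline
slopes_from_pg[["PF"]] == slope_pf_from_c  # same PF slope from either baseline
```

Changing the baseline only changes which position the coefficients are measured against, not the fitted lines themselves.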
Summary
Multiple linear regression allows a single quantitative response variable to be modeled by more than one predictor variable.
A main effects model has the restriction of keeping the relationship between x and y consistent, regardless of the values of the other variables in the model.
An interaction effects model relaxes this restriction, and allows the relationship between x and y to change based on values of z.
Regression output has a baseline group when a predictor variable is categorical. The baseline is absorbed into the `(Intercept)` term, with the other categorical level coefficients representing the deviation from the baseline.
Your turn: Challenge
Use this space to fit more complicated models, and explore different relationships that model our response variable `Salary`! As a reminder, the complete data dictionary can be found above.