Modeling fish (Complete)

Introduction

Goal

Practice modeling using the fish dataset, which records measurements of two common fish species sold at fish markets.

Packages

We will use the tidyverse package for data wrangling and visualization and the tidymodels package for modeling.

library(tidyverse)
library(tidymodels)

Data

These data come from Kaggle and are commonly used in machine learning examples.

fish <- read_csv("https://data-science-with-r.github.io/data/fish.csv")

The data dictionary is below:

variable         description
species          Species name of fish
weight           Weight, in grams
length_vertical  Vertical length, in cm
length_diagonal  Diagonal length, in cm
length_cross     Cross length, in cm
height           Height, in cm
width            Diagonal width, in cm

Let’s take a look at the data.

fish
# A tibble: 55 × 7
   species weight length_vertical length_diagonal length_cross height width
   <chr>    <dbl>           <dbl>           <dbl>        <dbl>  <dbl> <dbl>
 1 Bream      242            23.2            25.4         30     11.5  4.02
 2 Bream      290            24              26.3         31.2   12.5  4.31
 3 Bream      340            23.9            26.5         31.1   12.4  4.70
 4 Bream      363            26.3            29           33.5   12.7  4.46
 5 Bream      430            26.5            29           34     12.4  5.13
 6 Bream      450            26.8            29.7         34.7   13.6  4.93
 7 Bream      500            26.8            29.7         34.5   14.2  5.28
 8 Bream      390            27.6            30           35     12.7  4.69
 9 Bream      450            27.6            30           35.1   14.0  4.84
10 Bream      500            28.5            30.7         36.2   14.2  4.96
# ℹ 45 more rows

Analysis

Visualizing the model

We’re going to investigate the relationship between the weights and heights of fish, predicting weight from height.

  • Create an appropriate plot to investigate this relationship. Add appropriate labels to the plot.
ggplot(fish, aes(x = height, y = weight)) +
  geom_point() +
  labs(
    title = "Weights vs. heights of fish",
    x = "Height (cm)",
    y = "Weight (g)"
  )

  • If you were to draw a straight line to best represent the relationship between the heights and weights of fish, where would it go? Why?

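Start from the bottom left and go up to the right. Identify the first and last points and draw a line that passes through or near most of the others, so that the points are spread roughly evenly above and below it.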

ggplot(fish, aes(x = height, y = weight)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(
    title = "Weights vs. heights of fish",
    x = "Height (cm)",
    y = "Weight of fish (grams)"
  )
`geom_smooth()` using formula = 'y ~ x'

  • What types of questions can this plot help answer?

Is there a relationship between the heights and weights of fish? If so, what are its direction and strength?

  • We can use this line to make predictions. Predict what you think the weight of a fish would be with a height of 10 cm, 15 cm, and 20 cm. Which prediction is considered extrapolation?

At 10 cm, we estimate a weight of 375 grams. At 15 cm, we estimate a weight of 600 grams. At 20 cm, we estimate a weight of 975 grams. The 20 cm prediction would be considered extrapolation because it lies beyond the range of heights observed in the data.
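
One way to check which of these heights counts as extrapolation is to compare them with the range of heights in the data. The following is a minimal sketch, not part of the original exercise; the object names height_range and candidates are our own.

```r
library(tidyverse)

# Same data file the document loads in the Data section
fish <- read_csv("https://data-science-with-r.github.io/data/fish.csv")

# Smallest and largest heights observed in the data
height_range <- fish |>
  summarize(min_height = min(height), max_height = max(height))
height_range

# Flag candidate heights that fall outside the observed range
candidates <- tibble(height = c(10, 15, 20)) |>
  mutate(
    extrapolation = height < height_range$min_height |
      height > height_range$max_height
  )
candidates
```

A height flagged TRUE is outside the observed range, so a prediction there relies on the line continuing to hold where we have no data.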

  • What is a residual?

The difference between the observed value and the predicted value of the response (observed minus predicted).
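
In symbols, using a common convention (the notation is ours, not the exercise's), the residual for fish \(i\) with observed weight \(y_i\) and predicted weight \(\hat{y}_i\) can be written as:

\[
e_i = y_i - \hat{y}_i
\]

Under this convention a positive residual means the fish is heavier than the line predicts, and a negative residual means it is lighter.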

Model fitting

  • Fit a model to predict fish weights from their heights.
fish_hw_fit <- linear_reg() |>
  fit(weight ~ height, data = fish)

fish_hw_fit
parsnip model object


Call:
stats::lm(formula = weight ~ height, data = data)

Coefficients:
(Intercept)       height  
    -288.42        60.92  
  • Predict what the weight of a fish would be with a height of 10 cm, 15 cm, and 20 cm using this model.
x <- c(10, 15, 20)
-288 + 60.92 * x
[1] 321.2 625.8 930.4
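
The same predictions can also be obtained from the fitted model object rather than by typing the coefficients by hand. This is a sketch under the assumption that the model is fit exactly as above; new_fish and fish_preds are names we introduce for illustration.

```r
library(tidyverse)
library(tidymodels)

fish <- read_csv("https://data-science-with-r.github.io/data/fish.csv")

# Refit the same model as above so this block runs on its own
fish_hw_fit <- linear_reg() |>
  fit(weight ~ height, data = fish)

# Heights to predict for; the column name must match the predictor
new_fish <- tibble(height = c(10, 15, 20))

# predict() on a parsnip fit returns a tibble with a .pred column
fish_preds <- predict(fish_hw_fit, new_data = new_fish)
bind_cols(new_fish, fish_preds)
```

Because predict() uses the unrounded coefficients, its results may differ slightly from the hand calculation, which rounds the intercept to -288.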
  • Calculate predicted weights for all fish in the data and visualize the residuals under this model.
fish_hw_aug <- augment(fish_hw_fit, new_data = fish)

ggplot(fish_hw_aug, aes(x = height, y = weight)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +  
  geom_segment(aes(xend = height, yend = .pred), color = "gray") +  
  geom_point(aes(y = .pred), shape = "circle open") + 
  theme_minimal() +
  labs(
    title = "Weights vs. heights of fish",
    subtitle = "Residuals",
    x = "Height (cm)",
    y = "Weight (g)"
  )
`geom_smooth()` using formula = 'y ~ x'
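
The gray segments in this plot are the residuals. A possible way to look at them as numbers, sketched here with a column name (residual) and object name (fish_hw_resid) of our own choosing, is to subtract the predictions from the observed weights:

```r
library(tidyverse)
library(tidymodels)

fish <- read_csv("https://data-science-with-r.github.io/data/fish.csv")

fish_hw_fit <- linear_reg() |>
  fit(weight ~ height, data = fish)

# Residual = observed weight - predicted weight
fish_hw_resid <- augment(fish_hw_fit, new_data = fish) |>
  mutate(residual = weight - .pred)

# The five fish the line misses by the most, in either direction
fish_hw_resid |>
  select(species, height, weight, .pred, residual) |>
  arrange(desc(abs(residual))) |>
  slice_head(n = 5)

# For a least squares line with an intercept, residuals average to about zero
mean(fish_hw_resid$residual)
```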

Model summary

  • Display the model summary including estimates for the slope and intercept along with measurements of uncertainty around them. Show how you can extract these values from the model output.
fish_hw_tidy <- tidy(fish_hw_fit)
fish_hw_tidy
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   -288.      34.0      -8.49 1.83e-11
2 height          60.9      2.64     23.1  2.40e-29
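
Since the tidy output is a tibble, individual values can be extracted with ordinary dplyr verbs. The sketch below shows one way to do this; slope, intercept, and slope_se are hypothetical names introduced for illustration.

```r
library(tidyverse)
library(tidymodels)

fish <- read_csv("https://data-science-with-r.github.io/data/fish.csv")

fish_hw_tidy <- linear_reg() |>
  fit(weight ~ height, data = fish) |>
  tidy()

# Filter to the row for a term, then pull the column of interest
slope <- fish_hw_tidy |>
  filter(term == "height") |>
  pull(estimate)

intercept <- fish_hw_tidy |>
  filter(term == "(Intercept)") |>
  pull(estimate)

slope_se <- fish_hw_tidy |>
  filter(term == "height") |>
  pull(std.error)

c(intercept = intercept, slope = slope, slope_se = slope_se)
```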
  • Write out your model using mathematical notation.

\(\widehat{weight} = -288 + 60.9 \times height\)

Correlation

We can also assess correlation between two quantitative variables.

  • What is correlation? What values can correlation take?

Correlation measures the strength and direction of the linear relationship between two quantitative variables. It is bounded by -1 and 1.

fish |>
  summarize(r = cor(height, weight))
# A tibble: 1 × 1
      r
  <dbl>
1 0.954
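
To see what cor() is doing, the correlation can be rebuilt from its definition and compared with the built-in result. This is an illustrative sketch, assuming the data contain no missing values; r_by_hand and r_builtin are our own names.

```r
library(tidyverse)

fish <- read_csv("https://data-science-with-r.github.io/data/fish.csv")

# Sum of products of deviations from the means, scaled by (n - 1) and
# the two standard deviations (sd() also uses n - 1)
r_by_hand <- fish |>
  summarize(
    r = sum((height - mean(height)) * (weight - mean(weight))) /
      ((n() - 1) * sd(height) * sd(weight))
  ) |>
  pull(r)

r_builtin <- cor(fish$height, fish$weight)

c(by_hand = r_by_hand, builtin = r_builtin)
```

Both values should agree up to floating point error, and swapping height and weight leaves the result unchanged because the formula is symmetric in the two variables.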

Adding a third variable

  • Does the relationship between heights and weights of fish change if we take into consideration species? Plot two separate straight lines for the Bream and Roach species.
ggplot(fish, aes(x = height, y = weight, color = species)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(
    title = "Weights vs. heights of fish",
    x = "Height (cm)",
    y = "Weight (g)"
  )
`geom_smooth()` using formula = 'y ~ x'
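
With color mapped to species, geom_smooth() fits a separate line within each species. To put numbers on those two lines, one option is to fit the model within each group. This sketch uses lm() directly instead of linear_reg() to keep it short, and species_fits is a name we introduce.

```r
library(tidyverse)
library(broom)

fish <- read_csv("https://data-science-with-r.github.io/data/fish.csv")

# Fit weight ~ height separately within each species and tidy each fit
species_fits <- fish |>
  group_by(species) |>
  group_modify(~ tidy(lm(weight ~ height, data = .x))) |>
  ungroup()

species_fits
```

Comparing the height rows across species shows whether the slope, and not just the overall level, differs between Bream and Roach.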

Fitting other models

  • We can fit more models than just a straight line. Use method = "loess". What is different from the plot created before?
ggplot(fish, aes(x = height, y = weight)) +
  geom_point() +
  geom_smooth(method = "loess") +
  labs(
    title = "Weights vs. heights of fish",
    x = "Height (cm)",
    y = "Weight (g)"
  )
`geom_smooth()` using formula = 'y ~ x'
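
To make the comparison easier, both smoothers can be drawn on the same plot. This is a sketch rather than part of the original exercise; mapping color to a constant string inside aes() is a common trick to get a legend entry per method, and the object name p is ours.

```r
library(tidyverse)

fish <- read_csv("https://data-science-with-r.github.io/data/fish.csv")

p <- ggplot(fish, aes(x = height, y = weight)) +
  geom_point() +
  # Straight line fit by least squares
  geom_smooth(aes(color = "lm"), method = "lm", formula = y ~ x, se = FALSE) +
  # Flexible local regression curve
  geom_smooth(aes(color = "loess"), method = "loess", formula = y ~ x, se = FALSE) +
  labs(
    title = "Weights vs. heights of fish",
    x = "Height (cm)",
    y = "Weight (g)",
    color = "Method"
  )

p
```

The linear fit is constrained to a single straight line, while the loess curve is allowed to bend to follow local patterns in the points.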