Modeling and inference
# A tibble: 2 × 5
  term          estimate std.error statistic  p.value
  <chr>            <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    24.3       1.29       18.8  8.28e-24
2 family_income  -0.0431    0.0108     -3.98 2.29e- 4
For each additional $1,000 of family income, we would expect the gift aid students receive to differ by 1,000 * (-0.0431) = -$43.10, i.e. to be $43.10 lower, on average.
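The arithmetic behind this interpretation can be sketched in a few lines of Python. The slope is copied from the regression output above; the function name `expected_aid_change` is hypothetical, and the sketch assumes (as the interpretation itself does) that both family income and gift aid are recorded in thousands of dollars.

```python
# Slope for family_income, copied from the regression output above.
# Assumption: both variables are measured in $1,000s.
SLOPE = -0.0431


def expected_aid_change(income_change_dollars, slope=SLOPE):
    """Expected average change in gift aid (in dollars) for a given
    change in family income (in dollars)."""
    income_change_thousands = income_change_dollars / 1000
    aid_change_thousands = slope * income_change_thousands
    return aid_change_thousands * 1000


# Expected average change in gift aid for $1,000 more family income.
print(round(expected_aid_change(1_000), 2))  # -43.1
```

The result describes an average relationship estimated from one sample, not a fixed amount that applies to every student.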
But is the difference exactly $43.10 for all students at this school?!
So far we have done lots of estimation (mean, median, slope, etc.), i.e. we used data from samples to calculate sample statistics, which can then be used as estimates for population parameters.
If you want to catch a fish, do you prefer a spear or a net?
If you want to estimate a population parameter, do you prefer to report a range of values the parameter might be in, or a single value?
A plausible range of values for the population parameter is a confidence interval.
In order to construct a confidence interval, we need to quantify the variability of our sample statistic.
For example, if we want to construct a confidence interval for a population slope, we need to come up with a plausible range of values around our observed sample slope.
This range will depend on how precise and how accurate our sample statistic is as an estimate of the population parameter.
Quantifying this requires a measurement of how much we would expect the sample statistic to vary from sample to sample.
Suppose we split a classroom in half down the middle and ask each student their height. Then, we calculate the mean height of the students on each side of the classroom. Would you expect these two means to be exactly equal, close but not equal, or wildly different?
Suppose you randomly sample 50 students and 5 of them are left-handed. If you were to take another random sample of 50 students, how many would you expect to be left-handed? Would you be surprised if only 3 of them were left-handed? Would you be surprised if 40 of them were left-handed?
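One way to build intuition for the left-handedness question is to simulate it. The sketch below assumes, purely for illustration, that the true proportion of left-handed students equals the observed 5 out of 50 (10%); that value, the seed, and the function name `simulate_left_handed_counts` are assumptions, not facts about any real population.

```python
import random


def simulate_left_handed_counts(p=0.10, n=50, reps=10_000, seed=1):
    """Draw `reps` random samples of size `n` and count the left-handed
    students in each. `p` is an ASSUMED population proportion."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) for _ in range(reps)]


counts = simulate_left_handed_counts()

# How often does a sample of 50 contain 3 or fewer, or 40 or more, left-handers?
share_3_or_fewer = sum(c <= 3 for c in counts) / len(counts)
share_40_or_more = sum(c >= 40 for c in counts) / len(counts)

print(f"average count per sample: {sum(counts) / len(counts):.2f}")
print(f"share of samples with 3 or fewer: {share_3_or_fewer:.3f}")
print(f"share of samples with 40 or more: {share_40_or_more:.3f}")
```

Under this assumption, roughly a quarter of the samples contain 3 or fewer left-handed students, so 3 would not be surprising, while 40 or more essentially never occurs.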
We can quantify the variability of sample statistics using simulation (via bootstrapping) or theory (via the Central Limit Theorem).
Bootstrapping 🥾
Generated assuming there are more students like the ones in the observed sample…
1. Take a bootstrap sample: a random sample taken with replacement from the original sample, of the same size as the original sample.
2. Calculate the bootstrap statistic: a statistic such as mean, median, proportion, slope, etc. computed on the bootstrap sample.
3. Repeat steps (1) and (2) many times to create a bootstrap distribution: a distribution of bootstrap statistics.
4. Calculate the bounds of the XX% confidence interval as the middle XX% of the bootstrap distribution.
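The steps above can be sketched in plain Python. This is a minimal illustration, not the code that produced the output in these slides: the data are simulated stand-ins (hypothetical incomes and gift aid in $1,000s), and the function names `slope` and `bootstrap_ci`, the number of repetitions, and the seed are all assumptions.

```python
import random


def slope(pairs):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    return sxy / sxx


def bootstrap_ci(pairs, stat, level=0.95, reps=2_000, seed=1):
    """Percentile bootstrap confidence interval for `stat`."""
    rng = random.Random(seed)
    n = len(pairs)
    boot_stats = []
    for _ in range(reps):                    # step 3: repeat many times
        resample = rng.choices(pairs, k=n)   # step 1: sample with replacement, same size
        boot_stats.append(stat(resample))    # step 2: bootstrap statistic
    boot_stats.sort()
    tail = (1 - level) / 2                   # step 4: keep the middle `level` share
    lower = boot_stats[round(tail * reps)]
    upper = boot_stats[round((1 - tail) * reps) - 1]
    return lower, upper


# Simulated stand-in data (NOT the real data set): 50 hypothetical students.
data_rng = random.Random(42)
sample = []
for _ in range(50):
    income = data_rng.uniform(0, 250)                     # family income, $1,000s
    aid = 24 - 0.04 * income + data_rng.gauss(0, 5)       # gift aid, $1,000s
    sample.append((income, aid))

observed = slope(sample)
lower, upper = bootstrap_ci(sample, slope)
print(f"observed slope: {observed:.4f}")
print(f"95% bootstrap CI: ({lower:.4f}, {upper:.4f})")
```

Resampling whole (income, aid) pairs keeps each student's two values together, so every bootstrap sample is itself a plausible data set on which the slope can be recomputed.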
we could keep going…
# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1  -0.0684  -0.0222
We are 95% confident that for each additional $1,000 of family income, we would expect students to receive $22.23 to $68.41 less in gift aid, on average.
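The dollar amounts in this interpretation come from rescaling the interval bounds. The short sketch below does the conversion using the bounds as printed in the output above; it assumes gift aid is recorded in $1,000s, and the helper name `dollars_less` is hypothetical. Because the printed bounds are rounded to three significant digits, the result differs by a few cents from the $68.41 and $22.23 quoted in the text, which were presumably computed from unrounded values.

```python
# Bounds as printed in the confidence interval output (rounded by the printout).
lower_ci, upper_ci = -0.0684, -0.0222


def dollars_less(bound):
    """Convert a negative slope bound (in $1,000s of aid per $1,000 of
    income) into dollars LESS aid per additional $1,000 of income."""
    return -bound * 1000


print(f"${dollars_less(upper_ci):.2f} to ${dollars_less(lower_ci):.2f} less in gift aid")
# prints "$22.20 to $68.40 less in gift aid"
```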