Lecture 26 – Residuals and Inference

DSC 10, Fall 2022

Announcements

Agenda

Residuals

Quality of fit

Example: Non-linear data

This line doesn't fit the data at all, despite being the "best" line for the data!

Residuals

$$\text{error} = \text{actual } y - \text{predicted } y$$
$$\text{residual} = \text{actual } y - \text{predicted } y \text{ by regression line}$$
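
As a minimal sketch of these definitions in plain Python: the helper names `fit_line` and `residuals` and the heights below are made up for illustration, and are not the lecture's own code or data. Each residual is the actual $y$ minus the $y$ predicted by the regression line.

```python
from statistics import mean

def fit_line(x, y):
    """Return (slope, intercept) of the least squares regression line."""
    x_bar, y_bar = mean(x), mean(y)
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    intercept = y_bar - slope * x_bar
    return slope, intercept

def residuals(x, y):
    """Return actual y minus the y predicted by the regression line."""
    slope, intercept = fit_line(x, y)
    return [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]

# Hypothetical heights in inches (not the lecture's dataset).
mom = [61, 63, 64, 66, 68]
son = [66, 68, 67, 70, 72]

res = residuals(mom, son)
print(res)
```

For a least squares line, the residuals sum to zero up to rounding error, so some are positive and some are negative.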

Calculating residuals

Example: Predicting a son's height from his mother's height 👵👨 📏

Is the association between 'mom' and 'son' linear?

Residual plots

The residual plot for a non-linear association 🚗

Note that as 'mpg' increases, the residuals go from being mostly large, to being mostly small, to being mostly large again. That's a pattern!
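
To see the same kind of pattern without a plot, here is a small sketch on hypothetical curved data (a made-up quadratic, not the lecture's car data; `fit_line` is an illustrative helper). It fits the regression line and prints the sign of each residual from left to right.

```python
from statistics import mean

def fit_line(x, y):
    """Return (slope, intercept) of the least squares regression line."""
    x_bar, y_bar = mean(x), mean(y)
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical curved data: y depends on x quadratically, not linearly.
x = list(range(11))
y = [(xi - 5) ** 2 for xi in x]

slope, intercept = fit_line(x, y)
res = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]

# One character per point, ordered by x: "+" for a positive residual.
signs = "".join("+" if r > 0 else "-" for r in res)
print(signs)  # ++-------++
```

The residuals are positive at both ends and negative in the middle, which is the kind of pattern that suggests a line is the wrong model.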

Issue: Patterns in the residual plot

Another example: 'mpg' and 'acceleration'

Note that the residuals tend to vary more for smaller accelerations than they do for larger accelerations – that is, the vertical spread of the plot is not similar at all points on the $x$-axis.
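
Uneven vertical spread can also be checked numerically. The sketch below uses hypothetical data whose scatter around the trend shrinks as $x$ grows (made-up values, not the lecture's 'mpg' and 'acceleration' data; `fit_line` is an illustrative helper). It compares the standard deviation of the residuals for the smaller and larger $x$ values.

```python
from statistics import mean, pstdev

def fit_line(x, y):
    """Return (slope, intercept) of the least squares regression line."""
    x_bar, y_bar = mean(x), mean(y)
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical data: a linear trend plus alternating deviations
# whose size decreases as x increases.
x = list(range(1, 21))
noise = [(-1) ** xi * (21 - xi) / 2 for xi in x]
y = [2 * xi + ni for xi, ni in zip(x, noise)]

slope, intercept = fit_line(x, y)
res = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]

# Spread of residuals for the 10 smallest and 10 largest x values.
spread_small_x = pstdev(res[:10])
spread_large_x = pstdev(res[10:])
print(spread_small_x, spread_large_x)
```

A clearly larger spread on one side of the $x$-axis than the other is the numerical counterpart of the uneven residual plot.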

Issue: Uneven vertical spread

Example: Anscombe's quartet

Example: The Datasaurus Dozen 🦖

Never trust summary statistics alone; always visualize your data!
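
As a sketch of why this matters, the code below compares two of the four datasets in Anscombe's quartet. The values are typed in as commonly published, so it is worth checking them against a trusted copy, and `correlation` is an illustrative helper. The two datasets share the same $x$ values and have nearly identical means and correlations, even though one is roughly linear and the other is clearly curved.

```python
from statistics import mean

def correlation(x, y):
    """Return the correlation coefficient r between x and y."""
    x_bar, y_bar = mean(x), mean(y)
    numerator = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    denominator = (sum((xi - x_bar) ** 2 for xi in x)
                   * sum((yi - y_bar) ** 2 for yi in y)) ** 0.5
    return numerator / denominator

# Datasets I and II of Anscombe's quartet (shared x values).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

mean_1, mean_2 = mean(y1), mean(y2)
r_1, r_2 = correlation(x, y1), correlation(x, y2)
print(round(mean_1, 2), round(mean_2, 2))
print(round(r_1, 2), round(r_2, 2))
```

Only a scatter plot or a residual plot reveals that the second dataset follows a curve.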



Inference for regression

Another perspective on regression

Concept Check ✅ – Answer at cc.dsc10.com

What strategy will help us assess how different our regression line's predictions would have been if we'd used a different sample?

Prediction intervals

We want to come up with a range of reasonable values for a prediction for a single input $x$. To do so, we'll:

  1. Bootstrap the sample.
  2. Compute the slope and intercept of the regression line for that sample.
  3. Repeat steps 1 and 2 many times to compute many slopes and many intercepts.
  4. For a given $x$, use the bootstrapped slopes and intercepts to create bootstrapped predictions, and take the middle 95% of them.

The resulting interval will be called a prediction interval.
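
The four steps above can be sketched in plain Python as follows. This is one possible implementation under stated assumptions, not the lecture's own code: the function names, the made-up heights, the choice of 1,000 repetitions, and the simple index-based percentile are all illustrative.

```python
import random
from statistics import mean

def fit_line(x, y):
    """Return (slope, intercept) of the least squares regression line."""
    x_bar, y_bar = mean(x), mean(y)
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    intercept = y_bar - slope * x_bar
    return slope, intercept

def bootstrap_prediction_interval(x, y, new_x, repetitions=1000, seed=0):
    """Return the middle 95% of bootstrapped predictions at new_x."""
    rng = random.Random(seed)
    n = len(x)
    predictions = []
    while len(predictions) < repetitions:
        # Step 1: resample (x, y) pairs with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        boot_x = [x[i] for i in idx]
        boot_y = [y[i] for i in idx]
        if len(set(boot_x)) < 2:
            continue  # slope is undefined if every resampled x is the same
        # Step 2: fit the regression line to the resample.
        slope, intercept = fit_line(boot_x, boot_y)
        # Step 4 (first half): predict for the given input.
        predictions.append(slope * new_x + intercept)
    # Step 4 (second half): take the middle 95% of the predictions.
    predictions.sort()
    lower = predictions[int(0.025 * (len(predictions) - 1))]
    upper = predictions[int(0.975 * (len(predictions) - 1))]
    return lower, upper

# Hypothetical heights in inches (not the lecture's dataset).
mom = [60, 61, 62, 63, 64, 65, 66, 67, 68, 69]
son = [65, 67, 66, 68, 70, 69, 71, 70, 73, 72]

lower, upper = bootstrap_prediction_interval(mom, son, new_x=68)
print(lower, upper)
```

Resampling whole (mother, son) pairs, rather than the two columns separately, keeps each son's height attached to his own mother's height.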

Bootstrapping the scatter plot of mother/son heights

Note that each time we run this cell, the resulting line is slightly different!

Bootstrapping predictions: mother/son heights

If a mother is 68 inches tall, how tall do we predict her son to be?

Using the original dataset, and hence the original slope and intercept, we get a single prediction for the input of 68.

Using the bootstrapped slopes and intercepts, we get an interval of predictions for the input of 68.

How different could our prediction have been, for all inputs?

Here, we'll plot several of our bootstrapped lines. What do you notice?

Observations:

Prediction interval width vs. mother's height

Note that the closer a mother's height is to the mean mother's height, the narrower the prediction interval for her son's height is!
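
This effect can be checked with a short sketch that reuses one set of bootstrapped lines to compare interval widths at two inputs. The helper names, the made-up heights, and the choice of "10 inches above the mean" as the far input are assumptions for illustration.

```python
import random
from statistics import mean

def fit_line(x, y):
    """Return (slope, intercept) of the least squares regression line."""
    x_bar, y_bar = mean(x), mean(y)
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    intercept = y_bar - slope * x_bar
    return slope, intercept

def bootstrap_lines(x, y, repetitions=1000, seed=0):
    """Return a list of (slope, intercept) pairs, one per resample."""
    rng = random.Random(seed)
    n = len(x)
    lines = []
    while len(lines) < repetitions:
        idx = [rng.randrange(n) for _ in range(n)]
        boot_x = [x[i] for i in idx]
        boot_y = [y[i] for i in idx]
        if len(set(boot_x)) > 1:  # skip resamples with an undefined slope
            lines.append(fit_line(boot_x, boot_y))
    return lines

def interval_width(lines, new_x):
    """Return the width of the middle 95% of predictions at new_x."""
    preds = sorted(slope * new_x + intercept for slope, intercept in lines)
    lower = preds[int(0.025 * (len(preds) - 1))]
    upper = preds[int(0.975 * (len(preds) - 1))]
    return upper - lower

# Hypothetical heights in inches (not the lecture's dataset).
mom = [60, 61, 62, 63, 64, 65, 66, 67, 68, 69]
son = [65, 67, 66, 68, 70, 69, 71, 70, 73, 72]

lines = bootstrap_lines(mom, son)
width_near_mean = interval_width(lines, mean(mom))
width_far = interval_width(lines, mean(mom) + 10)
print(width_near_mean, width_far)
```

The bootstrapped lines all pass close to the point of averages, so they agree most near the mean mother's height and fan out as the input moves away from it.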

Summary, next time

Summary

Next time