Lecture 25 – Regression and Least Squares

DSC 10, Spring 2023

Announcements

Agenda

The regression line in standard units

Example: Predicting heights 👪 📏

Recall, in the last lecture, we aimed to use a mother's height to predict her adult son's height.

Correlation

Recall, the correlation coefficient $r$ of two variables $x$ and $y$ is defined as the average of the product of $x$ and $y$, when both are measured in standard units.
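
As a rough sketch, the correlation coefficient can be computed by converting both variables to standard units and averaging their products. The helper names `standard_units` and `correlation`, and the small arrays of heights, are illustrative assumptions rather than the lecture's actual code or data.

```python
import numpy as np

def standard_units(values):
    """Convert an array to standard units: (value - mean) / SD."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

def correlation(x, y):
    """Average of the product of x and y, both in standard units."""
    return np.mean(standard_units(x) * standard_units(y))

# Made-up heights in inches, for illustration only.
mother = np.array([60, 62, 64, 65, 67, 70])
son = np.array([66, 68, 67, 70, 71, 72])

r = correlation(mother, son)
print(r)
```

Note that `np.std` computes the population SD by default, which is what makes this average-of-products calculation agree with `np.corrcoef`.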

The regression line

The regression line in original units

Reflection

Each time we wanted to predict the height of an adult son given the height of his mother, we had to:

  1. Convert the mother's height from inches to standard units.
  2. Multiply by the correlation coefficient to predict the son's height in standard units.
  3. Convert the son's predicted height from standard units back to inches.
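
The three steps above can be sketched as a single function. The name `predict_son_height` and the heights below are made up for illustration; they are not the lecture's dataset.

```python
import numpy as np

# Made-up heights in inches, for illustration only.
mother = np.array([60, 62, 64, 65, 67, 70])
son = np.array([66, 68, 67, 70, 71, 72])

def predict_son_height(mother_height):
    """Predict a son's height (inches) by passing through standard units."""
    r = np.corrcoef(mother, son)[0, 1]
    # Step 1: convert the mother's height to standard units.
    mother_su = (mother_height - mother.mean()) / mother.std()
    # Step 2: multiply by r to predict the son's height in standard units.
    son_su = r * mother_su
    # Step 3: convert the prediction back to inches.
    return son_su * son.std() + son.mean()

prediction = predict_son_height(62)
print(prediction)
```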

This is inconvenient – wouldn't it be great if we could express the regression line itself in inches?

From standard units to original units

When $x$ and $y$ are in standard units, the regression line is given by

$$\text{predicted } y_{\text{(su)}} = r \cdot x_{\text{(su)}}$$

What is the regression line when $x$ and $y$ are in their original units (e.g. inches)?

The regression line in original units

$$\frac{\text{predicted } y - \text{mean of }y}{\text{SD of }y} = r \cdot \frac{x - \text{mean of } x}{\text{SD of }x}$$
$$\boxed{m = r \cdot \frac{\text{SD of } y}{\text{SD of }x}, \: \: b = \text{mean of } y - m \cdot \text{mean of } x}$$
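
For readers who want the intermediate algebra, the slope $m$ and intercept $b$ come from solving the standard-units equation for the predicted $y$. First multiply both sides by the SD of $y$, then add the mean of $y$ and group the terms that do not involve $x$:

$$\text{predicted } y - \text{mean of } y = r \cdot \frac{\text{SD of } y}{\text{SD of } x} \cdot (x - \text{mean of } x)$$
$$\text{predicted } y = \underbrace{r \cdot \frac{\text{SD of } y}{\text{SD of } x}}_{m} \cdot x + \underbrace{\text{mean of } y - m \cdot \text{mean of } x}_{b}$$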

Let's implement these formulas in code and try them out.

Below, we compute the slope and intercept of the regression line between mothers' heights and sons' heights (in inches).
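
A minimal sketch of that computation follows, assuming NumPy arrays of heights. The function names `correlation`, `slope`, and `intercept` and the data are illustrative assumptions, so the numbers printed here will not match the lecture's results.

```python
import numpy as np

def correlation(x, y):
    """Average of the product of x and y, both in standard units."""
    x_su = (x - np.mean(x)) / np.std(x)
    y_su = (y - np.mean(y)) / np.std(y)
    return np.mean(x_su * y_su)

def slope(x, y):
    """Slope of the regression line: r * (SD of y) / (SD of x)."""
    return correlation(x, y) * np.std(y) / np.std(x)

def intercept(x, y):
    """Intercept of the regression line: mean of y - m * mean of x."""
    return np.mean(y) - slope(x, y) * np.mean(x)

# Made-up heights in inches, for illustration only.
mother = np.array([60, 62, 64, 65, 67, 70])
son = np.array([66, 68, 67, 70, 71, 72])

m = slope(mother, son)
b = intercept(mother, son)
print(m, b)
```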

So, the regression line is

$$\text{predicted son's height in inches} = 0.365 \cdot \text{mother's height in inches} + 45.858$$

Making predictions

What's the predicted height of a son whose mother is 62 inches tall?

What if the mother is 55 inches tall? 73 inches tall?
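
Using the slope and intercept from the regression line above (0.365 and 45.858), these predictions can be sketched with a small function. The name `predict_son` is a hypothetical helper, not necessarily the one used in lecture.

```python
def predict_son(mother_height, m=0.365, b=45.858):
    """Predicted son's height in inches from the regression line m * x + b."""
    return m * mother_height + b

# Predictions for the three mothers' heights asked about above.
for height in (62, 55, 73):
    print(height, round(predict_son(height), 2))
```

The predictions come out to roughly 68.5, 65.9, and 72.5 inches, respectively.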

Outliers

The effect of outliers on correlation

Consider the dataset below. What is the correlation between $x$ and $y$?

Removing the outlier

Takeaway: Even a single outlier can have a massive impact on the correlation, and hence the regression line. Look for these before performing regression. Always visualize first!
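
To illustrate the takeaway, here is a small sketch with made-up data (not the dataset from lecture): nine points on a perfectly decreasing line, plus one far-away outlier.

```python
import numpy as np

# Made-up data: the last point is the outlier.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
y = np.array([9, 8, 7, 6, 5, 4, 3, 2, 1, 50])

# Correlation with and without the outlier.
r_with = np.corrcoef(x, y)[0, 1]
r_without = np.corrcoef(x[:-1], y[:-1])[0, 1]

print(r_with, r_without)
```

In this example the single outlier flips the correlation from perfectly negative to strongly positive, which a scatter plot would reveal immediately.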

Errors in prediction

Motivation

Example: Without the outlier

We think our regression line is pretty good because most data points are close to it: the red lines, which show the error of each prediction, are quite short.

Measuring the error in prediction

$$\text{error} = \text{actual value} - \text{prediction}$$

Root mean squared error (RMSE) of the regression line's predictions

First, let's compute the regression line's predictions for the entire dataset.

To find the RMSE, we need to start by finding the errors and squaring them.

Now, we need to find the mean of the squared errors, and take the square root of that. The result is the RMSE of the regression line's predictions.
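
These steps can be sketched as follows, assuming made-up heights rather than the lecture's dataset, so the resulting RMSE will differ from the value quoted in these notes.

```python
import numpy as np

# Made-up heights in inches, for illustration only.
mother = np.array([60, 62, 64, 65, 67, 70])
son = np.array([66, 68, 67, 70, 71, 72])

# Regression line from the slope and intercept formulas.
r = np.corrcoef(mother, son)[0, 1]
m = r * np.std(son) / np.std(mother)
b = np.mean(son) - m * np.mean(mother)

# Predictions for the entire dataset.
predictions = m * mother + b

# Errors (actual - predicted), then squared errors.
errors = son - predictions
squared_errors = errors ** 2

# Root of the mean of the squared errors.
rmse = np.sqrt(np.mean(squared_errors))
print(rmse)
```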

The RMSE of the regression line's predictions is about 2.2. Is this big or small, relative to the predictions of other lines? 🤔

Root mean squared error (RMSE) in an arbitrary line's predictions

Let's compute the RMSEs of several different lines on the same dataset.
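
One way to sketch this is with a function that takes any slope and intercept and returns the RMSE of that line's predictions. The data and the candidate lines below are made up for illustration; the signature of `rmse` here is an assumption and may differ from the one used in lecture.

```python
import numpy as np

# Made-up heights in inches, for illustration only.
mother = np.array([60, 62, 64, 65, 67, 70])
son = np.array([66, 68, 67, 70, 71, 72])

def rmse(slope, intercept):
    """RMSE of the line slope * x + intercept on the dataset above."""
    predictions = slope * mother + intercept
    return np.sqrt(np.mean((son - predictions) ** 2))

# Regression line, for comparison with the arbitrary lines.
r = np.corrcoef(mother, son)[0, 1]
m = r * np.std(son) / np.std(mother)
b = np.mean(son) - m * np.mean(mother)

candidate_lines = {
    "flat line at mean son height": (0, np.mean(son)),
    "slope 1, intercept 4": (1, 4),
    "regression line": (m, b),
}
results = {name: rmse(s, i) for name, (s, i) in candidate_lines.items()}
for name, value in results.items():
    print(name, value)
```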

Finding the "best" prediction line by minimizing RMSE

Aside: minimize

Finding the "best" prediction line by minimizing RMSE

We'll use minimize on rmse to find the slope and intercept of the line with the smallest RMSE.
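
The minimize helper used in lecture is not reproduced in these notes. As a stand-in, the sketch below uses `scipy.optimize.minimize`, which expects a function of a single array of parameters, so `rmse` here takes `params` rather than separate slope and intercept arguments. The data is made up, and SciPy is assumed to be installed.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up data, for illustration only.
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([3, 2, 5, 4, 7, 6], dtype=float)

def rmse(params):
    """RMSE of the line params[0] * x + params[1] on the data above."""
    slope, intercept = params
    predictions = slope * x + intercept
    return np.sqrt(np.mean((y - predictions) ** 2))

# Start the search from slope 0 and intercept 0.
result = minimize(rmse, x0=[0.0, 0.0])
best_slope, best_intercept = result.x

# Slope and intercept from the regression line formulas, for comparison.
r = np.corrcoef(x, y)[0, 1]
m = r * np.std(y) / np.std(x)
b = np.mean(y) - m * np.mean(x)

print(best_slope, best_intercept)
print(m, b)
```

The numerical search is approximate, so the two pairs should agree only up to a small tolerance.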

Do these numbers look familiar?

Coincidence?

The slopes and intercepts we got using both approaches look awfully similar... 👀

The regression line is the best line!

$$m = r \cdot \frac{\text{SD of } y}{\text{SD of }x}$$
$$b = \text{mean of } y - m \cdot \text{mean of } x$$

Quality of fit

Example: Non-linear data

What's the regression line for this dataset?

This line doesn't fit the data at all!
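
As an illustration of how this can happen, consider a made-up parabola (not necessarily the dataset shown in lecture). The relationship between $x$ and $y$ is perfect, yet the correlation is essentially zero, so the regression line is flat.

```python
import numpy as np

# Made-up non-linear data: a parabola symmetric around x = 0.
x = np.arange(-5, 6)
y = x ** 2

# Regression line from the slope and intercept formulas.
r = np.corrcoef(x, y)[0, 1]
m = r * np.std(y) / np.std(x)
b = np.mean(y) - m * np.mean(x)

print(r, m, b)
```

Here the regression line predicts the mean of $y$ for every $x$, which ignores the curve entirely.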

Residuals

$$\text{error} = \text{actual } y - \text{predicted } y$$
$$\text{residual} = \text{actual } y - \text{predicted } y \text{ by regression line}$$
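
A short sketch of computing residuals, again assuming a made-up parabola as the dataset. The residuals average to zero, but they are positive at the ends and negative in the middle, a pattern that suggests a line is a poor fit.

```python
import numpy as np

# Made-up non-linear data: a parabola symmetric around x = 0.
x = np.arange(-5, 6)
y = x ** 2

# Regression line from the slope and intercept formulas.
r = np.corrcoef(x, y)[0, 1]
m = r * np.std(y) / np.std(x)
b = np.mean(y) - m * np.mean(x)

# Residual = actual y - predicted y by the regression line.
residuals = y - (m * x + b)
print(residuals)
```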

Summary, next time

Summary

$$m = r \cdot \frac{\text{SD of } y}{\text{SD of }x}$$
$$b = \text{mean of } y - m \cdot \text{mean of } x$$

Next time