Lecture 25 – Regression and Least Squares

DSC 10, Fall 2022

Announcements

Agenda

The regression line, in standard units

Example: Predicting heights 👪 📏

Recall, in the last lecture, we aimed to use a mother's height to predict her adult son's height.

Correlation

Recall, the correlation coefficient $r$ of two variables $x$ and $y$ is defined as the average of the product of $x$ and $y$, when both are measured in standard units.
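As a minimal sketch, $r$ can be computed as the average of the product of $x$ and $y$ once both are in standard units. The helper names `standard_units` and `correlation` and the sample data are assumptions made for illustration; they are not the lecture's code or dataset.

```python
from statistics import fmean, pstdev

def standard_units(values):
    """Convert a list of numbers to standard units (mean 0, SD 1)."""
    mean = fmean(values)
    sd = pstdev(values)  # population SD: divides by n, not n - 1
    return [(v - mean) / sd for v in values]

def correlation(x, y):
    """Average of the product of x and y, both in standard units."""
    x_su = standard_units(x)
    y_su = standard_units(y)
    return fmean([a * b for a, b in zip(x_su, y_su)])

# Made-up data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = correlation(x, y)
print(round(r, 4))
```

The sketch uses the population SD so that each variable in standard units has an SD of exactly 1.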

The regression line

Making predictions in standard units

Making predictions in original units

Of course, we'd like to be able to predict a son's height in inches, not just in standard units. Given a mother's height in inches, here's how we'll predict her son's height in inches:

  1. Convert the mother's height from inches to standard units.
$$x_{i \: \text{(su)}} = \frac{x_i - \text{mean of $x$}}{\text{SD of $x$}}$$
  1. Multiply by the correlation coefficient to predict the son's height in standard units.
$$\text{predicted } y_{i \: \text{(su)}} = r \cdot x_{i \: \text{(su)}}$$
  1. Convert the son's predicted height from standard units back to inches.
$$\text{predicted } y_i = \text{predicted } y_{i \: \text{(su)}} \cdot \text{SD of $y$} + \text{mean of $y$}$$
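The three steps above can be sketched as a single function. The function name and the summary statistics below are hypothetical, chosen so the arithmetic is easy to follow; they are not the values from the heights dataset.

```python
def predict_in_original_units(x_value, mean_x, sd_x, mean_y, sd_y, r):
    """Predict y from x by passing through standard units."""
    # Step 1: convert x from original units to standard units.
    x_su = (x_value - mean_x) / sd_x
    # Step 2: multiply by r to predict y in standard units.
    predicted_y_su = r * x_su
    # Step 3: convert the prediction back to original units.
    return predicted_y_su * sd_y + mean_y

# Hypothetical summary statistics (not the lecture's data).
prediction = predict_in_original_units(
    66.5, mean_x=64, sd_x=2.5, mean_y=69, sd_y=3, r=0.5
)
print(prediction)
```

Here the input is exactly 1 SD above the mean of $x$, so the prediction is $r = 0.5$ SDs above the mean of $y$.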

Concept Check ✅ – Answer at cc.dsc10.com

A course has a midterm (mean 80, standard deviation 15) and a really hard final (mean 50, standard deviation 12).

If the scatter plot of midterm and final scores shows a linear association with correlation 0.75, what is the predicted final exam score for a student who received a 90 on the midterm?

The regression line, in original units

Reflection

Each time we wanted to predict the height of an adult son given the height of a mother, we had to:

  1. Convert the mother's height from inches to standard units.
  1. Multiply by the correlation coefficient to predict the son's height in standard units.
  1. Convert the son's predicted height from standard units back to inches.

This is inconvenient – wouldn't it be great if we could express the regression line itself in inches?

From standard units to original units

When $x$ and $y$ are in standard units, the regression line is given by

$$\text{predicted } y_{\text{(su)}} = r \cdot x_{\text{(su)}}$$

What is the regression line when $x$ and $y$ are in their original units?

The regression line in original units

$$\frac{\text{predicted } y - \text{mean of }y}{\text{SD of }y} = r \cdot \frac{x - \text{mean of } x}{\text{SD of }x}$$
$$m = r \cdot \frac{\text{SD of } y}{\text{SD of }x}, \: \: b = \text{mean of } y - m \cdot \text{mean of } x$$
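
To see where $m$ and $b$ come from, solve the first equation for $\text{predicted } y$. Multiplying both sides by $\text{SD of } y$ and then adding $\text{mean of } y$ gives

$$\text{predicted } y = r \cdot \frac{\text{SD of } y}{\text{SD of } x} \cdot (x - \text{mean of } x) + \text{mean of } y$$

Expanding the product and grouping the terms that do not involve $x$:

$$\text{predicted } y = \underbrace{r \cdot \frac{\text{SD of } y}{\text{SD of } x}}_{m} \cdot x + \underbrace{\text{mean of } y - r \cdot \frac{\text{SD of } y}{\text{SD of } x} \cdot \text{mean of } x}_{b}$$

This has the form $\text{predicted } y = m x + b$, with $m$ and $b$ as stated above.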

Let's implement these formulas in code and try them out.

Below, we compute the slope and intercept of the regression line between mothers' heights and sons' heights (in inches).
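The lecture's own code and the heights data are not reproduced here. As a rough sketch of how the formulas for $m$ and $b$ translate to code, the block below uses hypothetical helpers `correlation`, `slope`, and `intercept` on a small made-up sample, so its output will not match the slope and intercept quoted for the heights data.

```python
from statistics import fmean, pstdev

def correlation(x, y):
    """Average of the product of x and y, both in standard units."""
    mean_x, mean_y = fmean(x), fmean(y)
    sd_x, sd_y = pstdev(x), pstdev(y)
    return fmean([((a - mean_x) / sd_x) * ((b - mean_y) / sd_y)
                  for a, b in zip(x, y)])

def slope(x, y):
    """m = r * (SD of y) / (SD of x)."""
    return correlation(x, y) * pstdev(y) / pstdev(x)

def intercept(x, y):
    """b = (mean of y) - m * (mean of x)."""
    return fmean(y) - slope(x, y) * fmean(x)

# Made-up data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
m = slope(x, y)
b = intercept(x, y)
print(round(m, 3), round(b, 3))
```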

So, the regression line is

$$\text{predicted son's height} = 0.365 \cdot \text{mother's height} + 45.858$$

Making predictions

What's the predicted height of a son whose mother is 62 inches tall?

What if the mother is 55 inches tall? 73 inches tall?
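
A minimal sketch of these predictions, using the rounded slope and intercept from the regression line above. The function name `predict_son_height` is an assumption; because the coefficients are rounded, the results are approximate.

```python
def predict_son_height(mother_height):
    """Predicted son's height in inches, from the rounded regression line."""
    return 0.365 * mother_height + 45.858

for mother_height in [62, 55, 73]:
    predicted = predict_son_height(mother_height)
    print(mother_height, round(predicted, 2))
```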

Outliers

The effect of outliers on correlation

Consider the dataset below. What is the correlation between $x$ and $y$?

Removing the outlier

Takeaway: Even a single outlier can have a massive impact on the correlation, and hence the regression line. Look for these before performing regression. Always visualize first!
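
To illustrate the takeaway, the sketch below computes the correlation of a small made-up dataset with and without one extreme point. The data and the `correlation` helper are assumptions for illustration, not the dataset shown in lecture.

```python
from statistics import fmean, pstdev

def correlation(x, y):
    """Average of the product of x and y, both in standard units."""
    mean_x, mean_y = fmean(x), fmean(y)
    sd_x, sd_y = pstdev(x), pstdev(y)
    return fmean([((a - mean_x) / sd_x) * ((b - mean_y) / sd_y)
                  for a, b in zip(x, y)])

# Five made-up points that lie close to a straight line.
x_clean = [1, 2, 3, 4, 5]
y_clean = [1.1, 1.9, 3.2, 3.9, 5.1]
r_clean = correlation(x_clean, y_clean)

# The same points plus a single made-up outlier at (10, -10).
x_outlier = x_clean + [10]
y_outlier = y_clean + [-10]
r_outlier = correlation(x_outlier, y_outlier)

print(round(r_clean, 3), round(r_outlier, 3))
```

In this sample, one added point is enough to flip the correlation from strongly positive to negative.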

Errors in prediction

Motivation

Example: Without the outlier

We think our regression line is pretty good because most data points are pretty close to the regression line. The red lines are quite short.

Measuring the error in prediction

$$\text{error} = \text{actual value} - \text{prediction}$$
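
A short sketch of this definition on made-up data. The line $0.6x + 2.2$ is the regression line for this particular sample; the data and variable names are assumptions for illustration.

```python
# Made-up data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Regression line for this made-up sample.
slope, intercept = 0.6, 2.2

predictions = [slope * xi + intercept for xi in x]

# error = actual value - prediction, one error per data point.
errors = [actual - predicted for actual, predicted in zip(y, predictions)]
print([round(e, 2) for e in errors])
```

Some errors are positive and some are negative, so they can cancel when added up. Squaring each error before averaging, as RMSE does, avoids that cancellation.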

Root mean squared error (RMSE) of the regression line's predictions

The RMSE of the regression line's predictions is about 2.2. Is this big or small, relative to the predictions of other lines? 🤔

Root mean squared error (RMSE) in an arbitrary line's predictions

Let's compute the RMSEs of several different lines on the same dataset.
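
As a sketch under stated assumptions, the function below computes the RMSE of any line with a given slope and intercept, then compares three candidate lines. The signature of `rmse` here (taking the data as arguments) and the sample data are assumptions; the lecture's own function and dataset may differ.

```python
from statistics import fmean

def rmse(slope, intercept, x, y):
    """Root mean squared error of the line y = slope * x + intercept."""
    errors = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
    squared_errors = [e ** 2 for e in errors]
    return fmean(squared_errors) ** 0.5

# Made-up data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

rmse_regression = rmse(0.6, 2.2, x, y)  # regression line for this sample
rmse_identity = rmse(1, 0, x, y)        # the line y = x
rmse_flat = rmse(0, 4, x, y)            # flat line at the mean of y

print(round(rmse_regression, 3), round(rmse_identity, 3), round(rmse_flat, 3))
```

Of these three lines, the regression line has the smallest RMSE on this sample.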

Finding the "best" prediction line by minimizing RMSE

Aside: minimize

Finding the "best" prediction line by minimizing RMSE

We'll use minimize on rmse to find the slope and intercept of the line with the smallest RMSE.

Do these numbers look familiar?

Coincidence?

The slopes and intercepts we got using both approaches look awfully similar... 👀

The regression line is the best line!

$$m = r \cdot \frac{\text{SD of } y}{\text{SD of }x}$$
$$b = \text{mean of } y - m \cdot \text{mean of } x$$
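
This claim can be checked numerically. The course's minimize function is not shown here, so the sketch below swaps in a different technique, plain gradient descent on the mean squared error, as a stand-in minimizer. Minimizing the mean squared error also minimizes RMSE, since the square root is increasing. The function names, step size, iteration count, and data are all assumptions for illustration.

```python
from statistics import fmean, pstdev

def formula_line(x, y):
    """Slope and intercept from the regression formulas."""
    mean_x, mean_y = fmean(x), fmean(y)
    sd_x, sd_y = pstdev(x), pstdev(y)
    r = fmean([((a - mean_x) / sd_x) * ((b - mean_y) / sd_y)
               for a, b in zip(x, y)])
    m = r * sd_y / sd_x
    b = mean_y - m * mean_x
    return m, b

def minimize_by_gradient_descent(x, y, step=0.05, iterations=5000):
    """Stand-in minimizer: gradient descent on the mean squared error."""
    m, b = 0.0, 0.0  # arbitrary starting line
    for _ in range(iterations):
        errors = [yi - (m * xi + b) for xi, yi in zip(x, y)]
        # Partial derivatives of the mean squared error.
        grad_m = -2 * fmean([e * xi for e, xi in zip(errors, x)])
        grad_b = -2 * fmean(errors)
        m -= step * grad_m
        b -= step * grad_b
    return m, b

# Made-up data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

m_formula, b_formula = formula_line(x, y)
m_search, b_search = minimize_by_gradient_descent(x, y)
print(round(m_formula, 4), round(b_formula, 4))
print(round(m_search, 4), round(b_search, 4))
```

The step size is tuned to this small sample; other data may need a smaller step or more iterations to converge.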

Quality of fit

Example: Non-linear data

What's the regression line for this dataset?

This line doesn't fit the data at all!
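
A small sketch of how this can happen, using made-up points on the parabola $y = x^2$ rather than the lecture's dataset. The helper name `regression_line` is an assumption.

```python
from statistics import fmean, pstdev

def regression_line(x, y):
    """Return (r, slope, intercept) using the regression formulas."""
    mean_x, mean_y = fmean(x), fmean(y)
    sd_x, sd_y = pstdev(x), pstdev(y)
    r = fmean([((a - mean_x) / sd_x) * ((b - mean_y) / sd_y)
               for a, b in zip(x, y)])
    m = r * sd_y / sd_x
    b = mean_y - m * mean_x
    return r, m, b

# Made-up points that follow y = x ** 2 exactly.
x = [-2, -1, 0, 1, 2]
y = [xi ** 2 for xi in x]

r, m, b = regression_line(x, y)
print(round(abs(r), 3), round(abs(m), 3), round(b, 3))
```

Although $y$ is completely determined by $x$ in this sample, the correlation is 0 and the regression line is flat at the mean of $y$. The correlation coefficient only measures linear association.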

Summary, next time

Summary

$$m = r \cdot \frac{\text{SD of } y}{\text{SD of }x}$$
$$b = \text{mean of } y - m \cdot \text{mean of } x$$

Next time