In [1]:

```
# Run this cell to set up packages for lecture.
from lec25_imports import *
```

- Quiz 6 is **today in discussion**.
- It covers Lectures 21-24 (starting with Permutation Testing).
- Practice with the problems here.

- The Final Project is due **tomorrow at 11:59PM**.
- If one or both partners have run out of slip days and you submit the project late, we will reallocate slip days towards the final project, away from lesser-weighted assignments. See the syllabus for more details.

- Lab 7 is due on **Thursday at 11:59PM**.
- Even if you don't need to do this lab for your grade, it's the only assignment on regression, which will be tested on the Final Exam.

- The Final Exam is **this Saturday 3/16 from 7-10PM**. More details to come.
- Collaborative study session on Friday 3/15 from 5-8PM in Solis 104.

- If at least 75% of the class fills out both SETs and the internal End-of-Quarter Survey, then the entire class will have **1% of extra credit added to their overall grade**. We value your feedback!
- **Today is the last day of new material. The next two days are for review!**
- We'll be working through the Fall 2023 Final Exam on Wednesday. Read the data description and attempt the problems on your own before then!

- Residuals.
- Inference for regression.

- The regression line describes the "best linear fit" for a given dataset.
- The formulas for the slope and intercept work no matter what the shape of the data is.
- However, the line is only meaningful if the relationship between $x$ and $y$ is roughly linear.

In [2]:

```
non_linear()
```

This line doesn't fit the data at all, despite being the "best" line for the data!

- Any set of predictions has *errors*.

- When using the regression line to make predictions, the errors are called **residuals**.

- There is one residual corresponding to each data point $(x, y)$ in the dataset.

In [3]:

```
def predicted(df, x, y):
    m = slope(df, x, y)
    b = intercept(df, x, y)
    return m * df.get(x) + b

def residual(df, x, y):
    return df.get(y) - predicted(df, x, y)
```
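The `slope` and `intercept` functions used above come from `lec25_imports`. As a sketch of what they might look like (an assumption — the actual implementations may differ), they follow the standard-units formulas from the regression lectures. Plain pandas stands in for babypandas here, since babypandas is a subset of it:

```python
import pandas as pd

def standard_units(col):
    # Convert a Series to standard units: mean 0, SD 1 (population SD).
    return (col - col.mean()) / col.std(ddof=0)

def calculate_r(df, x, y):
    # Correlation coefficient: average product of x and y in standard units.
    return (standard_units(df.get(x)) * standard_units(df.get(y))).mean()

def slope(df, x, y):
    # Slope of the regression line: r * (SD of y) / (SD of x).
    return calculate_r(df, x, y) * df.get(y).std(ddof=0) / df.get(x).std(ddof=0)

def intercept(df, x, y):
    # Intercept: the regression line passes through the point of averages.
    return df.get(y).mean() - slope(df, x, y) * df.get(x).mean()

# Tiny sanity check on a perfectly linear dataset, y = 2x + 1.
example = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [3, 5, 7, 9]})
```

On perfectly linear data, `calculate_r` returns 1, and the fitted line recovers the slope and intercept exactly.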

Is the association between `'mom'` and `'son'` linear?

- If there is a linear association, is it strong?
- We can answer this using the correlation coefficient.
- Close to 0 = weak, close to -1/+1 = strong.

- Is "linear" the best description of the association between `'mom'` and `'son'`? **We'll use residuals to answer this question.**

In [4]:

```
galton = bpd.read_csv('data/galton.csv')
male_children = galton[galton.get('gender') == 'male']
mom_son = bpd.DataFrame().assign(mom = male_children.get('mother'),
                                 son = male_children.get('childHeight'))
mom_son_predictions = mom_son.assign(predicted=predicted(mom_son, 'mom', 'son'),
                                     residuals=residual(mom_son, 'mom', 'son'))
plot_regression_line(mom_son_predictions, 'mom', 'son', resid=True)
```

Correlation: 0.3230049836849053

The residual plot of a regression line is the scatter plot with the $x$ variable on the $x$-axis and residuals on the $y$-axis.

$$\text{residual} = \text{actual } y - \text{predicted } y \text{ by regression line}$$

- Residual plots describe how the error in the regression line's predictions varies.

**Key idea: If a linear fit is good, the residual plot should look like a patternless "cloud" ☁️.**
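One reason the "patternless cloud" check works: for a least-squares line, the residuals are mathematically guaranteed to average to 0 and to be uncorrelated with $x$, so any leftover pattern reflects the shape of the data, not the fitting procedure. A quick check on synthetic data (numpy used here for illustration; the data is made up):

```python
import numpy as np

# Synthetic, roughly linear data (an assumption for illustration):
# height-like x values with y = 0.5x + 35 plus random noise.
rng = np.random.default_rng(0)
x = np.linspace(60, 70, 100)
y = 0.5 * x + 35 + rng.normal(0, 2, size=100)

# Fit the least-squares line and compute its residuals.
m, b = np.polyfit(x, y, 1)
residuals = y - (m * x + b)

# By construction, the residuals have mean 0 and zero correlation with x.
mean_resid = residuals.mean()
corr_with_x = np.corrcoef(x, residuals)[0, 1]
```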

In [5]:

```
mom_son_predictions.plot(kind='scatter', x='mom', y='residuals', s=50, c='purple', figsize=(10, 5), label='residuals')
plt.axhline(0, linewidth=3, color='k', label='y = 0')
plt.title('Residual plot for predicting son\'s height based on mother\'s height')
plt.legend();
```

- Consider the hybrid cars dataset from earlier.
- Let's look at a regression line that uses `'mpg'` to predict `'price'`.

In [6]:

```
hybrid = bpd.read_csv('data/hybrid.csv')
mpg_price = hybrid.assign(
    predicted=predicted(hybrid, 'mpg', 'price'),
    residuals=residual(hybrid, 'mpg', 'price')
)
mpg_price
```

Out[6]:

| | vehicle | year | price | acceleration | mpg | class | predicted | residuals |
|---|---|---|---|---|---|---|---|---|
| 0 | Prius (1st Gen) | 1997 | 24509.74 | 7.46 | 41.26 | Compact | 32609.64 | -8099.90 |
| 1 | Tino | 2000 | 35354.97 | 8.20 | 54.10 | Compact | 19278.39 | 16076.58 |
| 2 | Prius (2nd Gen) | 2000 | 26832.25 | 7.97 | 45.23 | Compact | 28487.75 | -1655.50 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 150 | C-Max Energi Plug-in | 2013 | 32950.00 | 11.76 | 43.00 | Midsize | 30803.06 | 2146.94 |
| 151 | Fusion Energi Plug-in | 2013 | 38700.00 | 11.76 | 43.00 | Midsize | 30803.06 | 7896.94 |
| 152 | Chevrolet Volt | 2013 | 39145.00 | 11.11 | 37.00 | Compact | 37032.62 | 2112.38 |

153 rows × 8 columns

In [7]:

```
# Plot of the original data and regression line.
plot_regression_line(hybrid, 'mpg', 'price');
print('Correlation:', calculate_r(hybrid, 'mpg', 'price'))
```

Correlation: -0.5318263633683786

In [8]:

```
# Residual plot.
mpg_price.plot(kind='scatter', x='mpg', y='residuals', figsize=(10, 5), s=50, color='purple', label='residuals')
plt.axhline(0, linewidth=3, color='k', label='y = 0')
plt.title('Residual plot for regression between mpg and price')
plt.legend();
```

- As `'mpg'` increases, the residuals go from being mostly large, to being mostly small, to being mostly large again. That's a pattern!

- Patterns in the residual plot imply that the relationship between $x$ and $y$ may not be linear.
- While this can be spotted in the original scatter plot, it may be easier to identify in the residual plot.

- In such cases, a curve may be a better choice than a line for prediction.
- In future courses, you'll learn how to extend linear regression to work for polynomials and other types of mathematical functions.
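As a preview of that idea, numpy's `polyfit` (not part of this course's toolkit) can fit a polynomial by least squares. On synthetic quadratic data (made up for illustration), the degree-2 fit drives the residuals to zero while the straight line cannot:

```python
import numpy as np

# Synthetic data with a quadratic relationship (an assumption for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = 3 * x ** 2 - 2 * x + 1

# Least-squares fits: degree 1 (a line) vs. degree 2 (a parabola).
linear_coeffs = np.polyfit(x, y, 1)
quad_coeffs = np.polyfit(x, y, 2)

# Residuals from each fit.
linear_resid = y - np.polyval(linear_coeffs, x)
quad_resid = y - np.polyval(quad_coeffs, x)
```

The quadratic fit recovers the true coefficients (3, -2, 1) almost exactly, so its residuals are essentially zero; the linear fit leaves large, patterned residuals.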

**`'mpg'` and `'acceleration'`** ⛽

- Let's fit a regression line that predicts `'mpg'` given `'acceleration'`.
- Let's then look at the resulting residual plot.

In [9]:

```
accel_mpg = hybrid.assign(
    predicted=predicted(hybrid, 'acceleration', 'mpg'),
    residuals=residual(hybrid, 'acceleration', 'mpg')
)
accel_mpg
```

Out[9]:

| | vehicle | year | price | acceleration | mpg | class | predicted | residuals |
|---|---|---|---|---|---|---|---|---|
| 0 | Prius (1st Gen) | 1997 | 24509.74 | 7.46 | 41.26 | Compact | 43.29 | -2.03 |
| 1 | Tino | 2000 | 35354.97 | 8.20 | 54.10 | Compact | 41.90 | 12.20 |
| 2 | Prius (2nd Gen) | 2000 | 26832.25 | 7.97 | 45.23 | Compact | 42.33 | 2.90 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 150 | C-Max Energi Plug-in | 2013 | 32950.00 | 11.76 | 43.00 | Midsize | 35.17 | 7.83 |
| 151 | Fusion Energi Plug-in | 2013 | 38700.00 | 11.76 | 43.00 | Midsize | 35.17 | 7.83 |
| 152 | Chevrolet Volt | 2013 | 39145.00 | 11.11 | 37.00 | Compact | 36.40 | 0.60 |

153 rows × 8 columns

In [10]:

```
# Plot of the original data and regression line.
plot_regression_line(accel_mpg, 'acceleration', 'mpg')
print('Correlation:', calculate_r(accel_mpg, 'acceleration', 'mpg'))
```

Correlation: -0.5060703843771186

In [11]:

```
# Residual plot.
accel_mpg.plot(kind='scatter', x='acceleration', y='residuals', figsize=(10, 5), s=50, color='purple', label='residuals')
plt.axhline(0, linewidth=3, color='k', label='y = 0')
plt.title('Residual plot for regression between acceleration and mpg')
plt.legend();
```

- The vertical spread of the residuals is **not** similar at all points on the $x$-axis.

- If the vertical spread in a residual plot is uneven, it implies that the regression line's predictions aren't equally reliable for all inputs.
- This doesn't necessarily mean that fitting a non-linear curve would be better; it just impacts how we interpret the regression line's predictions.
- For instance, in the previous plot, we see that the regression line's predictions for cars with larger accelerations are more reliable than those for cars with lower accelerations.

- The formal term for "uneven spread" is **heteroscedasticity**.
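One way to see heteroscedasticity numerically (a sketch on synthetic data; not a method from this course) is to compare the SD of the residuals on the low and high ends of the $x$-axis:

```python
import numpy as np

# Synthetic heteroscedastic data (an assumption for illustration):
# the noise SD grows proportionally with x.
rng = np.random.default_rng(42)
x = np.linspace(1, 10, 200)
noise = rng.normal(0, x)        # per-point noise SD equal to x
y = 2 * x + noise

# Fit the least-squares line and compute residuals.
m, b = np.polyfit(x, y, 1)
residuals = y - (m * x + b)

# Compare residual spread for small-x vs. large-x points.
low_spread = residuals[x < 5].std()
high_spread = residuals[x >= 5].std()
```

Here `high_spread` comes out noticeably larger than `low_spread`, matching what the eye sees in the residual plot: predictions are less reliable where the spread is larger.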

- All 4 datasets have the same mean of $x$, mean of $y$, SD of $x$, SD of $y$, and correlation.
- Therefore, they have the same regression line because the slope and intercept of the regression line are determined by these 5 quantities.

- But they all look very different! Not all of them contain linear associations.
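The classic example of this phenomenon is Anscombe's quartet. Below are the first two of its four datasets (values quoted from Anscombe's 1973 paper); they share the same $x$ values and, up to rounding, the same correlation and the same regression line $y = 0.5x + 3$, despite looking completely different when plotted:

```python
import numpy as np

# Two of the four datasets from Anscombe's quartet (published values).
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

# Both have (nearly) identical correlation with x...
r1 = np.corrcoef(x, y1)[0, 1]
r2 = np.corrcoef(x, y2)[0, 1]

# ...and (nearly) identical regression lines, y = 0.5x + 3.
m1, b1 = np.polyfit(x, y1, 1)
m2, b2 = np.polyfit(x, y2, 1)
```

`y1` is a noisy line, while `y2` is a smooth curve; the summary statistics alone cannot tell them apart.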

In [12]:

```
dino = bpd.read_csv('data/Datasaurus_data.csv')
dino
```

Out[12]:

| | x | y |
|---|---|---|
| 0 | 55.38 | 97.18 |
| 1 | 51.54 | 96.03 |
| 2 | 46.15 | 94.49 |
| ... | ... | ... |
| 139 | 50.00 | 95.77 |
| 140 | 47.95 | 95.00 |
| 141 | 44.10 | 92.69 |

142 rows × 2 columns

In [13]:

```
calculate_r(dino, 'x', 'y')
```

Out[13]:

-0.06447185270095163

In [14]:

```
slope(dino, 'x', 'y')
```

Out[14]:

-0.10358250243265595

In [15]:

```
intercept(dino, 'x', 'y')
```

Out[15]:

53.452978449229235

In [16]:

```
plot_regression_line(dino, 'x', 'y');
```