In [1]:

```
# Run this cell to set up packages for lecture.
from lec23_imports import *
```

- Make a **1-on-1 appointment with a tutor** to review Quiz 5 or prepare for the Final Exam.
    - Sessions are open to everyone on a first-come, first-served basis.
    - Sign up here.

- Homework 6 is due **tomorrow at 11:59PM**.
- **Extra practice session is on Friday**.
    - Problems will be posted on practice.dsc10.com before then.

- Quiz 6 is **Monday in discussion**.
    - It covers Lectures 21-24 (starting with Permutation Testing).
    - Practice by solving relevant problems on practice.dsc10.com.

- The Final Project is due **Tuesday at 11:59PM**.
    - You can use slip days to extend this deadline. Read the syllabus policy to learn what happens if you use more than 6 slip days.

I am out of town on Friday, so instead of holding lecture as usual, I am posting a recording of Lecture 24 for you to watch asynchronously. Please do not come to class on Friday!

- Recap: Statistical inference.
- Association.
- Correlation.
- Regression.

At a high level, the second half of this class has been about **statistical inference** – using a sample to draw conclusions about the population.

- For the remainder of the quarter, we'll switch our focus to **prediction** – given a sample, what can I predict about data not in that sample?

- Specifically, we'll focus on **linear regression**, a prediction technique that tries to find the best "linear relationship" between two numerical variables.
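As a minimal preview of what "best linear relationship" means (my sketch with made-up data, not the course's dataset or notation), a least-squares line can be fit with `np.polyfit`:

```python
import numpy as np

# Hypothetical data that roughly follows y = 3x + 5, plus noise.
rng = np.random.default_rng(23)
x = rng.uniform(0, 10, size=200)
y = 3 * x + 5 + rng.normal(0, 1, size=200)

# Degree-1 polynomial fit = the least-squares ("best-fit") line.
m, b = np.polyfit(x, y, deg=1)
print(m, b)  # close to the true slope 3 and intercept 5
```

We'll develop what "best" means, and how this connects to correlation, in the coming lectures.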

- Along the way, we'll address another idea – **correlation**.

- Suppose we have a dataset with at least two numerical variables.

- We're interested in **predicting** one variable based on another:
    - Given my education level, what is my income?
    - Given my height, how tall will my kid be as an adult?
    - Given my age, how many countries have I visited?

- To do this effectively, we need to first observe a pattern between the two numerical variables.

- To see if a pattern exists, we'll need to draw a scatter plot.

- An **association** is any relationship or link 🔗 between two variables in a **scatter plot**. Associations can be linear or non-linear.

In [2]:

```
hybrid = bpd.read_csv('data/hybrid.csv')
hybrid
```

Out[2]:

| | vehicle | year | price | acceleration | mpg | class |
|---|---|---|---|---|---|---|
| 0 | Prius (1st Gen) | 1997 | 24509.74 | 7.46 | 41.26 | Compact |
| 1 | Tino | 2000 | 35354.97 | 8.20 | 54.10 | Compact |
| 2 | Prius (2nd Gen) | 2000 | 26832.25 | 7.97 | 45.23 | Compact |
| ... | ... | ... | ... | ... | ... | ... |
| 150 | C-Max Energi Plug-in | 2013 | 32950.00 | 11.76 | 43.00 | Midsize |
| 151 | Fusion Energi Plug-in | 2013 | 38700.00 | 11.76 | 43.00 | Midsize |
| 152 | Chevrolet Volt | 2013 | 39145.00 | 11.11 | 37.00 | Compact |

153 rows × 6 columns

`'price'` vs. `'acceleration'`

Is there an association between these two variables? If so, what kind?

(Note: When looking at a scatter plot, we often describe it in the form "$y$ vs. $x$.")

In [3]:

```
hybrid.plot(kind='scatter', x='acceleration', y='price');
```

`'price'` vs. `'mpg'`

Is there an association between these two variables? If so, what kind?

In [4]:

```
hybrid.plot(kind='scatter', x='mpg', y='price');
```

- There is a negative association – cars with better fuel economy tended to be cheaper.
    - Why do we think that's the case?
    - Is this always the case today, with the advent of expensive electric cars?

- The association looks more curved than linear.
    - It may roughly follow $y \approx \frac{1}{x}$.
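One hedged way to check a $y \approx \frac{1}{x}$ pattern (sketched here on made-up numbers, since this snippet doesn't load `hybrid.csv`): if price behaves like $a \cdot \frac{1}{\text{mpg}} + b$, then price is *linear* in $\frac{1}{\text{mpg}}$, so a straight-line fit on the transformed variable works.

```python
import numpy as np

# Made-up data (not the real hybrid dataset): price falls off like 1/mpg.
rng = np.random.default_rng(10)
mpg = rng.uniform(20, 60, size=150)
price = 900_000 / mpg + 10_000 + rng.normal(0, 1_000, size=150)

# Since price ≈ a * (1/mpg) + b is linear in 1/mpg,
# fit a line to (1/mpg, price) instead of (mpg, price).
a, b = np.polyfit(1 / mpg, price, deg=1)
print(a, b)
```

If the transformed scatter plot of `price` vs. `1 / mpg` looks like a line, that supports the curved $\frac{1}{x}$ description of the original plot.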

Just for fun, we can look at an interactive version of the previous plot. Hover over a point to see the name of the corresponding car.

In [5]:

```
px.scatter(hybrid.to_df(), x='mpg', y='price', hover_name='vehicle')
```