Lecture 24 – Correlation

DSC 10, Fall 2022

Announcements

Agenda

Recap: Statistical inference

Four big ideas in statistical inference

Every statistical test and simulation we've run in the second half of the class relates to one of the following four ideas. To solidify your understanding, review past lectures and assignments and identify which of these four ideas each one illustrates.

Recent events

Questions to think about:

Association

Prediction

Association

Example: Hybrid cars 🚗

'acceleration' and 'price'

Is there an association between these two variables? If so, what kind?

'mpg' and 'price'

Is there an association between these two variables? If so, what kind?

Observations:

Linear changes in units

Converting columns to standard units

Standard units for hybrid cars

For a given pair of variables:

'acceleration' and 'price'

Which cars have 'acceleration's and 'price's that are more than 2 SDs above average?

'mpg' and 'price'

Which cars have close to average 'mpg's and close to average 'price's?

Observation on associations in standard units

When there is a positive association, most data points fall in the lower left and upper right quadrants.

When there is a negative association, most data points fall in the upper left and lower right quadrants.
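The quadrant observation can be checked numerically. The sketch below uses made-up, randomly generated data (not the hybrid cars dataset) with a positive association, converts both variables to standard units, and measures the fraction of points whose two standard-unit values share a sign, meaning they lie in the lower left or upper right quadrant. The variable names are illustrative assumptions.

```python
import numpy as np

# Hypothetical synthetic data with a positive association (not the hybrid cars data).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = x + rng.normal(scale=0.5, size=200)

# Convert both variables to standard units.
x_su = (x - x.mean()) / x.std()
y_su = (y - y.mean()) / y.std()

# A point is in the lower left or upper right quadrant exactly when
# its two standard-unit values have the same sign (positive product).
same_sign = (x_su * y_su) > 0
fraction_same_sign = same_sign.mean()
print(fraction_same_sign)
```

For data with a negative association, the same calculation would be expected to give a fraction below one half.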

Correlation

Definition: Correlation coefficient

The correlation coefficient $r$ of two variables $x$ and $y$ is defined as the average of the product of $x$ and $y$, when both are measured in standard units.

If x and y are two Series or arrays,

r = (x_su * y_su).mean()

where x_su and y_su are x and y converted to standard units.
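As a minimal sketch of this definition, the code below converts two arrays to standard units and averages their product. The helper names `standard_units` and `correlation`, and the toy data, are assumptions made for illustration. The inputs are converted to NumPy arrays so that the SD is the population SD (NumPy's default), which is what makes the average of the products equal the usual correlation coefficient.

```python
import numpy as np

def standard_units(values):
    """Convert a sequence of numbers to standard units (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    # ndarray.std() uses the population SD (ddof=0) by default.
    return (values - values.mean()) / values.std()

def correlation(x, y):
    """Average of the product of x and y, both in standard units."""
    return (standard_units(x) * standard_units(y)).mean()

# Made-up toy data, for illustration only.
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

r = correlation(x, y)
print(r)
```

The result agrees with `np.corrcoef(x, y)[0, 1]`, which computes the same quantity.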

Let's calculate $r$ for 'acceleration' and 'price'.

Note that the correlation is positive, and most data points fall in the lower left and upper right quadrants!

Let's now calculate $r$ for 'mpg' and 'price'.

Note that the correlation is negative, and most data points fall in the upper left and lower right quadrants!

The correlation coefficient, $r$

Concept Check ✅ – Answer at cc.dsc10.com

Which of the following does the scatter plot below show?

Answer (try it yourself first): B. Association but not correlation

Since there is a pattern in the scatter plot of $x$ and $y$, there is an association between $x$ and $y$. However, correlation refers to linear association, and there is no linear association between $x$ and $y$.

The relationship between $x$ and $y$ is actually $y = x^2$. Even though the association between $x$ and $y$ is very strong, it cannot be described by a linear function, because as $x$ increases, $y$ first decreases and then increases.

The correlation ($r$) between $x$ and $y$ is 0. Try to calculate it yourself!
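One way to check this by hand is sketched below, assuming the $x$ values are spread symmetrically around 0 (the exact values in the scatter plot are not given here, so the array is made up). With $y = x^2$, the positive and negative products in standard units cancel out.

```python
import numpy as np

# Assumed x values, symmetric around 0; the plot's actual values are not known.
x = np.arange(-5, 6)
y = x ** 2

# Convert to standard units and average the product.
x_su = (x - x.mean()) / x.std()
y_su = (y - y.mean()) / y.std()
r = (x_su * y_su).mean()
print(r)
```

If the $x$ values were not symmetric around 0 (for example, all positive), the same calculation would generally not give 0, so the result depends on that assumption.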

Regression

Example: Predicting heights 👪 📏

The data below was collected in the late 1800s by Francis Galton.

Mothers and sons 👵👨

Let's just consider the relationship between mothers' heights and their adult sons' heights.

Predicting a son's height based on his mother's height

Many possible ways to make predictions

Better predictions

The regression line

Making predictions in standard units

Making predictions in original units

Of course, we'd like to be able to predict a son's height in inches, not just in standard units. Given a mother's height in inches, here's how we'll predict her son's height in inches:

  1. Convert the mother's height from inches to standard units.
$$x_{i \: \text{(su)}} = \frac{x_i - \text{mean of $x$}}{\text{SD of $x$}}$$
  2. Multiply by the correlation coefficient to predict the son's height in standard units.
$$\text{predicted } y_{i \: \text{(su)}} = r \cdot x_{i \: \text{(su)}}$$
  3. Convert the son's predicted height from standard units back to inches.
$$\text{predicted } y_i = \text{predicted } y_{i \: \text{(su)}} \cdot \text{SD of $y$} + \text{mean of $y$}$$
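The three steps above can be sketched as a single function. The function name `predict_in_original_units` and the height values are hypothetical (they are not Galton's data); the function only illustrates the arithmetic of the steps.

```python
import numpy as np

def predict_in_original_units(x_new, x, y):
    """Predict y for x_new using the three steps (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Correlation: average product of x and y in standard units.
    x_su = (x - x.mean()) / x.std()
    y_su = (y - y.mean()) / y.std()
    r = (x_su * y_su).mean()

    # Step 1: convert the new x value to standard units.
    x_new_su = (x_new - x.mean()) / x.std()
    # Step 2: multiply by r to predict y in standard units.
    predicted_y_su = r * x_new_su
    # Step 3: convert the prediction back to original units.
    return predicted_y_su * y.std() + y.mean()

# Made-up heights in inches, for illustration only.
mothers = np.array([62.0, 64.0, 65.0, 66.5, 68.0])
sons = np.array([67.0, 69.5, 68.0, 71.0, 72.0])

prediction = predict_in_original_units(65.0, mothers, sons)
print(prediction)
```

A mother of exactly average height is 0 in standard units, so the function predicts the average son's height for her.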

Concept Check ✅ – Answer at cc.dsc10.com

A course has a midterm (mean 80, standard deviation 15) and a really hard final (mean 50, standard deviation 12).

If the scatter plot comparing midterm & final scores for students looks linearly associated with correlation 0.75, then what is the predicted final exam score for a student who received a 90 on the midterm?
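After attempting the question yourself, one way to check your work is to apply the three prediction steps to these numbers, taking the midterm as $x$ and the final as $y$:

$$x_{\text{(su)}} = \frac{90 - 80}{15} = \frac{2}{3}$$

$$\text{predicted } y_{\text{(su)}} = r \cdot x_{\text{(su)}} = 0.75 \cdot \frac{2}{3} = 0.5$$

$$\text{predicted } y = 0.5 \cdot 12 + 50 = 56$$

The student is $\frac{2}{3}$ of an SD above average on the midterm, but is predicted to be only 0.5 SDs above average on the final.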

Summary, next time

Summary

Next time

More on regression, including: