In [1]:
# Set up packages for lecture. Don't worry about understanding this code,
# but make sure to run it if you're following along.
import numpy as np
import babypandas as bpd
import pandas as pd
from matplotlib_inline.backend_inline import set_matplotlib_formats
import matplotlib.pyplot as plt
set_matplotlib_formats("svg")
plt.style.use('ggplot')

plt.rcParams['figure.figsize'] = (10, 5)

np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option("display.max_rows", 7)
pd.set_option("display.max_columns", 8)
pd.set_option("display.precision", 2)

# Animations
from IPython.display import display
import ipywidgets as widgets

import warnings
warnings.filterwarnings('ignore')

# Demonstration code
def r_scatter(r):
    "Generate a scatter plot with a correlation approximately r"
    x = np.random.normal(0, 1, 1000)
    z = np.random.normal(0, 1, 1000)
    y = r * x + (np.sqrt(1 - r ** 2)) * z
    plt.scatter(x, y)
    plt.xlim(-4, 4)
    plt.ylim(-4, 4)
    
def show_scatter_grid():
    plt.subplots(1, 4, figsize=(10, 2))
    for i, r in enumerate([-1, -2/3, -1/3, 0]):
        plt.subplot(1, 4, i+1)
        r_scatter(r)
        plt.title(f'r = {np.round(r, 2)}')
    plt.show()
    plt.subplots(1, 4, figsize=(10, 2))
    for i, r in enumerate([1, 2/3, 1/3]):
        plt.subplot(1, 4, i+1)
        r_scatter(r)
        plt.title(f'r = {np.round(r, 2)}')
    plt.subplot(1, 4, 4)
    plt.axis('off')
    plt.show()

Lecture 24 – Correlation¶

DSC 10, Spring 2023¶

Announcements¶

  • Lab 7 is due on Saturday 6/3 at 11:59PM.
  • The Final Project is due on Tuesday 6/6 at 11:59PM.
    • Issues saving your Final Project notebook? Watch this video!

Agenda¶

  • Association.
  • Correlation.
  • Regression.

Remember to review the end of Lecture 23 for a high-level summary of the second half of the class so far.

Association¶

Prediction¶

  • Suppose we have a dataset with at least two numerical variables.
  • We're interested in predicting one variable based on another:
    • Given my education level, what is my income?
    • Given my height, how tall will my kid be as an adult?
    • Given my age, how many countries have I visited?
  • To do this effectively, we need to first observe a pattern between the two numerical variables.
  • To see if a pattern exists, we'll need to draw a scatter plot.

Association¶

  • An association is any relationship or link 🔗 between two variables in a scatter plot. Associations can be linear or non-linear.
  • If two variables have a positive association ↗️, then as one variable increases, the other tends to increase.
  • If two variables have a negative association ↘️, then as one variable increases, the other tends to decrease.
  • If two variables are associated, then we can predict the value of one variable based on the value of the other.

Example: Hybrid cars 🚗¶

In [2]:
hybrid = bpd.read_csv('data/hybrid.csv')
hybrid
Out[2]:
vehicle year price acceleration mpg class
0 Prius (1st Gen) 1997 24509.74 7.46 41.26 Compact
1 Tino 2000 35354.97 8.20 54.10 Compact
2 Prius (2nd Gen) 2000 26832.25 7.97 45.23 Compact
... ... ... ... ... ... ...
150 C-Max Energi Plug-in 2013 32950.00 11.76 43.00 Midsize
151 Fusion Energi Plug-in 2013 38700.00 11.76 43.00 Midsize
152 Chevrolet Volt 2013 39145.00 11.11 37.00 Compact

153 rows × 6 columns

'price' vs. 'acceleration'¶

Is there an association between these two variables? If so, what kind?

In [3]:
hybrid.plot(kind='scatter', x='acceleration', y='price');

'price' vs. 'mpg'¶

Is there an association between these two variables? If so, what kind?

In [4]:
hybrid.plot(kind='scatter', x='mpg', y='price');

Observations:

  • There is a negative association – cars with better fuel economy tended to be cheaper.
    • Why do we think that is? 🤔
  • The association looks more curved than linear.
    • It may roughly follow $y \approx \frac{1}{x}$.

Linear changes in units¶

  • A linear change in units doesn't change the shape of the plot; it only changes its scale.
    • A linear change means adding or subtracting a constant, and/or multiplying or dividing by a constant.
  • In other words, instead of plotting price in dollars and fuel economy in miles per gallon, we can plot price in Yen (🇯🇵) and fuel economy in kilometers per liter and the plot would look the same, just with different axes:
In [5]:
hybrid.assign(
    km_per_liter=hybrid.get('mpg') * 0.425144,
    yen=hybrid.get('price') * 139.77 
).plot(kind='scatter', x='km_per_liter', y='yen');

Converting columns to standard units¶

  • Recall: Suppose $x$ is a numerical variable, and $x_i$ is one value of that variable. To convert $x_i$ to standard units, $$x_{i \: \text{(su)}} = \frac{x_i - \text{mean of $x$}}{\text{SD of $x$}}$$
  • Converting columns to standard units makes different scatter plots comparable, by putting the $x$ and $y$ axes on the same scale.
    • Both axes measure the number of standard deviations above the mean.
  • Converting columns to standard units doesn't change the shape of the scatter plot, because the conversion is linear.
In [6]:
def standard_units(any_numbers):
    "Convert any array of numbers to standard units."
    any_numbers = np.array(any_numbers)
    return (any_numbers - any_numbers.mean()) / np.std(any_numbers)
In [7]:
def standardize(df):
    """Return a DataFrame in which all columns of df are converted to standard units."""
    df_su = bpd.DataFrame()
    for column in df.columns:
        df_su = df_su.assign(**{column + ' (su)': standard_units(df.get(column))})
    return df_su

Standard units for hybrid cars¶

For a given pair of variables:

  • Which cars are average in both variables?
  • Which cars are well above or well below average in both variables?
In [8]:
hybrid_su = standardize(hybrid.get(['price', 'acceleration', 'mpg'])).assign(vehicle=hybrid.get('vehicle'))
hybrid_su
Out[8]:
price (su) acceleration (su) mpg (su) vehicle
0 -6.94e-01 -1.54 0.59 Prius (1st Gen)
1 -1.86e-01 -1.28 1.76 Tino
2 -5.85e-01 -1.36 0.95 Prius (2nd Gen)
... ... ... ... ...
150 -2.98e-01 -0.07 0.75 C-Max Energi Plug-in
151 -2.90e-02 -0.07 0.75 Fusion Energi Plug-in
152 -8.17e-03 -0.29 0.20 Chevrolet Volt

153 rows × 4 columns

'price' vs. 'acceleration'¶

In [9]:
hybrid_su.plot(kind='scatter', x='acceleration (su)', y='price (su)');

Which cars have 'acceleration's and 'price's that are more than 2 SDs above average?

In [10]:
hybrid_su[(hybrid_su.get('acceleration (su)') > 2) &
          (hybrid_su.get('price (su)') > 2)]
Out[10]:
price (su) acceleration (su) mpg (su) vehicle
47 2.71 2.05 -1.46 ActiveHybrid X6
60 3.04 2.88 -1.16 ActiveHybrid 7
95 2.96 2.12 -1.35 ActiveHybrid 7i
146 2.11 2.12 -0.90 ActiveHybrid 7L
147 2.66 2.24 -0.90 Panamera S

'price' vs. 'mpg'¶

In [11]:
hybrid_su.plot(kind='scatter', x='mpg (su)', y='price (su)');

Which cars have close to average 'mpg's and close to average 'price's?

In [12]:
hybrid_su[(hybrid_su.get('mpg (su)') <= 0.3) &
          (hybrid_su.get('mpg (su)') >= -0.3) &
          (hybrid_su.get('price (su)') <= 0.3) &
          (hybrid_su.get('price (su)') >= -0.3)]
Out[12]:
price (su) acceleration (su) mpg (su) vehicle
10 -1.24e-01 -0.56 -0.26 Escape
22 -2.13e-01 -1.02 -0.17 Mercury Mariner
57 -8.47e-02 0.72 -0.11 Audi Q5
... ... ... ... ...
70 -2.14e-01 -0.07 0.02 HS 250h
102 -2.69e-03 -0.29 0.20 Chevrolet Volt
152 -8.17e-03 -0.29 0.20 Chevrolet Volt

8 rows × 4 columns

Observation on associations in standard units¶

  • If two variables are positively associated ↗️,
    • when one variable is positive, the other tends to be positive, and
    • when one variable is negative, the other also tends to be negative (see the sketch after this list).
  • If two variables are negatively associated ↘️,
    • when one variable is positive, the other tends to be negative, and vice versa.
  • If two variables aren't associated, there should be no such pattern.
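
As a quick illustration of the first point, here is a minimal sketch (not part of the lecture code) that counts how often 'acceleration' and 'price' have the same sign in the hybrid_su table from above. For a positively associated pair, this fraction should be well above one half.

acc_su = np.array(hybrid_su.get('acceleration (su)'))
price_su = np.array(hybrid_su.get('price (su)'))
# Fraction of cars in the lower-left or upper-right quadrant (matching signs).
(np.sign(acc_su) == np.sign(price_su)).mean()

This is consistent with the scatter plot of 'price (su)' vs. 'acceleration (su)' above, where most points fall in the lower-left and upper-right quadrants.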

Correlation¶

Definition: Correlation coefficient¶

The correlation coefficient $r$ of two variables $x$ and $y$ is defined as the

  • average value of the
  • product of $x$ and $y$
  • when both are measured in standard units.

If x and y are two Series or arrays,

r = (x_su * y_su).mean()

where x_su and y_su are x and y converted to standard units.

In [13]:
def calculate_r(df, x, y):
    x_su = df.get(x + ' (su)')
    y_su = df.get(y + ' (su)')
    return (x_su * y_su).mean()
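
To connect this back to the definition, here is a minimal sketch that applies it directly to two short made-up arrays (hypothetical data, not from the lecture), using the standard_units function from earlier: $r$ is just the mean of the elementwise product of the two variables in standard units.

x = np.array([1, 2, 3, 4, 5])   # hypothetical data, for illustration only
y = np.array([2, 3, 5, 4, 6])
# r is the average of the products of x and y, measured in standard units.
(standard_units(x) * standard_units(y)).mean()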

Let's calculate $r$ for 'acceleration' and 'price'.

In [14]:
hybrid_su
Out[14]:
price (su) acceleration (su) mpg (su) vehicle
0 -6.94e-01 -1.54 0.59 Prius (1st Gen)
1 -1.86e-01 -1.28 1.76 Tino
2 -5.85e-01 -1.36 0.95 Prius (2nd Gen)
... ... ... ... ...
150 -2.98e-01 -0.07 0.75 C-Max Energi Plug-in
151 -2.90e-02 -0.07 0.75 Fusion Energi Plug-in
152 -8.17e-03 -0.29 0.20 Chevrolet Volt

153 rows × 4 columns

In [15]:
r_acc_price = calculate_r(hybrid_su, 'acceleration', 'price')
r_acc_price
Out[15]:
0.6955778996913982
In [16]:
hybrid_su.plot(kind='scatter', x='acceleration (su)', y='price (su)')
plt.axvline(0, color='black');
plt.axhline(0, color='black');

Note that the correlation is positive, and most data points fall in the lower left and upper right quadrants!

Let's now calculate $r$ for 'mpg' and 'price'.

In [17]:
hybrid_su
Out[17]:
price (su) acceleration (su) mpg (su) vehicle
0 -6.94e-01 -1.54 0.59 Prius (1st Gen)
1 -1.86e-01 -1.28 1.76 Tino
2 -5.85e-01 -1.36 0.95 Prius (2nd Gen)
... ... ... ... ...
150 -2.98e-01 -0.07 0.75 C-Max Energi Plug-in
151 -2.90e-02 -0.07 0.75 Fusion Energi Plug-in
152 -8.17e-03 -0.29 0.20 Chevrolet Volt

153 rows × 4 columns

In [18]:
r_mpg_price = calculate_r(hybrid_su, 'mpg', 'price')
r_mpg_price
Out[18]:
-0.5318263633683789
In [19]:
hybrid_su.plot(kind='scatter', x='mpg (su)', y='price (su)');
plt.axvline(0, color='black');
plt.axhline(0, color='black');

Note that the correlation is negative, and most data points fall in the upper left and lower right quadrants!

The correlation coefficient, $r$¶

  • $r$ measures how clustered points are around a straight line – it measures linear association.
    • If two variables are correlated, it means they are linearly associated.
  • $r$ is always between $-1$ and $1$.
    • If $r = 1$, the points lie on a perfectly straight line that slopes upwards (slope 1 in standard units).
    • If $r = -1$, the points lie on a perfectly straight line that slopes downwards (slope -1 in standard units).
    • If $r = 0$, there is no linear association (uncorrelated).
In [20]:
show_scatter_grid()
  • $r$ is computed based on standard units.
    • The correlation between price in dollars and fuel economy in miles per gallon is the same as the correlation between price in Yen and fuel economy in kilometers per liter (verified in the sketch after this list).
  • $r$ quantifies how well we can predict one variable using the other.
    • If $r$ is close to $1$ or $-1$ we can predict one variable from the other quite accurately.
    • If $r$ is close to $0$, we cannot make good predictions.
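
Here is a minimal sketch that verifies the unit-invariance claim above using NumPy's built-in np.corrcoef, with the same conversion factors used earlier in the lecture. (np.corrcoef returns a 2x2 matrix; the off-diagonal entry is $r$.)

mpg = np.array(hybrid.get('mpg'))
price = np.array(hybrid.get('price'))
r_dollars_mpg = np.corrcoef(mpg, price)[0, 1]
r_yen_kmpl = np.corrcoef(mpg * 0.425144, price * 139.77)[0, 1]
# The two values are equal, up to floating-point error.
r_dollars_mpg, r_yen_kmpl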

Concept Check ✅ – Answer at cc.dsc10.com¶

Which of the following does the scatter plot below show?

  • A. Association and correlation
  • B. Association but not correlation
  • C. Correlation but not association
  • D. Neither association nor correlation
In [21]:
x2 = bpd.DataFrame().assign(
    x=np.arange(-6, 6.1, 0.5), 
    y=np.arange(-6, 6.1, 0.5) ** 2
)
x2.plot(kind='scatter', x='x', y='y');
✅ Click here to see the answer after trying it yourself.

B. Association but not correlation.

Since there is a pattern in the scatter plot of $x$ and $y$, there is an association between $x$ and $y$. However, correlation refers to linear association, and there is no linear association between $x$ and $y$. The relationship between $x$ and $y$ is actually $y = x^2$. Even though the association between $x$ and $y$ is very strong, it cannot be described by a linear function, because as $x$ increases, $y$ first decreases and then increases. The correlation ($r$) between $x$ and $y$ is 0 – try to calculate it yourself!
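
Here is a minimal sketch that does that calculation, using the standardize and calculate_r helpers defined earlier:

# The correlation between x and y is (essentially) zero, even though
# the two variables are strongly (non-linearly) associated.
x2_su = standardize(x2)
calculate_r(x2_su, 'x', 'y')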

Regression¶

Example: Predicting heights 👪 📏¶

The data below was collected in the late 1800s by Francis Galton.

  • He was a eugenicist and proponent of scientific racism, which is why he collected this data.
  • Today, we understand that eugenics is immoral, and that there is no scientific evidence or any other justification for racism.
  • Galton is credited with discovering regression using this data.
In [22]:
galton = bpd.read_csv('data/galton.csv')
galton
Out[22]:
family father mother midparentHeight children childNum gender childHeight
0 1 78.5 67.0 75.43 4 1 male 73.2
1 1 78.5 67.0 75.43 4 2 female 69.2
2 1 78.5 67.0 75.43 4 3 female 69.0
... ... ... ... ... ... ... ... ...
931 203 62.0 66.0 66.64 3 3 female 61.0
932 204 62.5 63.0 65.27 2 1 male 66.5
933 204 62.5 63.0 65.27 2 2 female 57.0

934 rows × 8 columns

Mothers and sons 👵👨¶

Let's just consider the relationship between mothers' heights and their adult sons' heights.

In [23]:
male_children = galton[galton.get('gender') == 'male']
mom_son = bpd.DataFrame().assign(mom = male_children.get('mother'), 
                                 son = male_children.get('childHeight'))
mom_son
Out[23]:
mom son
0 67.0 73.2
4 66.5 73.5
5 66.5 72.5
... ... ...
925 60.0 66.0
929 66.0 64.0
932 63.0 66.5

481 rows × 2 columns

In [24]:
mom_son.plot(kind='scatter', x='mom', y='son');

Predicting a son's height based on his mother's height¶

  • The scatter plot demonstrates a positive association between a mother's height ('mom') and her son's height ('son').
  • Let's quantify how linear that association is by computing the correlation between 'mom' and 'son'.
  • First, we standardize the data.
In [25]:
mom_son_su = standardize(mom_son)
mom_son_su.plot(kind='scatter', x='mom (su)', y='son (su)');
In [26]:
r_mom_son = calculate_r(mom_son_su, 'mom', 'son')
r_mom_son
Out[26]:
0.32300498368490554

Many possible ways to make predictions¶

  • We want a simple strategy, or rule, for predicting a son's height.
  • The simplest possible prediction strategy just predicts the same value for every son's height, regardless of his mother's height.
  • Some such predictions are better than others.
In [27]:
def constant_prediction(prediction):
    mom_son_su.plot(kind='scatter', x='mom (su)', y='son (su)', title=f'Predicting a height of {prediction} SUs for all sons', figsize=(10, 5));
    plt.axhline(prediction, color='orange', lw=4);
    plt.xlim(-3, 3)
    plt.show()

prediction = widgets.FloatSlider(value=-3, min=-3,max=3,step=0.5, description='prediction')
ui = widgets.HBox([prediction])
out = widgets.interactive_output(constant_prediction, {'prediction': prediction})
display(ui, out)
  • Which of these predictions is the best?
    • It depends on what we mean by "best," but a natural choice is the rule that predicts 0 standard units, because this corresponds to the mean height of all sons.
In [28]:
mom_son_su.plot(kind='scatter', x='mom (su)', y='son (su)', title='A good prediction is the mean height of sons (0 SUs)', figsize=(10, 5));
plt.axhline(0, color='orange', lw=4);
plt.xlim(-3, 3);

Better predictions¶

  • Since there is a linear association between a son's height and his mother's height, we can make better predictions by allowing our predictions to vary with the mother's height.
  • The simplest way to do this uses a line to make predictions.
  • As before, some lines are better than others.
In [29]:
def linear_prediction(slope):
    x = np.linspace(-3, 3)
    y = x * slope
    mom_son_su.plot(kind='scatter', x='mom (su)', y='son (su)', figsize=(10, 5));
    plt.plot(x, y, color='orange', lw=4)
    plt.xlim(-3, 3)
    plt.title(r"Predicting sons' heights using $\mathrm{son}_{\mathrm{(su)}}$ = " + str(np.round(slope, 2)) + r"$ \cdot \mathrm{mother}_{\mathrm{(su)}}$")
    plt.show()

slope = widgets.FloatSlider(value=0, min=-1,max=1,step=1/6, description='slope')
ui = widgets.HBox([slope])
out = widgets.interactive_output(linear_prediction, {'slope': slope})
display(ui, out)
  • Which of these lines is the best?
    • Again, it depends on what we mean by "best," but a good choice is the line that goes through the origin and has a slope of $r$.
    • This line is called the regression line, and we'll see next time that it is the "best" line for making predictions in a certain sense.
In [30]:
x = np.linspace(-3, 3)
y = x * r_mom_son
mom_son_su.plot(kind='scatter', x='mom (su)', y='son (su)', title=r'A good line goes through the origin and has slope $r$', figsize=(10, 5));
plt.plot(x, y, color='orange', label='regression line', lw=4)
plt.xlim(-3, 3)
plt.legend();

The regression line¶

  • The regression line is the line through $(0,0)$ with slope $r$, when both variables are measured in standard units.
  • We use the regression line to make predictions!

Making predictions in standard units¶

  • If $r = 0.32$, and the given $x$ is $2$ in standard units, then the prediction for $y$ is $0.64$ standard units.
    • The regression line predicts that a mother whose height is $2$ SDs above average has a son whose height is $0.64$ SDs above average.
  • If $r = 0.32$, and the given $x$ is $-1$ in standard units, then the prediction for $y$ is $-0.32$ standard units (see the sketch after this list).
  • We always predict that a son will be somewhat closer to average in height than his mother.
    • This is a consequence of the slope $r$ having magnitude less than 1.
    • This effect is called regression to the mean.
  • The regression line passes through the origin $(0, 0)$ in standard units. This means that, no matter what $r$ is, for an average $x$ value, we predict an average $y$ value.
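
In standard units, the arithmetic above is just multiplication by $r$; a minimal sketch using the r_mom_son value computed earlier (the bullets round it to $0.32$):

# Predicted y in standard units is r times the given x in standard units.
r_mom_son * 2    # mother 2 SDs above average: about 0.65 SDs (0.64 with r rounded to 0.32)
r_mom_son * -1   # mother 1 SD below average: about -0.32 SDs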

Making predictions in original units¶

Of course, we'd like to be able to predict a son's height in inches, not just in standard units. Given a mother's height in inches, here's how we'll predict her son's height in inches:

  1. Convert the mother's height from inches to standard units.
$$x_{i \: \text{(su)}} = \frac{x_i - \text{mean of $x$}}{\text{SD of $x$}}$$
  2. Multiply by the correlation coefficient to predict the son's height in standard units.
$$\text{predicted } y_{i \: \text{(su)}} = r \cdot x_{i \: \text{(su)}}$$
  3. Convert the son's predicted height from standard units back to inches.
$$\text{predicted } y_i = \text{predicted } y_{i \: \text{(su)}} \cdot \text{SD of $y$} + \text{mean of $y$}$$

Let's try it!

In [31]:
mom_mean = mom_son.get('mom').mean()
mom_sd = np.std(mom_son.get('mom'))
son_mean = mom_son.get('son').mean()
son_sd = np.std(mom_son.get('son'))
In [32]:
def predict_with_r(mom):
    """Return a prediction for the height of a son whose mother has height mom, 
    using linear regression.
    """
    mom_su = (mom - mom_mean) / mom_sd
    son_su = r_mom_son * mom_su
    return son_su * son_sd + son_mean
In [33]:
predict_with_r(68)
Out[33]:
70.68219686848828
In [34]:
predict_with_r(60)
Out[34]:
67.76170758654767
In [35]:
preds = mom_son.assign(
    predicted_height=mom_son.get('mom').apply(predict_with_r)
)
ax = preds.plot(kind='scatter', x='mom', y='son', title='Regression line predictions, in original units', figsize=(10, 5), label='original data')
preds.plot(kind='line', x='mom', y='predicted_height', ax=ax, color='orange', label='regression line', lw=4);
plt.legend();

Concept Check ✅ – Answer at cc.dsc10.com¶

A course has a midterm (mean 80, standard deviation 15) and a really hard final (mean 50, standard deviation 12).

If the scatter plot comparing midterm & final scores for students looks linearly associated with correlation 0.75, then what is the predicted final exam score for a student who received a 90 on the midterm?

  • A. 54
  • B. 56
  • C. 58
  • D. 60
  • E. 62

Summary, next time¶

Summary¶

  • The correlation coefficient, $r$, measures the linear association between two variables $x$ and $y$.
    • It ranges between -1 and 1.
  • When both variables are measured in standard units, the regression line is the straight line passing through $(0, 0)$ with slope $r$. We can use it to make predictions for a $y$ value (e.g. son's height) given an $x$ value (e.g. mother's height).

Next time¶

More on regression, including:

  • What is the equation of the regression line in original units (e.g. inches)?
  • In what sense is the regression line the "best" line for making predictions?