In [1]:

```
# Run this cell to set up packages for lecture.
from lec22_imports import *
```

- Quiz 5 is **today in discussion section**.
  - It covers Lectures 18-21 (excluding Permutation Testing).
  - Practice with the problems here.
- Homework 6 is due **Thursday at 11:59PM**.
- Start working on the Final Project, due **Tuesday 3/12 at 11:59PM**.
- On Friday, I will not be here and we won't have live lecture. Instead, I will post a recording of Lecture 24 for you to watch asynchronously.

- Permutation testing.
- Are the distributions of weight for babies 👶 born to smoking mothers vs. non-smoking mothers different?
- Are the distributions of pressure drops for footballs 🏈 from two different teams different?

- Standard hypothesis testing answers questions of the form:

  I have a population distribution, and I have **one sample**. Does this sample look like it was drawn from the population?

- Permutation testing answers questions of the form:

  I have **two samples**, but no information about any population distributions. Do these samples look like they were drawn from the same population?

- Standard hypothesis testing involves a known population distribution, but permutation testing involves an **unknown population distribution**. How do you determine whether two samples came from the same population distribution, if you don't know what that population distribution is?

In [2]:

```
babies = bpd.read_csv('data/baby.csv').get(['Maternal Smoker', 'Birth Weight'])
babies
```

Out[2]:

| | Maternal Smoker | Birth Weight |
|---|---|---|
| 0 | False | 120 |
| 1 | False | 113 |
| 2 | True | 128 |
| ... | ... | ... |
| 1171 | True | 130 |
| 1172 | False | 125 |
| 1173 | False | 117 |

1174 rows × 2 columns

- The means of the two groups in our sample are different.

In [3]:

```
babies.groupby('Maternal Smoker').mean()
```

Out[3]:

| Maternal Smoker | Birth Weight |
|---|---|
| False | 123.09 |
| True | 113.82 |

In [4]:

```
diff_in_means = (babies.groupby('Maternal Smoker').mean().get('Birth Weight').loc[False] -
                 babies.groupby('Maternal Smoker').mean().get('Birth Weight').loc[True])
diff_in_means
```

Out[4]:

9.266142572024918

**Question:** Is there a significant difference in the weights of babies born to mothers who smoked vs. babies born to mothers who didn't smoke?

**Null Hypothesis**: In the population, birth weights of smokers' babies and non-smokers' babies have the same distribution, and the observed differences in our samples are due to random chance.

**Alternative Hypothesis**: In the population, smokers' babies have lower birth weights than non-smokers' babies, on average. The observed differences in our samples cannot be explained by random chance alone.

**Test statistic**: Difference in mean birth weight of non-smokers' babies and smokers' babies.

- We need the **distribution of the test statistic under the assumption that the null hypothesis is true**.

- Under the null hypothesis, both groups are sampled from the same population distribution.

- 🚨 **Issue**: We don't have the population distribution, so we can't draw samples from it!

**Idea**: We can construct a "population" by combining both of our samples. Then, to create two random samples from it, we just separate (or split) the population into two random groups.
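The idea above can be sketched with plain NumPy. This is a minimal sketch using hypothetical toy weights (not the real dataset or the lecture's babypandas code): pool the two samples, shuffle the pool, and split it back into groups of the original sizes.

```python
import numpy as np

# Hypothetical toy weights (illustrative only, not the real dataset).
sample_a = np.array([120, 113, 128, 125, 130, 117, 122])  # e.g., non-smokers
sample_b = np.array([105, 128, 110, 130])                 # e.g., smokers

# Pool both samples into one "population".
pooled = np.append(sample_a, sample_b)

# Shuffle the pooled data, then split it back into two groups
# whose sizes match the original samples (7 and 4 here).
shuffled = np.random.permutation(pooled)
new_a = shuffled[:len(sample_a)]
new_b = shuffled[len(sample_a):]
```

Note that every original weight ends up in exactly one of the two new groups; only the group assignments are random.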

In [5]:

```
show_permutation_testing_intro()
```

- A **permutation** of a sequence is a rearrangement of the elements in that sequence.
  - For example, `'BAC'` and `'CAB'` are both permutations of the string `'ABC'`.
- We create permutations by **shuffling**.

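As a quick sanity check, Python's standard library can enumerate every permutation of a short sequence, confirming that `'BAC'` and `'CAB'` are among the six permutations of `'ABC'`:

```python
from itertools import permutations

# All 3! = 6 permutations of the string 'ABC'.
perms = [''.join(p) for p in permutations('ABC')]
perms  # ['ABC', 'ACB', 'BAC', 'BCA', 'CAB', 'CBA']
```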
- In the previous animation, we repeatedly split the "population" into two random groups whose sizes were equal to the original samples' sizes.
- In the original non-smokers' sample, there were 7 weights, and in the original smokers' sample, there were 4 weights.
- Each time we created a pair of new samples, we randomly chose 7 weights to be part of the *new* non-smokers' sample, and the other 4 weights to be part of the *new* smokers' sample.

**Key idea:** To randomly assign weights to groups, in a way that preserves the sizes of the groups, we can just shuffle the `'Maternal Smoker'` column of `babies`!

A permutation test is a type of A/B test (and a type of hypothesis test). It tests whether two samples come from the same population distribution. To conduct a permutation test:

1. Shuffle the group labels (i.e. the `True`s and `False`s) to generate two new samples under the null. These two new samples have the same sizes as the original samples.
2. Compute the difference in group means (the test statistic).
3. Repeat steps 1 and 2 to generate an **empirical distribution of the difference in group means**.
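These steps can be sketched in plain NumPy with hypothetical toy data (the values, labels, and function names here are illustrative; the cells that follow implement the same loop on the real data with babypandas):

```python
import numpy as np

# Hypothetical toy data: numerical values with True/False group labels.
values = np.array([8.0, 9.0, 7.5, 6.0, 5.5, 6.5])
labels = np.array([False, False, False, True, True, True])

def diff_of_means(vals, labs):
    # Test statistic: mean of the False group minus mean of the True group.
    return vals[~labs].mean() - vals[labs].mean()

observed = diff_of_means(values, labels)

# Steps 1 and 2, repeated: shuffle the labels and recompute the statistic.
simulated = np.array([
    diff_of_means(values, np.random.permutation(labels))
    for _ in range(1000)
])
```

The array `simulated` is the empirical distribution of the test statistic under the null.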

- We want to randomly shuffle just the `'Maternal Smoker'` column in the `babies` DataFrame.
- `df.sample` returns a random sample of the rows in a DataFrame, but we want to shuffle one column independently.

In [6]:

```
data = bpd.DataFrame().assign(x=['a', 'b', 'c', 'd', 'e'], y=[1, 2, 3, 4, 5])
data
```

Out[6]:

| | x | y |
|---|---|---|
| 0 | a | 1 |
| 1 | b | 2 |
| 2 | c | 3 |
| 3 | d | 4 |
| 4 | e | 5 |

In [7]:

```
# The order of the rows is different,
# but each x is still in a row with the same y.
# This is NOT what we want.
data.sample(data.shape[0])
```

Out[7]:

| | x | y |
|---|---|---|
| 0 | a | 1 |
| 3 | d | 4 |
| 1 | b | 2 |
| 4 | e | 5 |
| 2 | c | 3 |

**Solution:** Use `np.random.permutation`, which takes in a sequence and returns a shuffled version of it, as an array.

In [8]:

```
# Random!
np.random.permutation(data.get('x'))
```

Out[8]:

array(['c', 'e', 'a', 'd', 'b'], dtype=object)

In [9]:

```
data.assign(shuffled_x=np.random.permutation(data.get('x')))
```

Out[9]:

| | x | y | shuffled_x |
|---|---|---|---|
| 0 | a | 1 | c |
| 1 | b | 2 | e |
| 2 | c | 3 | d |
| 3 | d | 4 | b |
| 4 | e | 5 | a |

As mentioned before, we'll shuffle the `'Maternal Smoker'` column.

In [10]:

```
babies_with_shuffled = babies.assign(
    Shuffled_Labels=np.random.permutation(babies.get('Maternal Smoker'))
)
babies_with_shuffled
```

Out[10]:

| | Maternal Smoker | Birth Weight | Shuffled_Labels |
|---|---|---|---|
| 0 | False | 120 | False |
| 1 | False | 113 | True |
| 2 | True | 128 | False |
| ... | ... | ... | ... |
| 1171 | True | 130 | False |
| 1172 | False | 125 | True |
| 1173 | False | 117 | True |

1174 rows × 3 columns

Let's look at the distributions of the two new samples we just generated.

In [11]:

```
fig, ax = plt.subplots()
baby_bins = np.arange(50, 200, 5)
smokers = babies_with_shuffled[babies_with_shuffled.get('Shuffled_Labels')]
non_smokers = babies_with_shuffled[babies_with_shuffled.get('Shuffled_Labels') == False]
non_smokers.plot(kind='hist', y='Birth Weight', density=True, ax=ax, alpha=0.75, bins=baby_bins, ec='w', figsize=(10, 5))
smokers.plot(kind='hist',y='Birth Weight', density=True, ax=ax, alpha=0.75, bins=baby_bins)
plt.legend(['Maternal Smoker = False', 'Maternal Smoker = True'])
plt.xlabel('Birth Weight');
```

What do you notice? 👀

In [12]:

```
babies_with_shuffled.groupby('Shuffled_Labels').mean().get(['Birth Weight'])
```

Out[12]:

| Shuffled_Labels | Birth Weight |
|---|---|
| False | 119.25 |
| True | 119.79 |

In [13]:

```
group_means = babies_with_shuffled.groupby('Shuffled_Labels').mean().get('Birth Weight')
group_means.loc[False] - group_means.loc[True]
```

Out[13]:

-0.5355241708182774

In [14]:

```
def difference_in_group_means(weights_df):
    group_means = weights_df.groupby('Shuffled_Labels').mean().get('Birth Weight')
    return group_means.loc[False] - group_means.loc[True]

difference_in_group_means(babies_with_shuffled)
```

Out[14]:

-0.5355241708182774

- This was just one random shuffle.

- How likely is it that a random shuffle results in a 9.26+ ounce difference in means?

- We have to repeat the shuffling many times. On each iteration:
  - Shuffle the labels to create two new samples.
  - Add them as a column to the DataFrame.
  - Compute the difference in group means in the two new samples and store the result.

In [15]:

```
n_repetitions = 500 # The dataset is large, so it takes too long to run if we use 5000 or 10000.

differences = np.array([])
for i in np.arange(n_repetitions):
    # Step 1: Shuffle the labels to create two new samples.
    shuffled_labels = np.random.permutation(babies.get('Maternal Smoker'))

    # Step 2: Add them as a column to the DataFrame.
    shuffled = babies_with_shuffled.assign(Shuffled_Labels=shuffled_labels)

    # Step 3: Compute the difference in group means in the two new samples and store the result.
    difference = difference_in_group_means(shuffled)
    differences = np.append(differences, difference)

differences
```

Out[15]:

array([-0.69, 0.01, -0.6 , ..., -0.91, -0.35, 0.73])

In [16]:

```
(bpd.DataFrame()
.assign(simulated_diffs=differences)
.plot(kind='hist', bins=20, density=True, ec='w', figsize=(10, 5))
);
```

- Note that the empirical distribution of the test statistic (difference in means) is centered around 0.
- This matches our intuition – if the null hypothesis is true, there should be no difference in the group means on average.

Where does our observed statistic lie?

In [17]:

```
(bpd.DataFrame()
.assign(simulated_diffs=differences)
.plot(kind='hist', bins=20, density=True, ec='w', figsize=(10, 5))
);
plt.axvline(diff_in_means, color='black', linewidth=4, label='observed difference in means')
plt.legend();
```

In [18]:

```
smoker_p_value = np.count_nonzero(differences >= diff_in_means) / n_repetitions
smoker_p_value
```

Out[18]:

0.0

- Under the null hypothesis, we rarely see differences as large as 9.26 ounces.

- Still, we can't conclude that smoking *causes* lower birth weight, because there may be other factors at play. For instance, maybe smokers are more likely to drink caffeine, and caffeine causes lower birth weight.

In [19]:

```
show_permutation_testing_summary()
```

Recall, `babies` has two columns.

In [20]:

```
babies.take(np.arange(3))
```

Out[20]:

| | Maternal Smoker | Birth Weight |
|---|---|---|
| 0 | False | 120 |
| 1 | False | 113 |
| 2 | True | 128 |

To randomly assign weights to groups, we shuffled the `'Maternal Smoker'` column. Could we have shuffled the `'Birth Weight'` column instead?

- A. Yes
- B. No

- On January 18, 2015, the New England Patriots played the Indianapolis Colts for a spot in the Super Bowl.
- The Patriots won, 45-7. They went on to win the Super Bowl.
- After the game, **it was alleged that the Patriots intentionally deflated footballs**, making them easier to catch. This scandal was called "Deflategate."

- Each team brings 12 footballs to the game. Teams use their own footballs while on offense.

- NFL rules stipulate that **each ball must be inflated to between 12.5 and 13.5 pounds per square inch (psi)**.

- Before the game, officials found that all of the Patriots' footballs were at about 12.5 psi, and that all of the Colts' footballs were at about 13.0 psi.
- This pre-game data was not written down.

- At halftime, two officials (Clete Blakeman and Dyrol Prioleau) independently measured the pressures of as many of the 24 footballs as they could.
- They ran out of time before they could finish.

- Note that the relevant quantity is the **change in pressure** from the start of the game to halftime.
  - The Patriots' balls *started* at a lower psi (which is not an issue on its own).
  - The allegations were that the Patriots **deflated** their balls during the game.
In [21]:

```
footballs = bpd.read_csv('data/footballs.csv')
footballs
```

Out[21]:

| | Team | Pressure | PressureDrop |
|---|---|---|---|
| 0 | Patriots | 11.65 | 0.85 |
| 1 | Patriots | 11.03 | 1.48 |
| 2 | Patriots | 10.85 | 1.65 |
| ... | ... | ... | ... |
| 11 | Colts | 12.53 | 0.47 |
| 12 | Colts | 12.72 | 0.28 |
| 13 | Colts | 12.35 | 0.65 |

14 rows × 3 columns

- There are only 14 rows (10 for Patriots footballs, 4 for Colts footballs) since the officials weren't able to record the pressures of every ball.
- The `'Pressure'` column records the average of the two officials' measurements at halftime.
- The `'PressureDrop'` column records the difference between the estimated starting pressure and the average recorded `'Pressure'` of each football.

Did the Patriots' footballs drop in pressure more than the Colts'?

**Null Hypothesis**: The drops in pressure for both teams came from the same distribution; by chance, the Patriots' footballs deflated more.

**Alternative Hypothesis**: No; the Patriots' footballs deflated more than one would expect due to random chance alone.

Similar to the baby weights example, our test statistic will be the difference between the teams' average pressure drops. We'll calculate the mean drop for the `'Patriots'` minus the mean drop for the `'Colts'`.

In [22]:

```
means = footballs.groupby('Team').mean().get('PressureDrop')
means
```

Out[22]:

Team
Colts       0.47
Patriots    1.21
Name: PressureDrop, dtype: float64

In [23]:

```
# Calculate the observed statistic.
observed_difference = means.loc['Patriots'] - means.loc['Colts']
observed_difference
```

Out[23]:

0.7362500000000001

The average pressure drop for the Patriots was about 0.74 psi more than for the Colts.

We'll run a permutation test to see if 0.74 psi is a significant difference.

- To do this, we'll need to repeatedly shuffle either the `'Team'` or the `'PressureDrop'` column.
- We'll shuffle the `'PressureDrop'` column.
- Tip: It's a good idea to simulate one value of the test statistic before putting everything in a `for`-loop.

In [24]:

```
# For simplicity, keep only the columns that are necessary for the test:
# one column of group labels and one column of numerical values.
footballs = footballs.get(['Team', 'PressureDrop'])
footballs
```

Out[24]:

| | Team | PressureDrop |
|---|---|---|
| 0 | Patriots | 0.85 |
| 1 | Patriots | 1.48 |
| 2 | Patriots | 1.65 |
| ... | ... | ... |
| 11 | Colts | 0.47 |
| 12 | Colts | 0.28 |
| 13 | Colts | 0.65 |

14 rows × 2 columns

In [25]:

```
# Shuffle one column.
# We chose to shuffle the numerical data (pressure drops), but we could have shuffled the group labels (team names) instead.
shuffled_drops = np.random.permutation(footballs.get('PressureDrop'))
shuffled_drops
```

Out[25]:

array([0.72, 0.85, 1.18, 1.65, 0.28, 1.23, 1.8 , 1.35, 1.48, 0.47, 0.65, 0.42, 1.38, 0.47])

In [26]:

```
# Add the shuffled column back to the DataFrame.
shuffled = footballs.assign(Shuffled_Drops=shuffled_drops)
shuffled
```

Out[26]:

| | Team | PressureDrop | Shuffled_Drops |
|---|---|---|---|
| 0 | Patriots | 0.85 | 0.72 |
| 1 | Patriots | 1.48 | 0.85 |
| 2 | Patriots | 1.65 | 1.18 |
| ... | ... | ... | ... |
| 11 | Colts | 0.47 | 0.42 |
| 12 | Colts | 0.28 | 1.38 |
| 13 | Colts | 0.65 | 0.47 |

14 rows × 3 columns

In [27]:

```
# Calculate the group means for the two randomly created groups.
team_means = shuffled.groupby('Team').mean().get('Shuffled_Drops')
team_means
```

Out[27]:

Team
Colts       0.73
Patriots    1.10
Name: Shuffled_Drops, dtype: float64

In [28]:

```
# Calculate the difference in group means (Patriots minus Colts) for the randomly created groups.
team_means.loc['Patriots'] - team_means.loc['Colts']
```

Out[28]:

0.36875000000000013

- Repeat the process many times by wrapping it inside a `for`-loop.
- Keep track of the difference in group means in an array, appending each time.
- Optionally, create a function to calculate the difference in group means.

In [29]:

```
def difference_in_mean_pressure_drops(pressures_df):
    team_means = pressures_df.groupby('Team').mean().get('Shuffled_Drops')
    return team_means.loc['Patriots'] - team_means.loc['Colts']
```

In [30]:

```
n_repetitions = 5000 # The dataset is much smaller than in the baby weights example, so a larger number of repetitions will still run quickly.

differences = np.array([])
for i in np.arange(n_repetitions):
    # Step 1: Shuffle the pressure drops.
    shuffled_drops = np.random.permutation(footballs.get('PressureDrop'))

    # Step 2: Put them in a DataFrame.
    shuffled = footballs.assign(Shuffled_Drops=shuffled_drops)

    # Step 3: Compute the difference in group means and add the result to the differences array.
    difference = difference_in_mean_pressure_drops(shuffled)
    differences = np.append(differences, difference)

differences
```

Out[30]:

array([ 0.35, -0.38, 0.07, ..., -0.02, -0.39, 0.27])

In [31]:

```
bpd.DataFrame().assign(SimulatedDifferenceInMeans=differences).plot(kind='hist', bins=20, density=True, ec='w', figsize=(10, 5))
plt.axvline(observed_difference, color='black', linewidth=4, label='observed difference in means')
plt.legend();
```

It doesn't look good for the Patriots. What is the p-value?

- Recall, the p-value is the probability, under the null hypothesis, of seeing a result **as or more extreme** than the observation.
- In this case, that's the probability of the difference in mean pressure drops being greater than or equal to 0.74 psi.

In [32]:

```
np.count_nonzero(differences >= observed_difference) / n_repetitions
```

Out[32]:

0.0034

This p-value is small enough for us to consider the difference *highly* statistically significant ($p<0.01$).

- We reject the null hypothesis, as it is unlikely that the difference in mean pressure drops is due to chance alone.
- But this doesn't establish **causation**.
- That is, we can't conclude that the Patriots **intentionally** deflated their footballs.

Quote from an investigative report commissioned by the NFL:

“[T]he average pressure drop of the Patriots game balls exceeded the average pressure drop of the Colts balls by 0.45 to 1.02 psi, depending on various possible assumptions regarding the gauges used, and assuming an initial pressure of 12.5 psi for the Patriots balls and 13.0 for the Colts balls.”

- Many different methods were used to determine whether the drops in pressure were due to chance, including physics.
- We computed an observed difference of 0.74, which is in line with the findings of the report.

- In the end, Tom Brady (quarterback for the Patriots at the time) was suspended 4 games and the team was fined $1 million.
- The Deflategate Wikipedia article is extremely thorough; give it a read if you're curious!

- Permutation tests help us determine if **two samples** came from the same population. We can answer questions like:
  - "Do smokers' babies and non-smokers' babies weigh the same?"
  - More generally: are these things like those things?

- Permutation testing strategy:
  - Create a "population" by pooling the data from both samples together.
  - Randomly divide this "population" into two groups of the same sizes as the original samples.
  - Repeat this process, calculating the test statistic for each pair of random groups.
  - Generate an empirical distribution of test statistics and see whether the observed statistic is consistent with it.

- Implementation:
  - To randomly divide the "population" into two groups of the same sizes as the original samples, we just shuffle the group labels and use the shuffled group labels to define the two random groups.
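The whole strategy can be packaged into one reusable function. This is a sketch in plain NumPy (the function name and signature are mine, not from the lecture); shuffling the pooled values and splitting by position is equivalent to shuffling the group labels.

```python
import numpy as np

def permutation_test(group_a, group_b, n_repetitions=5000):
    """Sketch of a permutation test for a difference in means.

    Returns the observed statistic (mean of A minus mean of B) and the
    p-value: the proportion of shuffles whose statistic is at least as large.
    """
    group_a = np.asarray(group_a, dtype=float)
    group_b = np.asarray(group_b, dtype=float)
    observed = group_a.mean() - group_b.mean()

    # Pool both samples, then repeatedly shuffle and re-split into
    # groups of the original sizes.
    pooled = np.append(group_a, group_b)
    n_a = len(group_a)
    simulated = np.array([])
    for _ in range(n_repetitions):
        shuffled = np.random.permutation(pooled)
        stat = shuffled[:n_a].mean() - shuffled[n_a:].mean()
        simulated = np.append(simulated, stat)

    p_value = np.count_nonzero(simulated >= observed) / n_repetitions
    return observed, p_value
```

For instance, calling this on the Patriots' and Colts' pressure drops would reproduce the Deflategate test above.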

- Permutation tests are one way to perform A/B tests.
- These are both also hypothesis tests.
- An A/B test aims to determine if two samples are from the same population (the name comes from giving names to the samples: sample A and sample B).
- We implemented A/B tests using permutations. But outside of this class, permutation tests can be used for other purposes, and A/B tests can be done without permutations. **For us, they mean the same thing, so if you see "A/B test" anywhere in this class, it refers to a permutation test.**

- We'll switch our focus to **prediction** – given a sample, what can I predict about data not in that sample?
- In the next 3 lectures, we'll focus on **linear regression**, a prediction technique that tries to find the best "linear relationship" between two numerical variables.
  - Along the way, we'll address another idea – **correlation**.
  - You will see linear regression in many more courses; it is one of the most important tools in the data science toolkit.