In [1]:
from dsc80_utils import *
Aside: Fast Permutation Tests¶
Speeding things up 🏃¶
Speeding up permutation tests¶
- A permutation test, like all simulation-based hypothesis tests, generates an approximation of the distribution of the test statistic.
- If we found all permutations, the distribution would be exact!
- If there are $a$ elements in one group and $b$ in the other, the total number of permutations is ${a + b \choose a}$.
- If $a = 100$ and $b = 150$, there are more than ${250 \choose 100} \approx 6 \cdot 10^{71}$ permutations!
- The more repetitions we use, the better our approximation will be.
- Unfortunately, our code is pretty slow, so we can't use many repetitions.
Example: Birth weight and smoking 🚬¶
In [2]:
baby_path = Path('data') / 'babyweights.csv'
baby = pd.read_csv(baby_path)
baby = baby[['Maternal Smoker', 'Birth Weight']]
In [3]:
baby.head()
Out[3]:
Maternal Smoker | Birth Weight | |
---|---|---|
0 | False | 120 |
1 | False | 113 |
2 | True | 128 |
3 | True | 108 |
4 | False | 136 |
Recall our permutation test from last class:
- Null Hypothesis: In the population, birth weights of smokers' babies and non-smokers' babies have the same distribution, and the observed differences in our samples are due to random chance.
- Alternative Hypothesis: In the population, smokers' babies have lower birth weights than non-smokers' babies, on average. The observed difference in our samples cannot be explained by random chance alone.
Timing the birth weights example ⏰¶
We'll use 3000 repetitions instead of 500.
In [4]:
%%time
n_repetitions = 3000
differences = []
for _ in range(n_repetitions):
# Step 1: Shuffle the weights and store them in a DataFrame.
with_shuffled = baby.assign(Shuffled_Weights=np.random.permutation(baby['Birth Weight']))
# Step 2: Compute the test statistic.
# Remember, alphabetically, False comes before True,
# so this computes True - False.
group_means = (
with_shuffled
.groupby('Maternal Smoker')
.mean()
.loc[:, 'Shuffled_Weights']
)
difference = group_means.diff().iloc[-1]
# Step 3: Store the result.
differences.append(difference)
CPU times: user 1.9 s, sys: 1.4 ms, total: 1.9 s Wall time: 1.91 s
In [5]:
pio.renderers.default = 'plotly_mimetype+notebook' # If the plot doesn't load, uncomment this.
fig = px.histogram(pd.DataFrame(differences), x=0, nbins=50, histnorm='probability',
title='Empirical Distribution of the Test Statistic, Original Approach')
fig.update_layout(xaxis_range=[-5, 5])