Aside: Fast Permutation Tests¶
from dsc80_utils import *
Speeding things up 🏃¶
Speeding up permutation tests¶
- A permutation test, like all simulation-based hypothesis tests, generates an approximation of the distribution of the test statistic.
- If we found all permutations, the distribution would be exact!
- If there are $a$ elements in one group and $b$ in the other, the total number of permutations is ${a + b \choose a}$.
- If $a = 100$ and $b = 150$, there are ${250 \choose 100} \approx 6 \cdot 10^{71}$ possible permutations (see the quick check below)!
- The more repetitions we use, the better our approximation will be.
- Unfortunately, our code is pretty slow, so we can't use many repetitions.
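One way to sanity-check that count: Python's built-in `math.comb` (available in Python 3.8+) computes the binomial coefficient exactly.
import math
# Number of distinct group assignments: choose which 100 of the
# 250 elements land in the first group.
f"{math.comb(250, 100):.2e}"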
Example: Birth weight and smoking 🚬¶
baby = pd.read_csv('data/babyweights.csv')
baby = baby[['Maternal Smoker', 'Birth Weight']]
baby.head()
|   | Maternal Smoker | Birth Weight |
|---|---|---|
| 0 | False | 120 |
| 1 | False | 113 |
| 2 | True | 128 |
| 3 | True | 108 |
| 4 | False | 136 |
Recall our permutation test from last class:
- Null Hypothesis: In the population, birth weights of smokers' babies and non-smokers' babies have the same distribution, and the observed differences in our samples are due to random chance.
- Alternative Hypothesis: In the population, smokers' babies have lower birth weights than non-smokers' babies, on average. The observed difference in our samples cannot be explained by random chance alone.
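For reference, the observed value of the test statistic – smokers' mean minus non-smokers' mean – can be computed directly, mirroring the `True - False` convention used in the loop below:
# Observed difference in group means. Since False sorts before True,
# .diff().iloc[-1] computes (smokers' mean) - (non-smokers' mean).
observed_difference = (
    baby
    .groupby('Maternal Smoker')
    ['Birth Weight']
    .mean()
    .diff()
    .iloc[-1]
)
observed_difference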
Timing the birth weights example ⏰¶
We'll use 3000 repetitions instead of 500.
%%time

n_repetitions = 3000

differences = []
for _ in range(n_repetitions):
    # Step 1: Shuffle the weights and store them in a DataFrame.
    with_shuffled = baby.assign(Shuffled_Weights=np.random.permutation(baby['Birth Weight']))

    # Step 2: Compute the test statistic.
    # Remember, alphabetically, False comes before True,
    # so this computes True - False.
    group_means = (
        with_shuffled
        .groupby('Maternal Smoker')
        .mean()
        .loc[:, 'Shuffled_Weights']
    )
    difference = group_means.diff().iloc[-1]

    # Step 3: Store the result.
    differences.append(difference)
CPU times: user 1.98 s, sys: 15.7 ms, total: 2 s
Wall time: 2 s
fig = px.histogram(pd.DataFrame(differences), x=0, nbins=50, histnorm='probability',
title='Empirical Distribution of the Test Statistic, Original Approach')
fig.update_layout(xaxis_range=[-5, 5])
A faster approach¶
- Our previous approach involved calling `groupby` inside of a loop.
- We can avoid `groupby` entirely!
- Let's start by generating a Boolean array of size `(3000, 1174)`.
  - Each row will correspond to a single permutation of the `'Maternal Smoker'` (`bool`) column.
is_smoker = baby['Maternal Smoker'].values
weights = baby['Birth Weight'].values
is_smoker
array([False, False, True, ..., True, False, False])
%%time

np.random.seed(24)  # So that we get the same results each time (for lecture).

# We are still using a for-loop!
is_smoker_permutations = np.column_stack([
    np.random.permutation(is_smoker)
    for _ in range(3000)
]).T
CPU times: user 79.1 ms, sys: 5.77 ms, total: 84.9 ms
Wall time: 80.9 ms
In `is_smoker_permutations`, each row is a new simulation.
- `False` means that baby comes from a non-smoking mother.
- `True` means that baby comes from a smoking mother.
is_smoker_permutations
array([[ True,  True, False, ...,  True,  True, False],
       [ True, False,  True, ...,  True, False, False],
       [False,  True, False, ...,  True, False,  True],
       ...,
       [False, False,  True, ..., False, False,  True],
       [ True,  True, False, ...,  True, False,  True],
       [ True,  True,  True, ..., False,  True,  True]])
is_smoker_permutations.shape
(3000, 1174)
Note that each row has 459 `True`s and 715 `False`s – it's just their order that's different.
is_smoker_permutations.sum(axis=1)
array([459, 459, 459, ..., 459, 459, 459])
The first row of `is_smoker_permutations` tells us that in this permutation, we'll assign baby 1 to "smoker", baby 2 to "smoker", baby 3 to "non-smoker", and so on.
is_smoker_permutations[0]
array([ True, True, False, ..., True, True, False])
Broadcasting¶
- If we multiply `is_smoker_permutations` by `weights`, we will create a new `(3000, 1174)` array, in which:
  - the weights of babies assigned to "smoker" are present, and
  - the weights of babies assigned to "non-smoker" are 0.
- `is_smoker_permutations` acts as a "mask".
First, let's try this on just the first permutation (i.e. the first row of `is_smoker_permutations`).
weights * is_smoker_permutations[0]
array([120, 113, 0, ..., 130, 125, 0])
Now, on all of `is_smoker_permutations`:
weights * is_smoker_permutations
array([[120, 113,   0, ..., 130, 125,   0],
       [120,   0, 128, ..., 130,   0,   0],
       [  0, 113,   0, ..., 130,   0, 117],
       ...,
       [  0,   0, 128, ...,   0,   0, 117],
       [120, 113,   0, ..., 130,   0, 117],
       [120, 113, 128, ...,   0, 125, 117]])
The mean of the non-zero entries in a row is the mean of the weights of "smoker" babies in that permutation.

Why can't we use `.mean(axis=1)`? Because it would divide each row's sum by all 1174 entries, zeros included, rather than by the 459 babies assigned to "smoker".
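To see the problem concretely, here's a quick comparison on the first permutation, using the arrays defined above:
masked = weights * is_smoker_permutations[0]

# .mean() divides by all 1174 entries, zeros included -- too small.
print(masked.mean())

# Dividing by the number of smokers (459) gives the correct group mean.
print(masked.sum() / is_smoker.sum())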
n_smokers = is_smoker.sum()
mean_smokers = (weights * is_smoker_permutations).sum(axis=1) / n_smokers
mean_smokers
array([118.94, 118.08, 120.39, ..., 119.77, 119.27, 120.22])
mean_smokers.shape
(3000,)
We also need to get the weights of the non-smokers in our permutations. We can do this by "inverting" the `is_smoker_permutations` mask and performing the same calculations.
n_non_smokers = 1174 - n_smokers
mean_non_smokers = (weights * ~is_smoker_permutations).sum(axis=1) / n_non_smokers
mean_non_smokers
array([119.8 , 120.35, 118.87, ..., 119.27, 119.58, 118.97])
test_statistics = mean_smokers - mean_non_smokers
test_statistics
array([-0.86, -2.28, 1.52, ..., 0.5 , -0.31, 1.25])
Putting it all together¶
%%time

is_smoker = baby['Maternal Smoker'].values
weights = baby['Birth Weight'].values
n_smokers = is_smoker.sum()
n_non_smokers = 1174 - n_smokers

is_smoker_permutations = np.column_stack([
    np.random.permutation(is_smoker)
    for _ in range(3000)
]).T

mean_smokers = (weights * is_smoker_permutations).sum(axis=1) / n_smokers
mean_non_smokers = (weights * ~is_smoker_permutations).sum(axis=1) / n_non_smokers
fast_differences = mean_smokers - mean_non_smokers
CPU times: user 82.8 ms, sys: 3.49 ms, total: 86.3 ms
Wall time: 84.7 ms
fig = px.histogram(pd.DataFrame(fast_differences), x=0, nbins=50, histnorm='probability',
title='Empirical Distribution of the Test Statistic, Faster Approach')
fig.update_layout(xaxis_range=[-5, 5])
The distribution of test statistics from the fast simulation is similar to the original distribution of test statistics – but it took roughly 85 ms to compute instead of 2 s, a speedup of more than 20x.
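As a quick numerical sanity check, both approaches simulate the same null distribution, so their summary statistics should agree up to random noise:
# Means and standard deviations of the two empirical distributions.
print(np.mean(differences), np.std(differences))
print(np.mean(fast_differences), np.std(fast_differences))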
Other performance considerations¶
`np.random.permutation` (fast) vs. `df.sample` (slow)¶

In lecture, we mentioned that `np.random.permutation` is faster than the `df.sample` method. This is because `df.sample` has to shuffle the index as well.
How fast does a single shuffle take for each approach?
to_shuffle = baby.copy()
weights = to_shuffle['Birth Weight']
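Before timing the two approaches, note what each one returns: `df.sample` produces a Series whose index is shuffled along with the values, while `np.random.permutation` returns a plain ndarray with no index to maintain. A quick sketch:
# .sample shuffles the index along with the values...
print(weights.sample(frac=1).head())

# ...while np.random.permutation returns a bare ndarray.
print(np.random.permutation(weights.values)[:5])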
%%timeit
np.random.permutation(weights.values)
15.6 µs ± 179 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
%%timeit
weights.sample(frac=1)
47.1 µs ± 485 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Adding columns in place (fast) vs. `assign` (slow)¶

If you need extra performance, don't use `assign`; instead, add the new column in-place.

Why? This way, we don't create a new copy of our DataFrame on each iteration. Keep in mind that in-place assignment mutates the original DataFrame, which is why we made `to_shuffle` a copy of `baby` above.
%%timeit
to_shuffle['Shuffled_Weights'] = np.random.permutation(weights.values)
32.2 µs ± 373 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%%timeit
to_shuffle.assign(Shuffled_Weights=np.random.permutation(weights.values))
80 µs ± 979 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)