import numpy as np
import pandas as pd
import os
import plotly.express as px
pd.options.plotting.backend = 'plotly'
baby_fp = os.path.join('data', 'baby.csv')
baby = pd.read_csv(baby_fp)
baby = baby[['Maternal Smoker', 'Birth Weight']]
baby.head()
| | Maternal Smoker | Birth Weight |
|---|---|---|
| 0 | False | 120 |
| 1 | False | 113 |
| 2 | True | 128 |
| 3 | True | 108 |
| 4 | False | 136 |
Recall our permutation test from last class:
We'll use 3000 repetitions instead of 500.
%%time
n_repetitions = 3000

differences = []
for _ in range(n_repetitions):
    # Step 1: Shuffle the weights and store them in a DataFrame.
    with_shuffled = baby.assign(Shuffled_Weights=np.random.permutation(baby['Birth Weight']))

    # Step 2: Compute the test statistic.
    # Remember, alphabetically, False comes before True,
    # so this computes True - False.
    group_means = (
        with_shuffled
        .groupby('Maternal Smoker')
        .mean()
        .loc[:, 'Shuffled_Weights']
    )
    difference = group_means.diff().iloc[-1]

    # Step 3: Store the result.
    differences.append(difference)
CPU times: user 2.04 s, sys: 3.15 ms, total: 2.05 s Wall time: 2.05 s
fig = px.histogram(pd.DataFrame(differences), x=0, nbins=50, histnorm='probability',
title='Empirical Distribution of the Test Statistic, Original Approach')
fig.update_layout(xaxis_range=[-5, 5])
Can we speed this up? A lot of the slowdown comes from using `groupby` inside of a loop. Let's try avoiding the `groupby` entirely!

Idea: Create a (3000, 1174) array in which each row is a shuffled version of the `'Maternal Smoker'` (`bool`) column. We'll start by pulling the relevant columns out as arrays.

is_smoker = baby['Maternal Smoker'].values
weights = baby['Birth Weight'].values
is_smoker
array([False, False, True, ..., True, False, False])
%%time
np.random.seed(24)  # So that we get the same results each time (for lecture).

# We are still using a for-loop!
is_smoker_permutations = np.column_stack([
    np.random.permutation(is_smoker)
    for _ in range(3000)
]).T
CPU times: user 89.4 ms, sys: 2.69 ms, total: 92 ms Wall time: 91.7 ms
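As the comment above notes, we're still using a for-loop to build the permutations. As an aside, here's one loop-free sketch that builds an equivalent array by argsorting rows of random numbers; this isn't from the lecture, just a possible alternative (the variable names are ours).

# Each row of idx is an independent random permutation of 0, 1, ..., 1173,
# so fancy-indexing is_smoker with idx shuffles it 3000 times at once.
idx = np.random.rand(3000, len(is_smoker)).argsort(axis=1)
is_smoker_permutations_alt = is_smoker[idx]  # shape (3000, 1174)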
In `is_smoker_permutations`, each row is a new simulation.

- `False` means that baby comes from a non-smoking mother.
- `True` means that baby comes from a smoking mother.

is_smoker_permutations
array([[ True,  True, False, ...,  True,  True, False],
       [ True, False,  True, ...,  True, False, False],
       [False,  True, False, ...,  True, False,  True],
       ...,
       [False, False,  True, ..., False, False,  True],
       [ True,  True, False, ...,  True, False,  True],
       [ True,  True,  True, ..., False,  True,  True]])
is_smoker_permutations.shape
(3000, 1174)
Note that each row has 459 `True`s and 715 `False`s – it's just the order of them that's different.
is_smoker_permutations.sum(axis=1)
array([459, 459, 459, ..., 459, 459, 459])
The first row of is_smoker_permutations
tells us that in this permutation, we'll assign baby 1 to "smoker", baby 2 to "smoker", baby 3 to "non-smoker", and so on.
is_smoker_permutations[0]
array([ True, True, False, ..., True, True, False])
If we multiply `is_smoker_permutations` by `weights`, we will create a new (3000, 1174) array, in which:

- entries for babies assigned to "smoker" contain their birth weights, and
- entries for babies assigned to "non-smoker" contain 0.

Here, `is_smoker_permutations` is a "mask".

First, let's try this on just the first permutation (i.e. the first row of `is_smoker_permutations`).
weights * is_smoker_permutations[0]
array([120, 113, 0, ..., 130, 125, 0])
Now, on all of `is_smoker_permutations`:
weights * is_smoker_permutations
array([[120, 113,   0, ..., 130, 125,   0],
       [120,   0, 128, ..., 130,   0,   0],
       [  0, 113,   0, ..., 130,   0, 117],
       ...,
       [  0,   0, 128, ...,   0,   0, 117],
       [120, 113,   0, ..., 130,   0, 117],
       [120, 113, 128, ...,   0, 125, 117]])
The mean of the non-zero entries in a row is the mean of the weights of "smoker" babies in that permutation.
Why can't we use `.mean(axis=1)`?
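The issue is that `.mean(axis=1)` divides each row's sum by all 1174 entries, including the 0s for the "non-smoker" babies, so it would underestimate the smokers' mean. A quick illustrative check (the variable names here are ours, not from the lecture):

# .mean(axis=1) divides by 1174, zeros included – too small.
naive_means = (weights * is_smoker_permutations).mean(axis=1)
# Dividing the row sums by the number of smokers gives the correct means.
correct_means = (weights * is_smoker_permutations).sum(axis=1) / is_smoker.sum()
naive_means[0], correct_means[0]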
n_smokers = is_smoker.sum()
mean_smokers = (weights * is_smoker_permutations).sum(axis=1) / n_smokers
mean_smokers
array([118.94117647, 118.07625272, 120.38562092, ..., 119.76688453, 119.2745098 , 120.22222222])
mean_smokers.shape
(3000,)
We also need to get the weights of the non-smokers in our permutations. We can do this by "inverting" the is_smoker_permutations
mask and performing the same calculations.
n_non_smokers = 1174 - n_smokers
mean_non_smokers = (weights * ~is_smoker_permutations).sum(axis=1) / n_non_smokers
mean_non_smokers
array([119.7972028 , 120.35244755, 118.86993007, ..., 119.26713287, 119.58321678, 118.97482517])
test_statistics = mean_smokers - mean_non_smokers
test_statistics
array([-0.85602633, -2.27619483, 1.51569085, ..., 0.49975166, -0.30870698, 1.24739705])
%%time
is_smoker = baby['Maternal Smoker'].values
weights = baby['Birth Weight'].values
n_smokers = is_smoker.sum()
n_non_smokers = 1174 - n_smokers
is_smoker_permutations = np.column_stack([
    np.random.permutation(is_smoker)
    for _ in range(3000)
]).T
mean_smokers = (weights * is_smoker_permutations).sum(axis=1) / n_smokers
mean_non_smokers = (weights * ~is_smoker_permutations).sum(axis=1) / n_non_smokers
fast_differences = mean_smokers - mean_non_smokers
CPU times: user 97.5 ms, sys: 3.59 ms, total: 101 ms Wall time: 100 ms
fig = px.histogram(pd.DataFrame(fast_differences), x=0, nbins=50, histnorm='probability',
title='Empirical Distribution of the Test Statistic, Faster Approach')
fig.update_layout(xaxis_range=[-5, 5])
The distribution of test statistics with the fast simulation is similar to the original distribution of test statistics.
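To make "similar" concrete, we can compare summary statistics of the two collections of simulated test statistics; a quick sketch using the arrays computed above:

# Both empirical distributions should be centered near 0 with similar spreads.
print(np.mean(differences), np.std(differences))
print(np.mean(fast_differences), np.std(fast_differences))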
np.random.permutation (fast) vs. df.sample (slow)

In lecture, we mentioned that `np.random.permutation` is faster than using the `df.sample` method. It's because `df.sample` has to shuffle the index as well.
How fast does a single shuffle take for each approach?
to_shuffle = baby.copy()
weights = to_shuffle['Birth Weight']
%%timeit
np.random.permutation(weights.values)
20.1 µs ± 75 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
weights.sample(frac=1)
53.9 µs ± 320 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
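The extra work is visible in the output: `df.sample` returns a Series whose index gets shuffled along with the values, while `np.random.permutation` returns a bare array. A minimal illustration (using the `weights` Series defined above):

weights.sample(frac=1).head()              # values AND index labels are shuffled
np.random.permutation(weights.values)[:5]  # just an array of shuffled values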
Adding columns in-place (fast) vs. assign (slow)

Don't use `assign`; instead, add the new column in-place.
Why? This way, we don't create a new copy of our DataFrame on each iteration.
%%timeit
to_shuffle['Shuffled_Weights'] = np.random.permutation(weights.values)
38.7 µs ± 451 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
to_shuffle.assign(Shuffled_Weights=np.random.permutation(weights.values))
121 µs ± 1.91 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
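Putting these tips together, the original loop from the top of the section could be rewritten with in-place assignment; a sketch under the assumption that we keep the `groupby` version (the fully vectorized approach above is still much faster):

differences_inplace = []
for _ in range(n_repetitions):
    # Shuffle in-place instead of creating a new DataFrame with .assign.
    to_shuffle['Shuffled_Weights'] = np.random.permutation(weights.values)
    group_means = (
        to_shuffle
        .groupby('Maternal Smoker')
        ['Shuffled_Weights']
        .mean()
    )
    # As before, alphabetically False comes before True, so this is True - False.
    differences_inplace.append(group_means.diff().iloc[-1])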