Aside: Fast Permutation Tests

Speeding things up 🏃

Speeding up permutation tests

Example: Birth weight and smoking 🚬

Recall our permutation test from last class:

Timing the birth weights example ⏰

We'll use 3000 repetitions instead of 500.
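A loop-based version of the test might look like the sketch below. The babies DataFrame here is a hypothetical stand-in for the dataset (synthetic weights, with the 459 smoker / 715 non-smoker split from the lecture), not the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical stand-in for the babies dataset: 1,174 rows,
# 459 smokers and 715 non-smokers.
babies = pd.DataFrame({
    'Maternal Smoker': np.r_[np.ones(459, dtype=bool), np.zeros(715, dtype=bool)],
    'Birth Weight': rng.normal(120, 18, size=1174),
})

n_repetitions = 3000
weights = babies['Birth Weight']

differences = []
for _ in range(n_repetitions):
    # Shuffle the group labels, then recompute the difference in group means.
    shuffled = babies['Maternal Smoker'].sample(frac=1).reset_index(drop=True)
    diff = weights[shuffled.values].mean() - weights[~shuffled.values].mean()
    differences.append(diff)
```

Each iteration shuffles with df.sample and recomputes the statistic, which is exactly the per-iteration overhead the faster approach below avoids.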

A faster approach

In is_smoker_permutations, each row is a new simulation.

Note that each row has 459 Trues and 715 Falses; only their order differs from row to row.

The first row of is_smoker_permutations tells us that in this permutation, we'll assign baby 1 to "smoker", baby 2 to "smoker", baby 3 to "non-smoker", and so on.
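One way to build such a matrix of permutations, as a sketch (the label vector is synthetic; only the 459/715 split matches the lecture):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical label vector: 459 smokers (True) and 715 non-smokers (False).
is_smoker = np.r_[np.ones(459, dtype=bool), np.zeros(715, dtype=bool)]

n_repetitions = 3000

# Each row is one independently shuffled copy of the labels.
is_smoker_permutations = np.stack(
    [rng.permutation(is_smoker) for _ in range(n_repetitions)]
)
```

The result has shape (3000, 1174), and every row sums to 459, since a permutation only reorders the labels.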

Broadcasting

First, let's try this on just the first permutation (i.e. the first row of is_smoker_permutations).

Now, on all of is_smoker_permutations:

The mean of the non-zero entries in a row is the mean of the weights of "smoker" babies in that permutation.

Why can't we use .mean(axis=1)? Because it divides each row's sum by the total number of columns (1,174), not by the number of "smoker" babies (459); the zeroed-out non-smoker entries would drag the mean down.
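The broadcasting step can be sketched as follows (synthetic weights and labels; only the shapes and group sizes match the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 1,174 birth weights, 459 of which belong to smokers.
weights = rng.normal(120, 18, size=1174)
is_smoker = np.zeros(1174, dtype=bool)
is_smoker[:459] = True
is_smoker_permutations = np.stack(
    [rng.permutation(is_smoker) for _ in range(3000)]
)

# Broadcasting: a (3000, 1174) mask times a (1174,) vector gives (3000, 1174).
# Each row keeps the smoker weights and zeroes out the non-smokers.
masked = is_smoker_permutations * weights

# .mean(axis=1) would divide by 1174 (zeros included), so instead we
# divide each row's sum by the number of smokers, 459.
mean_smoker_weights = masked.sum(axis=1) / 459
```

Each entry of mean_smoker_weights is the mean "smoker" weight for one permutation, computed for all 3000 permutations at once.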

We also need to get the weights of the non-smokers in our permutations. We can do this by "inverting" the is_smoker_permutations mask and performing the same calculations.

Putting it all together

The distribution of test statistics with the fast simulation is similar to the original distribution of test statistics.
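The full fast pipeline might be sketched like this, again on hypothetical stand-in data with the lecture's 459/715 group sizes:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in data: 459 smokers, 715 non-smokers.
weights = rng.normal(120, 18, size=1174)
is_smoker = np.zeros(1174, dtype=bool)
is_smoker[:459] = True

is_smoker_permutations = np.stack(
    [rng.permutation(is_smoker) for _ in range(3000)]
)

# Row sums over the masked weights give each group's total in one shot;
# dividing by the (fixed) group sizes gives the group means.
mean_smoker = (is_smoker_permutations * weights).sum(axis=1) / 459
# ~ flips the mask: True wherever the baby was assigned "non-smoker".
mean_non_smoker = (~is_smoker_permutations * weights).sum(axis=1) / 715

# 3000 simulated test statistics, with no Python loop over permutations
# at the statistic-computation step.
fast_differences = mean_smoker - mean_non_smoker
```

Each entry of fast_differences is one simulated difference in group means, so its histogram plays the same role as the loop-based distribution.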

Other performance considerations

np.random.permutation (fast) vs df.sample (slow)

In lecture, we mentioned that np.random.permutation is faster than the df.sample method. This is because df.sample has to shuffle the index along with the values.

How fast does a single shuffle take for each approach?
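One way to compare the two, sketched with a simple time.perf_counter timer on a synthetic 1,174-row DataFrame (the time_it helper is ours, not a library function):

```python
import time
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({'Birth Weight': rng.normal(120, 18, size=1174)})

def time_it(f, n=100):
    """Average wall-clock seconds per call of f over n calls."""
    start = time.perf_counter()
    for _ in range(n):
        f()
    return (time.perf_counter() - start) / n

# np.random.permutation shuffles just the underlying array of values.
fast = time_it(lambda: np.random.permutation(df['Birth Weight'].values))

# df.sample(frac=1) shuffles the rows *and* builds a shuffled index.
slow = time_it(lambda: df.sample(frac=1))
```

On typical hardware the array shuffle comes out well ahead, since it skips the index bookkeeping entirely.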

Adding columns in place (fast) vs. assign (slow)

Don't use assign; instead, add the new column in-place.

Why? This way, we don't create a new copy of our DataFrame on each iteration.
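The two styles side by side, as a sketch on a hypothetical DataFrame (the column name 'Shuffled' is ours for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({'Birth Weight': rng.normal(120, 18, size=1174)})
labels = np.zeros(1174, dtype=bool)
labels[:459] = True

# Slow: assign returns a brand-new copy of the DataFrame on every call.
with_copy = df.assign(Shuffled=np.random.permutation(labels))

# Fast: in-place assignment attaches the column to the existing DataFrame.
df['Shuffled'] = np.random.permutation(labels)
```

Inside a loop with thousands of repetitions, avoiding that per-iteration copy is what makes the in-place version noticeably faster.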