Lecture 10 – Permutation Testing

DSC 80, Winter 2023

Announcements

Agenda

Example: Penguins (again!)


Consider the penguins dataset from a few lectures ago.

Average bill length by island

It appears that penguins on Torgersen Island have shorter bills on average than penguins on other islands.

Setup

The plan

Simulation

Again, while you could do this with a for-loop (and you can use a for-loop for hypothesis tests in labs and projects), we'll use the faster, vectorized approach here: drawing all of our samples at once by passing a size argument.

Instead of using np.random.multinomial, which samples from a categorical distribution, we'll use np.random.choice, which samples from a known sequence of values.
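The vectorized approach can be sketched as follows. The bill lengths below are hypothetical stand-ins for penguins['bill_length_mm'], and the sample size of 47 matches the Torgersen group from lecture:

```python
import numpy as np

# Hypothetical stand-in for penguins['bill_length_mm']; the real values
# come from the penguins dataset loaded earlier in lecture.
np.random.seed(42)
bill_lengths = np.random.normal(44, 5, size=300)

n_torgersen = 47        # number of Torgersen penguins, as in lecture
n_repetitions = 10_000

# One call to np.random.choice draws every repetition at once:
# each row of `samples` is one simulated sample of 47 bill lengths.
samples = np.random.choice(bill_lengths, size=(n_repetitions, n_torgersen))

# The test statistic for each repetition is the sample mean.
averages = samples.mean(axis=1)
```

Passing a 2D size avoids a Python-level loop entirely; each row is one repetition.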

Visualizing the empirical distribution of the test statistic

It doesn't look like the average bill length of penguins on Torgersen Island came from the distribution of bill lengths of all penguins in our dataset.

Discussion Question

There is a statistical tool you've learned about that would allow us to find the true probability distribution of the test statistic in this case. What is it?


➡️ Click here to see the answer after you've thought about it. The Central Limit Theorem (CLT). Recall that the CLT tells us that for any population distribution (with a finite mean and SD), the distribution of the sample mean of large random samples is roughly normal, with the same mean as the population mean. Furthermore, it tells us that the standard deviation of the distribution of the sample mean is $\frac{\text{Population SD}}{\sqrt{\text{sample size}}}$. So, the distribution of sample means of samples of size 47 drawn from penguins['bill_length_mm'] is roughly normal with mean penguins['bill_length_mm'].mean() and standard deviation penguins['bill_length_mm'].std(ddof=0) / np.sqrt(47).
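The CLT parameters from the answer above can be computed directly. The bill lengths here are hypothetical stand-ins for penguins['bill_length_mm']:

```python
import numpy as np

# Hypothetical bill lengths standing in for penguins['bill_length_mm'].
np.random.seed(0)
bill_lengths = np.random.normal(44, 5, size=300)

n = 47  # sample size from the lecture

# CLT approximation: sample means of samples of size n are roughly
# normal with this mean and standard deviation.
clt_mean = bill_lengths.mean()
clt_sd = bill_lengths.std(ddof=0) / np.sqrt(n)
```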

Permutation testing

Hypothesis testing vs. permutation testing

"Standard" hypothesis testing helps us answer questions of the form:

I have a population distribution, and I have one sample. Does this sample look like it was drawn from the population?

It does not help us answer questions of the form:

I have two samples, but no information about any population distributions. Do these samples look like they were drawn from the same population?

That's where permutation testing comes in.

Example: Birth weight and smoking 🚬

Note: For familiarity, we'll start with an example from DSC 10. This means we'll move quickly!

Birth weight and smoking 🚬

Let's start by loading in the data.

We're only interested in the 'Birth Weight' and 'Maternal Smoker' columns.

Note that there are two samples:

Exploratory data analysis

How many babies are in each group? What is the average birth weight within each group?
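Both questions can be answered with a single groupby. The miniature DataFrame below is a hypothetical stand-in for the real babies data:

```python
import pandas as pd

# Hypothetical miniature version of the babies DataFrame; the real data
# has a boolean 'Maternal Smoker' column and 'Birth Weight' in ounces.
babies = pd.DataFrame({
    'Maternal Smoker': [True, False, True, False, False, True],
    'Birth Weight': [110, 125, 108, 130, 122, 115],
})

# One groupby gives both the group sizes and the group means.
summary = babies.groupby('Maternal Smoker')['Birth Weight'].agg(['count', 'mean'])
```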

Note that there are 16 ounces in 1 pound, so the above weights are ~7–8 pounds.

Visualizing birth weight distributions

Below, we draw the distributions of both sets of birth weights.

There appears to be a difference, but can it be attributed to random chance?

Hypothesis test setup

Null hypothesis: birth weights come from the same distribution

Alternative hypothesis: birth weights come from different distributions

Choosing a test statistic

We need a test statistic that can measure how different two numerical distributions are.

Easiest solution: Difference in group means.

Difference in group means

We'll choose our test statistic to be:

$$\text{mean weight of smokers' babies} - \text{mean weight of non-smokers' babies}$$

We could equivalently compute the non-smokers' mean minus the smokers' mean.

At first, you may think to use loc with group_means to compute the difference in group means.

However, you can also use the diff method.
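The diff-based computation might look like this, again with a hypothetical miniature babies DataFrame:

```python
import pandas as pd

# Hypothetical miniature babies DataFrame.
babies = pd.DataFrame({
    'Maternal Smoker': [True, False, True, False, False, True],
    'Birth Weight': [110, 125, 108, 130, 122, 115],
})

group_means = babies.groupby('Maternal Smoker')['Birth Weight'].mean()

# .diff() subtracts consecutive entries; since False sorts before True,
# the last entry is (smokers' mean) - (non-smokers' mean).
observed_difference = group_means.diff().iloc[-1]
```

The diff approach avoids hard-coding the group labels with loc.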

Hypothesis test setup

$$\text{mean weight of smokers' babies} - \text{mean weight of non-smokers' babies}$$

Implications of the null hypothesis

Permutation tests

Shuffling

How close are the means of the shuffled groups?

One benefit of shuffling 'Birth Weight' (instead of 'Maternal Smoker') is that grouping by 'Maternal Smoker' then shows all of the relevant information with a single call to groupby.
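One shuffle can be sketched as follows, using a hypothetical miniature babies DataFrame and an assumed column name Shuffled_Weights:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature babies DataFrame.
np.random.seed(1)
babies = pd.DataFrame({
    'Maternal Smoker': [True, False, True, False, False, True],
    'Birth Weight': [110, 125, 108, 130, 122, 115],
})

# Shuffle the 'Birth Weight' column; the 'Maternal Smoker' labels stay put.
with_shuffled = babies.assign(
    Shuffled_Weights=np.random.permutation(babies['Birth Weight'])
)

# One groupby shows the original and shuffled group means side by side.
shuffled_means = (
    with_shuffled
    .groupby('Maternal Smoker')[['Birth Weight', 'Shuffled_Weights']]
    .mean()
)
```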

Let's visualize both pairs of distributions – what do you notice?

Simulating the empirical distribution of the test statistic

We already computed the observed statistic earlier, but we compute it again below to keep all of our calculations together.
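The full simulation might be sketched as follows; the data below is a hypothetical stand-in for the real babies DataFrame, but the loop, statistic, and p-value computation follow the approach described above:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the babies data; the real DataFrame has
# columns 'Maternal Smoker' (bool) and 'Birth Weight' (ounces).
np.random.seed(2)
babies = pd.DataFrame({
    'Maternal Smoker': [True] * 40 + [False] * 60,
    'Birth Weight': np.concatenate([
        np.random.normal(114, 15, 40),   # smokers' babies
        np.random.normal(123, 15, 60),   # non-smokers' babies
    ]),
})

# Observed statistic: (smokers' mean) - (non-smokers' mean).
observed_difference = (
    babies.groupby('Maternal Smoker')['Birth Weight'].mean().diff().iloc[-1]
)

# Empirical distribution of the statistic under the null.
n_repetitions = 500
differences = []
for _ in range(n_repetitions):
    shuffled = np.random.permutation(babies['Birth Weight'])
    group_means = (
        babies.assign(Shuffled=shuffled)
              .groupby('Maternal Smoker')['Shuffled']
              .mean()
    )
    differences.append(group_means.diff().iloc[-1])

# One-sided p-value: how often was the shuffled difference at least
# as extreme (as negative) as the observed one?
p_value = (np.array(differences) <= observed_difference).mean()
```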

Conclusion of the test

⚠️ Caution!

Differences between categorical distributions

Example: Married vs. unmarried couples

We won't use all of the columns in the DataFrame.

Cleaning the dataset

The numbers in the DataFrame correspond to the mappings below.

Understanding the couples dataset

Let's look at the distribution of age separately for married couples and unmarried couples.

How are these two distributions different? Why do you think there is a difference?

Understanding employment status in households

To answer these questions, let's compute the distribution of employment status conditional on household type (married vs. unmarried).

Since there are different numbers of married and unmarried couples in the dataset, we can't compare the counts above directly. We need to convert counts to proportions, separately for married and unmarried couples.

Both of the columns above sum to 1.
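The conversion from counts to proportions might look like this; the counts below are hypothetical stand-ins for those computed from the couples dataset:

```python
import pandas as pd

# Hypothetical counts of employment status by household type; the real
# counts come from the couples dataset.
counts = pd.DataFrame({
    'married': [60, 25, 15],
    'unmarried': [30, 10, 10],
}, index=['employed', 'unemployed', 'not in labor force'])

# Dividing each column by its sum converts counts to proportions,
# so that both columns sum to 1.
distributions = counts / counts.sum()
```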

Differences in the distributions

Are the distributions of employment status for married people and for unmarried people who live with their partners different?

Is this difference just due to noise?

Permutation test for household composition

Discussion Question

What is a good test statistic in this case?

Hint: What kind of distributions are we comparing?

Total variation distance

Let's first compute the observed TVD, using our new knowledge of the diff method.
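A diff-based computation of the observed TVD can be sketched as follows, with hypothetical distributions standing in for the real ones:

```python
import pandas as pd

# Hypothetical employment-status distributions for the two groups.
distributions = pd.DataFrame({
    'married': [0.60, 0.25, 0.15],
    'unmarried': [0.60, 0.20, 0.20],
}, index=['employed', 'unemployed', 'not in labor force'])

# diff(axis=1) puts (unmarried - married) in the last column;
# the TVD is half the sum of the absolute differences.
observed_tvd = distributions.diff(axis=1).iloc[:, -1].abs().sum() / 2
```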

Since we'll need to calculate the TVD repeatedly, let's define a function that computes it.

Simulation

Here, we'll shuffle marital statuses, though remember, we could shuffle employment statuses too.

Let's do this repeatedly.

Notice that by defining a function that computes our test statistic, our simulation code is much cleaner.
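The simulation can be sketched as below. The couples data and the column names mar_status and empl_status are hypothetical stand-ins; the helper employment_tvd is an illustrative name, not from the lecture:

```python
import numpy as np
import pandas as pd

def tvd(a, b):
    # Total variation distance between two categorical distributions.
    return np.sum(np.abs(a - b)) / 2

# Hypothetical stand-in for the couples data.
np.random.seed(3)
couples = pd.DataFrame({
    'mar_status': np.random.choice(['married', 'unmarried'], size=200),
    'empl_status': np.random.choice(
        ['employed', 'unemployed', 'not in labor force'], size=200),
})

def employment_tvd(df):
    # Distribution of employment status conditional on marital status.
    counts = df.pivot_table(index='empl_status', columns='mar_status',
                            aggfunc='size', fill_value=0)
    distributions = counts / counts.sum()
    return tvd(distributions.iloc[:, 0], distributions.iloc[:, 1])

# Shuffle marital statuses repeatedly and recompute the TVD each time.
tvds = []
for _ in range(500):
    shuffled = couples.assign(
        mar_status=np.random.permutation(couples['mar_status'])
    )
    tvds.append(employment_tvd(shuffled))
```

Factoring the statistic into a function keeps the loop body to two steps: shuffle, then recompute.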

Conclusion of the test

We reject the null hypothesis that married and unmarried households have the same employment makeup.

We can't say anything about why the employment makeups are different, though!

Discussion Question

In the definition of the TVD, we divide the sum of the absolute differences in proportions between the two distributions by 2.

def tvd(a, b):
    # Half the sum of absolute differences between the two distributions.
    return np.sum(np.abs(a - b)) / 2

Question: If we divided by 200 instead of 2, would we still reject the null hypothesis?

Summary


Next time

Wrap up hypothesis and permutation testing. Revisit missing values.