Lecture 11 – Permutation Testing, Missingness Mechanisms

DSC 80, Winter 2023

Announcements

Agenda

Additional resources:

Differences between categorical distributions

Hypothesis testing vs. permutation testing

"Standard" hypothesis testing helps us answer questions of the form:

I have a population distribution, and I have one sample. Does this sample look like it was drawn from the population?

Permutation testing helps us answer questions of the form:

I have two samples, but no information about any population distributions. Do these samples look like they were drawn from the same population?

Example: Married vs. unmarried couples

Let's load in a cleaned version of the couples dataset from the last lecture.

Understanding employment status in households

To answer these questions, let's compute the distribution of employment status conditional on household type (married vs. unmarried).
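
As a rough sketch of that computation, the snippet below builds a tiny stand-in DataFrame and normalizes a cross-tabulation by column. The column names `mar_status` and `empl_status` and every value in the frame are assumptions made for illustration; they are not the actual couples data.

```python
import pandas as pd

# Hypothetical stand-in for the couples data (names and values are made up).
couples = pd.DataFrame({
    'mar_status': ['married', 'married', 'married', 'married',
                   'unmarried', 'unmarried', 'unmarried', 'unmarried'],
    'empl_status': ['working', 'working', 'working', 'not working',
                    'working', 'working', 'not working', 'not working'],
})

# Rows: employment status. Columns: household type.
# normalize='columns' makes each column sum to 1, so each column is the
# distribution of employment status conditional on household type.
cond_distr = pd.crosstab(couples['empl_status'], couples['mar_status'],
                         normalize='columns')
print(cond_distr)
```

Each column of `cond_distr` is one of the two categorical distributions being compared.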

Differences in the distributions

Are the distributions of employment status for married people and for unmarried people who live with their partners different?

Is this difference just due to noise?

Permutation test for household composition

Discussion Question

What is a good test statistic in this case?

Hint: What kind of distributions are we comparing?

Total variation distance

Let's first compute the observed TVD, using our new knowledge of the diff method.

Since we'll need to calculate the TVD repeatedly, let's define a function that computes it.
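
One possible shape for such a function is sketched below. The name `tvd_of_groups`, its arguments, and the toy data are all hypothetical; the only ideas taken from the lecture are the TVD itself and the use of `diff` to subtract one distribution from the other.

```python
import pandas as pd

def tvd_of_groups(df, groups, cats):
    """TVD between the distributions of `cats` within the two groups in `groups`."""
    # Each column is the distribution of `cats` for one group.
    distr = pd.crosstab(df[cats], df[groups], normalize='columns')
    # diff(axis=1) subtracts the first column from the second. The first
    # resulting column is all NaN, so only the last column is kept.
    return distr.diff(axis=1).iloc[:, -1].abs().sum() / 2

# Hypothetical data, for illustration only.
couples = pd.DataFrame({
    'mar_status': ['married'] * 4 + ['unmarried'] * 4,
    'empl_status': ['working', 'working', 'working', 'not working',
                    'working', 'working', 'not working', 'not working'],
})

observed_tvd = tvd_of_groups(couples, 'mar_status', 'empl_status')
print(observed_tvd)
```

This sketch assumes the grouping column has exactly two distinct values; with more groups, `diff` would only compare the last two columns.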

Simulation

Here, we'll shuffle marital statuses, though remember, we could shuffle employment statuses too.

Let's do this repeatedly.

Notice that by defining a function that computes our test statistic, our simulation code is much cleaner.
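
The loop below is a minimal sketch of what that simulation could look like. The data is randomly generated, and the helper `tvd_of_groups`, the column names, and the number of repetitions are assumptions rather than the lecture's actual code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical data: labels are drawn independently, so the null is true here.
n = 200
couples = pd.DataFrame({
    'mar_status': rng.choice(['married', 'unmarried'], size=n),
    'empl_status': rng.choice(['working', 'not working', 'other'], size=n),
})

def tvd_of_groups(df, groups, cats):
    """TVD between the distributions of `cats` within the two groups in `groups`."""
    distr = pd.crosstab(df[cats], df[groups], normalize='columns')
    return distr.diff(axis=1).iloc[:, -1].abs().sum() / 2

observed_tvd = tvd_of_groups(couples, 'mar_status', 'empl_status')

n_repetitions = 500
tvds = np.empty(n_repetitions)
for i in range(n_repetitions):
    # Shuffle marital statuses, breaking any link with employment status.
    shuffled = couples.assign(
        mar_status=rng.permutation(couples['mar_status'].to_numpy())
    )
    tvds[i] = tvd_of_groups(shuffled, 'mar_status', 'empl_status')

# Proportion of simulated TVDs at least as large as the observed one.
p_value = np.mean(tvds >= observed_tvd)
print(observed_tvd, p_value)
```

Because the test statistic lives in its own function, the loop body only has to shuffle one column and call that function.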

Conclusion of the test

We reject the null hypothesis that married and unmarried households have the same distribution of employment status.

We can't say anything about why the employment makeups are different, though!

Discussion Question

In the definition of the TVD, we divide the sum of the absolute differences in proportions between the two distributions by 2.

import numpy as np
def tvd(a, b):
    # a, b: arrays of category proportions, listed in the same category order
    return np.sum(np.abs(a - b)) / 2

Question: If we divided by 200 instead of 2, would we still reject the null hypothesis?

Missingness mechanisms

Imperfect data

We will focus on the second problem.

Types of missingness

There are four key ways in which values can be missing. It is important to distinguish between these types so that we can correctly impute (fill in) the missing data.

Missing by design (MD)

Missing by design

Example: 'Car Type?' and 'Car Colour?' are missing if and only if 'Own a car?' is 'No'.
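
The "if and only if" condition can be checked directly by comparing missingness masks. The survey responses below are made up for illustration; only the three column names come from the example.

```python
import numpy as np
import pandas as pd

# Hypothetical survey responses.
survey = pd.DataFrame({
    'Own a car?': ['Yes', 'No', 'Yes', 'No'],
    'Car Type?': ['SUV', np.nan, 'Truck', np.nan],
    'Car Colour?': ['Blue', np.nan, 'Red', np.nan],
})

no_car = survey['Own a car?'] == 'No'

# "If and only if": each missingness mask must match the no_car mask exactly.
md_holds = all(
    (survey[col].isna() == no_car).all()
    for col in ['Car Type?', 'Car Colour?']
)
print(md_holds)
```

If the masks differ in even one row, the missingness in that column cannot be explained by design alone.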

Other types of missingness

Mom... the dog ate my data! 🐶

Consider the following (contrived) example:

Discussion Question

We are now missing birth months for the first 10 people we surveyed. What is the missingness mechanism for birth months if:

  1. Cards were sorted by favorite color?
  2. Cards were sorted by birth month?
  3. Cards were shuffled?

Remember:

Discussion Question, solved

The real world is messy! 🌎

Not missing at random (NMAR)

Missing completely at random (MCAR)

Missing at random (MAR)

Isn't everything NMAR? 🤔

Flowchart

A good strategy is to assess missingness in the following order.

Missing by design (MD)

Can I determine the missing value exactly by looking at the other columns? 🤔
$$\downarrow$$

Not missing at random (NMAR)

Is there a good reason why the missingness depends on the values themselves? 🤔
$$\downarrow$$

Missing at random (MAR)

Do other columns tell me anything about the likelihood that a value is missing? 🤔
$$\downarrow$$

Missing completely at random (MCAR)
The missingness must not depend on other columns or the values themselves. 😄

Discussion Question

In each of the following examples, decide whether the missing data are likely to be MD, NMAR, MAR, or MCAR:

Why do we care again?

Formal definitions

We won't spend much time on these in lecture, but you may find them helpful.

Formal definition: MCAR

Suppose we have:

Data is missing completely at random (MCAR) if

$$\text{P}(\text{data is present} \: | \: Y_{obs}, Y_{mis}, \psi) = \text{P}(\text{data is present} \: | \: \psi)$$

That is, adding information about the dataset doesn't change the likelihood data is missing!
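
To make the definition concrete, here is a small simulation sketch. It treats $\psi$ as a single missingness probability, which is a simplifying assumption, and the columns `x` and `y` are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(23)
n = 1000

# Hypothetical complete data.
df = pd.DataFrame({
    'x': rng.normal(size=n),
    'y': rng.normal(size=n),
})

psi = 0.3  # assumed: every y value is missing with the same probability

# The coin flips ignore both x and y, which is what makes this MCAR.
missing = rng.random(n) < psi
mcar = df.assign(y=df['y'].where(~missing))

print(mcar['y'].isna().mean())
```

Since missingness is unrelated to the data, the observed values of `y` behave like a random sample of all values of `y`.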

Formal definition: MAR

Suppose we have:

Data is missing at random (MAR) if

$$\text{P}(\text{data is present} \: | \: Y_{obs}, Y_{mis}, \psi) = \text{P}(\text{data is present} \: | \: Y_{obs}, \psi)$$

That is, MAR data is actually MCAR, conditional on $Y_{obs}$.
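
A comparable sketch for MAR lets the missingness probability of `y` depend only on the fully observed column `x`. The column names and the probabilities 0.6 and 0.1 are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(23)
n = 1000

# Hypothetical complete data; x is always observed.
df = pd.DataFrame({
    'x': rng.normal(size=n),
    'y': rng.normal(size=n),
})

# Missingness in y depends on x, but not on y itself.
missing_prob = np.where(df['x'] > 0, 0.6, 0.1)
missing = rng.random(n) < missing_prob
mar = df.assign(y=df['y'].where(~missing))

# The missingness rate of y differs across groups defined by x.
x_group = np.where(mar['x'] > 0, 'x > 0', 'x <= 0')
rates = mar['y'].isna().groupby(x_group).mean()
print(rates)
```

Within each group of `x`, every `y` value has the same chance of being missing, which is the sense in which the data is MCAR conditional on the observed column.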

Formal definition: NMAR

Suppose we have:

Data is not missing at random (NMAR) if

$$\text{P}(\text{data is present} \: | \: Y_{obs}, Y_{mis}, \psi)$$

cannot be simplified. That is, in NMAR data, missingness is dependent on the missing value itself.
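
For contrast, the sketch below makes the missingness of `y` depend on the value of `y` itself. As before, the data, column names, and probabilities are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(23)
n = 1000

# Hypothetical complete data.
df = pd.DataFrame({
    'x': rng.normal(size=n),
    'y': rng.normal(size=n),
})

# Large values of y are more likely to go missing than small ones.
missing_prob = np.where(df['y'] > 0, 0.6, 0.1)
missing = rng.random(n) < missing_prob
nmar = df.assign(y=df['y'].where(~missing))

# The observed mean is biased, and no other column explains why.
true_mean = df['y'].mean()
observed_mean = nmar['y'].mean()
print(true_mean, observed_mean)
```

In real data the complete column is never available, so this bias cannot be measured directly; reasoning about how the data was generated is needed instead.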

Assessing missingness through data

Assessing NMAR

Assessing MAR

Deciding between MCAR and MAR

Phone              Screen Size  Price
iPhone 14          6.06         999
Galaxy Z Fold 4    7.6          NaN
OnePlus 9 Pro      6.7          799
iPhone 13 Pro Max  6.68         NaN
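
One common way to probe this is a permutation test: compare the screen sizes of phones whose price is missing with those whose price is present. The sketch below applies that idea to the four rows above. The choice of test statistic (absolute difference in means) and the number of repetitions are assumptions, and four rows are far too few for a meaningful conclusion.

```python
import numpy as np
import pandas as pd

phones = pd.DataFrame({
    'Phone': ['iPhone 14', 'Galaxy Z Fold 4', 'OnePlus 9 Pro', 'iPhone 13 Pro Max'],
    'Screen Size': [6.06, 7.6, 6.7, 6.68],
    'Price': [999, np.nan, 799, np.nan],
})

def abs_diff_in_means(sizes, price_missing):
    """Absolute difference in mean screen size: price missing vs. present."""
    return abs(sizes[price_missing].mean() - sizes[~price_missing].mean())

sizes = phones['Screen Size'].to_numpy()
price_missing = phones['Price'].isna().to_numpy()
observed = abs_diff_in_means(sizes, price_missing)

# Under the null (missingness of Price unrelated to Screen Size),
# shuffling the missingness labels should give similar statistics.
rng = np.random.default_rng(0)
sims = np.array([
    abs_diff_in_means(sizes, rng.permutation(price_missing))
    for _ in range(1000)
])
p_value = np.mean(sims >= observed)
print(observed, p_value)
```

A small p-value would suggest that missingness in Price depends on Screen Size (MAR); a large one is consistent with, but does not prove, MCAR.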

Summary, next time

Summary

Next time