import pandas as pd
import numpy as np
import os
import seaborn as sns
import plotly.express as px
pd.options.plotting.backend = 'plotly'
"Standard" hypothesis testing helps us answer questions of the form:
I have a population distribution, and I have one sample. Does this sample look like it was drawn from the population?
Permutation testing helps us answer questions of the form:
I have two samples, but no information about any population distributions. Do these samples look like they were drawn from the same population?
Let's load in a cleaned version of the couples dataset from the last lecture.
couples_fp = os.path.join('data', 'married_couples_cleaned.csv')
couples = pd.read_csv(couples_fp)
couples.head()
 | mar_status | empl_status | gender | age |
---|---|---|---|---|
0 | married | Working as paid employee | M | 51 |
1 | married | Working as paid employee | F | 53 |
2 | married | Working as paid employee | M | 57 |
3 | married | Working as paid employee | F | 57 |
4 | married | Working as paid employee | M | 60 |
couples.sample(5)
 | mar_status | empl_status | gender | age |
---|---|---|---|---|
774 | married | Not working - looking for work | M | 35 |
668 | unmarried | Working as paid employee | M | 50 |
490 | married | Working as paid employee | M | 52 |
367 | married | Not working - disabled | F | 55 |
208 | married | Working as paid employee | M | 49 |
To compare the two groups, let's compute the distribution of employment status conditional on household type (married vs. unmarried).
# Note that this is a shortcut to picking a column for values and using aggfunc='count'.
empl_cnts = couples.pivot_table(index='empl_status', columns='mar_status', aggfunc='size')
cond_distr = empl_cnts / empl_cnts.sum()
cond_distr
mar_status | married | unmarried |
---|---|---|
empl_status | | |
Not working - disabled | 0.048518 | 0.077055 |
Not working - looking for work | 0.047844 | 0.118151 |
Not working - on a temporary layoff from a job | 0.014151 | 0.022260 |
Not working - other | 0.122642 | 0.056507 |
Not working - retired | 0.063342 | 0.018836 |
Working as paid employee | 0.610512 | 0.594178 |
Working, self-employed | 0.092992 | 0.113014 |
Are the distributions of employment status for married people and for unmarried people who live with their partners different?
Is this difference just due to noise?
cond_distr.plot(kind='barh', title='Distribution of Employment Status, Conditional on Household Type', barmode='group')
Null Hypothesis: In the US, the distribution of employment status among those who are married is the same as among those who are unmarried and live with their partners. The difference between the two observed samples is due to chance.
Alternative Hypothesis: In the US, the distributions of employment status of the two groups are different.
What is a good test statistic in this case?
Hint: What kind of distributions are we comparing?
cond_distr
mar_status | married | unmarried |
---|---|---|
empl_status | | |
Not working - disabled | 0.048518 | 0.077055 |
Not working - looking for work | 0.047844 | 0.118151 |
Not working - on a temporary layoff from a job | 0.014151 | 0.022260 |
Not working - other | 0.122642 | 0.056507 |
Not working - retired | 0.063342 | 0.018836 |
Working as paid employee | 0.610512 | 0.594178 |
Working, self-employed | 0.092992 | 0.113014 |
Since we're comparing two categorical distributions, a natural test statistic is the total variation distance (TVD). Let's first compute the observed TVD, using our new knowledge of the diff method.
cond_distr.diff(axis=1).iloc[:, -1].abs().sum() / 2
0.1269754089281099
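As a sanity check, we can compute the same quantity without diff, straight from the definition of the TVD applied to the two columns of cond_distr.

# Equivalent computation: half the sum of the absolute differences in proportions.
(cond_distr['married'] - cond_distr['unmarried']).abs().sum() / 2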
Since we'll need to calculate the TVD repeatedly, let's define a function that computes it.
def tvd_of_groups(df, groups, cats):
    '''groups: the binary column (e.g. married vs. unmarried).
       cats: the categorical column (e.g. employment status).
    '''
    cnts = df.pivot_table(index=cats, columns=groups, aggfunc='size')
    # Normalize each column.
    distr = cnts / cnts.sum()
    # Compute and return the TVD.
    return distr.diff(axis=1).iloc[:, -1].abs().sum() / 2
# Same result as above.
observed_tvd = tvd_of_groups(couples, groups='mar_status', cats='empl_status')
observed_tvd
0.1269754089281099
couples.head()
 | mar_status | empl_status | gender | age |
---|---|---|---|---|
0 | married | Working as paid employee | M | 51 |
1 | married | Working as paid employee | F | 53 |
2 | married | Working as paid employee | M | 57 |
3 | married | Working as paid employee | F | 57 |
4 | married | Working as paid employee | M | 60 |
Here, we'll shuffle marital statuses, though remember, we could shuffle employment statuses too.
couples.assign(shuffled_mar=np.random.permutation(couples['mar_status']))
 | mar_status | empl_status | gender | age | shuffled_mar |
---|---|---|---|---|---|
0 | married | Working as paid employee | M | 51 | married |
1 | married | Working as paid employee | F | 53 | married |
2 | married | Working as paid employee | M | 57 | unmarried |
3 | married | Working as paid employee | F | 57 | unmarried |
4 | married | Working as paid employee | M | 60 | unmarried |
... | ... | ... | ... | ... | ... |
2063 | unmarried | Working as paid employee | F | 42 | married |
2064 | unmarried | Working as paid employee | M | 60 | married |
2065 | unmarried | Working as paid employee | F | 53 | married |
2066 | unmarried | Working as paid employee | M | 44 | unmarried |
2067 | unmarried | Working as paid employee | F | 42 | married |
2068 rows × 5 columns
Let's do this repeatedly.
N = 1000
tvds = []
for _ in range(N):
    # Shuffle marital statuses.
    with_shuffled = couples.assign(shuffled_mar=np.random.permutation(couples['mar_status']))

    # Compute and store the TVD.
    tvd = tvd_of_groups(with_shuffled, groups='shuffled_mar', cats='empl_status')
    tvds.append(tvd)
Notice that by defining a function that computes our test statistic, our simulation code is much cleaner.
fig = px.histogram(pd.DataFrame(tvds), x=0, nbins=50, histnorm='probability',
title='Empirical Distribution of the TVD')
fig.add_vline(x=observed_tvd, line_color='red')
fig.add_annotation(text=f'<span style="color:red">Observed TVD = {round(observed_tvd, 2)}</span>',
x=1.15 * observed_tvd, showarrow=False, y=0.055)
fig.update_layout(xaxis_range=[0, 0.2])
p_95 = np.percentile(tvds, 95)
fig.add_vline(x=p_95, line_color='purple')
annot_text = f'<span style="color:purple">The 95th percentile of our<br>empirical distribution is {round(p_95, 2)}.<br><br>'
annot_text += 'If our observed statistic is to the<br>right of this point, we will reject the null<br>at a 5% <b>significance level</b>.</span>'
fig.add_annotation(text=annot_text, x=1.5 * np.percentile(tvds, 95), showarrow=False, y=0.05)
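The histogram already suggests the conclusion visually, but we can also compute the p-value directly: it's the proportion of simulated TVDs that are at least as large as the observed TVD.

# The p-value: the proportion of simulated TVDs at least as large as the observed TVD.
p_value = (np.array(tvds) >= observed_tvd).mean()
p_value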
We reject the null hypothesis that married/unmarried households have similar employment makeups.
We can't say anything about why the employment makeups are different, though!
In the definition of the TVD, we divide the sum of the absolute differences in proportions between the two distributions by 2.
def tvd(a, b):
    # Half the sum of the absolute differences between two distributions.
    return np.sum(np.abs(a - b)) / 2
Question: If we divided by 200 instead of 2, would we still reject the null hypothesis?
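One way to convince yourself: rescaling the test statistic rescales the observed statistic and every simulated statistic by the same factor, so the comparisons between them, and hence the p-value, are unchanged. A quick empirical check, reusing tvds and observed_tvd from above:

# Dividing by 200 instead of 2 scales every TVD by 1/100, so the
# p-value, and our decision, stays the same.
p_original = (np.array(tvds) >= observed_tvd).mean()
p_scaled = (np.array(tvds) / 100 >= observed_tvd / 100).mean()
p_original, p_scaled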
Let's now switch gears and discuss missing values.
There are four key ways in which values can be missing: missing by design (MD), not missing at random (NMAR), missing at random (MAR), and missing completely at random (MCAR). It is important to distinguish between these types so that we can correctly impute (fill in) the missing data.
Values that are missing by design (MD) are missing because the design of the data collection process determines exactly which values are missing:

- Example: 'Age4' is missing if and only if 'Number of People' is less than 4.
- Example: 'Car Type?' and 'Car Colour?' are missing if and only if 'Own a car?' is 'No'.
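A nice property of MD values is that we can verify the design rule directly from the data. A minimal sketch, assuming a hypothetical DataFrame survey with the columns from the first example above:

# If 'Age4' is missing exactly when 'Number of People' is less than 4,
# the missingness is by design, and this check returns True.
(survey['Age4'].isna() == (survey['Number of People'] < 4)).all()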
Consider the following (contrived) example:
We are now missing birth months for the first 10 people we surveyed. What is the missingness mechanism for birth months if:
Remember: a good strategy is to assess missingness in the following order: first check whether values are missing by design (MD), then consider NMAR, then MAR, and finally MCAR.
In each of the following examples, decide whether the missing data are likely to be MD, NMAR, MAR, or MCAR:

- A dataset with columns 'gender' and 'age', in which 'age' has missing values.
- A column of 'self-reported education level', which contains missing values.
- A table with columns 'Version 1', 'Version 2', and 'Version 3', in which $\frac{2}{3}$ of the entries in the table are NaN.

We won't spend much time on the following formal definitions in lecture, but you may find them helpful.
Suppose we have:

- A dataset $Y$ with observed values $Y_{obs}$ and missing values $Y_{mis}$.
- A parameter $\psi$ that governs the missingness mechanism.

Data is missing completely at random (MCAR) if

$$\text{P}(\text{data is present} \: | \: Y_{obs}, Y_{mis}, \psi) = \text{P}(\text{data is present} \: | \: \psi)$$

That is, adding information about the dataset doesn't change the likelihood that data is missing!
With the same setup as above, data is missing at random (MAR) if

$$\text{P}(\text{data is present} \: | \: Y_{obs}, Y_{mis}, \psi) = \text{P}(\text{data is present} \: | \: Y_{obs}, \psi)$$

That is, MAR data is actually MCAR, conditional on $Y_{obs}$.
With the same setup as above, data is not missing at random (NMAR) if

$$\text{P}(\text{data is present} \: | \: Y_{obs}, Y_{mis}, \psi)$$

cannot be simplified. That is, in NMAR data, missingness is dependent on the missing value itself.
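To make these three definitions concrete, here's a minimal simulation sketch; the toy dataset, probabilities, and thresholds below are all made up for illustration.

# Toy illustration of the three mechanisms (values and thresholds are made up).
rng = np.random.default_rng(42)
toy = pd.DataFrame({'age': rng.integers(18, 80, size=1000),
                    'income': rng.normal(50, 15, size=1000)})

# MCAR: each 'income' value is missing with the same probability,
# regardless of 'age' or of the income value itself.
mcar = toy['income'].mask(rng.random(1000) < 0.2)

# MAR: 'income' is more likely to be missing for older respondents;
# missingness depends only on the observed column 'age'.
mar = toy['income'].mask((toy['age'] > 60) & (rng.random(1000) < 0.5))

# NMAR: high incomes are more likely to be missing; missingness
# depends on the (unobserved) value itself.
nmar = toy['income'].mask((toy['income'] > 65) & (rng.random(1000) < 0.5))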
Consider the following table of phones, in which some prices are missing:

Phone | Screen Size (inches) | Price (USD) |
---|---|---|
iPhone 14 | 6.06 | 999 |
Galaxy Z Fold 4 | 7.6 | NaN |
OnePlus 9 Pro | 6.7 | 799 |
iPhone 13 Pro Max | 6.68 | NaN |
Missing completely at random (MCAR): The chance that a value is missing is completely independent of other columns and the actual missing value.
Important: Refer to the Flowchart when deciding between missingness types.
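In practice, deciding between MCAR and MAR can reuse the permutation test from the first half of this lecture: compare the distribution of another column for rows where the value is missing versus rows where it isn't. Below is a minimal sketch, assuming a hypothetical DataFrame phones with the same columns as the table above but many more rows; the helper name and column names are illustrative.

# Hypothetical check: is 'Price' MCAR, or does its missingness depend on
# 'Screen Size' (MAR)? Permutation-test the mean screen size of rows with
# missing prices against rows with observed prices.
def mean_screen_diff(df):
    means = df.groupby('price_missing')['Screen Size'].mean()
    return abs(means.diff().iloc[-1])

phones = phones.assign(price_missing=phones['Price'].isna())
observed_diff = mean_screen_diff(phones)

diffs = []
for _ in range(1000):
    shuffled = phones.assign(price_missing=np.random.permutation(phones['price_missing']))
    diffs.append(mean_screen_diff(shuffled))

# A small p-value suggests missingness depends on 'Screen Size' (MAR, not MCAR).
p_value = (np.array(diffs) >= observed_diff).mean()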