Lecture 12 – Missing Values, Continued

DSC 80, Spring 2022

Announcements

Agenda

Remember: today's lecture is in scope for the Midterm Exam!

Missingness mechanisms

Review: missingness mechanisms

Flowchart

A good strategy is to assess missingness in the following order.

Missing by design (MD)

Can I determine the missing value exactly by looking at the other columns? 🤔
$$\downarrow$$

Not missing at random (NMAR)

Is there a good reason why the missingness depends on the values themselves? 🤔
$$\downarrow$$

Missing at random (MAR)

Do other columns tell me anything about the likelihood that a value is missing? 🤔
$$\downarrow$$

Missing completely at random (MCAR)

The missingness must not depend on other columns or the values themselves. 😄

Discussion Question

In each of the following examples, decide whether the missing data are MD, NMAR, MAR, or MCAR:

Why do we care again?

Formal definition: MCAR

Suppose we have:

Data is missing completely at random (MCAR) if

$$\text{P}(\text{data is present} \: | \: Y_{obs}, Y_{mis}, \psi) = \text{P}(\text{data is present} \: | \: \psi)$$

That is, adding information about the dataset doesn't change the likelihood that data is missing!

Formal definition: MAR

Suppose we have:

Data is missing at random (MAR) if

$$\text{P}(\text{data is present} \: | \: Y_{obs}, Y_{mis}, \psi) = \text{P}(\text{data is present} \: | \: Y_{obs}, \psi)$$

That is, MAR data is actually MCAR, conditional on $Y_{obs}$.

Formal definition: NMAR

Suppose we have:

Data is not missing at random (NMAR) if

$$\text{P}(\text{data is present} \: | \: Y_{obs}, Y_{mis}, \psi)$$

cannot be simplified. That is, in NMAR data, missingness is dependent on the missing value itself.

Assessing missingness through data

Assessing NMAR

Assessing MAR

Assessing MCAR

| Phone | Screen Size | Price |
| --- | --- | --- |
| iPhone 13 | 6.06 | 999 |
| Galaxy Z Fold 3 | 7.6 | NaN |
| OnePlus 9 Pro | 6.7 | 799 |
| iPhone 12 Pro Max | 6.68 | NaN |

Assessing MCAR

Suppose you have a DataFrame with columns named 'col_1', 'col_2', ..., 'col_k', and want to test whether values in 'col_X' are MCAR.

The following pseudocode describes an algorithm for testing whether the missingness of 'col_X' is independent of all other columns in the DataFrame:

for i = 1, 2, ..., k, where i != X:
    look at the distribution of col_i when col_X is missing
    look at the distribution of col_i when col_X is not missing
    check if these two distributions are the same
    if so, then col_X's missingness doesn't depend on col_i
    if not, then col_X is MAR, dependent on col_i
if all pairs of distributions were the same,
    then col_X is MCAR

We need to make precise what we mean by "the same"!
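One way to make "the same" precise is a permutation test: shuffle which rows count as missing, and see how often the shuffled groups differ at least as much as the observed ones. The sketch below is a minimal illustration of the pseudocode above, not a definitive implementation. The function names (`tvd`, `abs_diff_in_means`, `assess_missingness`), the choice of test statistics, and the synthetic demo data are all assumptions made for illustration. It also assumes that the tested column has both missing and non-missing values.

```python
import numpy as np
import pandas as pd

def tvd(a, b):
    """Total variation distance between two categorical Series."""
    pa = a.value_counts(normalize=True)
    pb = b.value_counts(normalize=True)
    return pa.subtract(pb, fill_value=0).abs().sum() / 2

def abs_diff_in_means(a, b):
    """Absolute difference in means between two numerical Series."""
    return abs(a.mean() - b.mean())

def missingness_p_value(df, col_x, col_i, n_reps=500, rng=None):
    """Permutation-test p-value for 'missingness of col_x is unrelated to col_i'."""
    rng = np.random.default_rng() if rng is None else rng
    is_missing = df[col_x].isna().to_numpy()
    values = df[col_i]
    # Assumed statistics: difference in means for numbers, TVD for categories.
    stat = abs_diff_in_means if pd.api.types.is_numeric_dtype(values) else tvd
    observed = stat(values[is_missing], values[~is_missing])
    sims = np.empty(n_reps)
    for r in range(n_reps):
        shuffled = rng.permutation(is_missing)  # shuffle the missingness labels
        sims[r] = stat(values[shuffled], values[~shuffled])
    return (sims >= observed).mean()

def assess_missingness(df, col_x, n_reps=500, seed=0):
    """Return {col_i: p-value} for every column other than col_x."""
    rng = np.random.default_rng(seed)
    return {col_i: missingness_p_value(df, col_x, col_i, n_reps, rng)
            for col_i in df.columns if col_i != col_x}

# Synthetic demo data (hypothetical values): 'child' is deleted more often
# for one gender, so its missingness depends on 'gender'.
demo_rng = np.random.default_rng(1)
n = 400
demo = pd.DataFrame({
    'gender': demo_rng.choice(['male', 'female'], n),
    'father': demo_rng.normal(69, 2.5, n),
    'child': demo_rng.normal(67, 3.5, n),
})
p_drop = np.where(demo['gender'] == 'female', 0.7, 0.05)
demo.loc[demo_rng.random(n) < p_drop, 'child'] = np.nan

p_values = assess_missingness(demo, 'child')
print(p_values)
```

A small p-value for some column suggests that the missingness depends on that column (MAR). Large p-values for every column are consistent with MCAR, though they do not prove it.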

Example: Heights

Note that there currently aren't any missing values in heights.

We have three numerical columns – 'father', 'mother', and 'child'. Let's visualize them simultaneously.

Simulating MCAR data

Aside: Why is the value for 'child' in the above Series not exactly 0.3?
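One plausible explanation, assuming each 'child' value is deleted independently with probability 0.3: the number of deleted values is then itself random, so the realized proportion is only close to 0.3. The sketch below illustrates this on a synthetic stand-in for heights. The data values and the helper name `make_mcar` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

# Synthetic stand-in for the heights data (hypothetical values).
heights = pd.DataFrame({
    'father': rng.normal(69, 2.5, n),
    'mother': rng.normal(64, 2.3, n),
    'gender': rng.choice(['male', 'female'], n),
    'child': rng.normal(67, 3.5, n),
})

def make_mcar(df, col, p, rng):
    """Return a copy of df where each value in col is deleted with probability p.

    The deletion ignores every column, including col itself, so the
    resulting missingness is MCAR by construction.
    """
    out = df.copy()
    drop = rng.random(len(out)) < p  # one independent coin flip per row
    out.loc[drop, col] = np.nan
    return out

heights_mcar = make_mcar(heights, 'child', 0.3, rng)

# Proportion of missing values per column; 'child' is near 0.3, not exactly 0.3.
print(heights_mcar.isna().mean())
```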

Verifying that child heights are MCAR in heights_mcar

Comparing null and non-null 'child' distributions for 'gender'

Answer:

Simulation

The code to run our simulation largely looks the same as in previous permutation tests.

Results

Comparing null and non-null 'child' distributions for 'father'

Concluding that 'child' is MCAR

Simulating MAR data

Now, we will make 'child' heights MAR by deleting 'child' heights according to a random procedure that depends on other columns.
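As a rough sketch of such a procedure, the code below deletes 'child' heights with a probability that depends on 'gender'. The synthetic data and the specific deletion probabilities (0.1 and 0.5) are assumptions chosen for illustration, not the values used in lecture.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-in for the heights data (hypothetical values).
heights = pd.DataFrame({
    'father': rng.normal(69, 2.5, n),
    'gender': rng.choice(['male', 'female'], n),
    'child': rng.normal(67, 3.5, n),
})

# Hypothetical deletion probabilities that depend on another column.
p_missing = heights['gender'].map({'male': 0.1, 'female': 0.5}).to_numpy()

heights_mar = heights.copy()
drop = rng.random(n) < p_missing  # coin flip per row, with a gender-specific bias
heights_mar.loc[drop, 'child'] = np.nan

# Proportion of missing 'child' values within each gender.
miss_rate = heights_mar['child'].isna().groupby(heights_mar['gender']).mean()
print(miss_rate)
```

Because the deletion probability depends only on 'gender', which is fully observed, and not on the 'child' values themselves, the resulting missingness is MAR rather than NMAR.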

Comparing null and non-null 'child' distributions for 'gender', again

This time, the distribution of 'gender' in the two groups is very different.

Comparing null and non-null 'child' distributions for 'father', again

Observation:

The Kolmogorov-Smirnov test statistic

Recap: permutation tests

Difference in means

The difference in means works well in some cases. Let's look at one such case.

Below, we artificially generate two numerical datasets.
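As an illustration, the sketch below draws two samples from normal distributions that differ in their means and runs a permutation test using the absolute difference in means. The distribution parameters, sample sizes, and number of repetitions are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two artificial numerical datasets (hypothetical parameters):
# same spread, different centers.
group_a = rng.normal(0, 1, 200)
group_b = rng.normal(1, 1, 200)

observed = abs(group_a.mean() - group_b.mean())

# Permutation test: pool the values, shuffle, and split into two groups again.
pooled = np.concatenate([group_a, group_b])
n_reps = 1000
sims = np.empty(n_reps)
for r in range(n_reps):
    shuffled = rng.permutation(pooled)
    sims[r] = abs(shuffled[:200].mean() - shuffled[200:].mean())

p_value = (sims >= observed).mean()
print(observed, p_value)
```

When the two distributions differ mainly in their centers, as here, the difference in means separates them clearly and the p-value is small.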