Lecture 12 – Missing Values, Continued

DSC 80, Spring 2022

Announcements

Agenda

Remember: today's lecture is in scope for the Midterm Exam!

Missingness mechanisms

Review: missingness mechanisms

Flowchart

A good strategy is to assess missingness in the following order.

Missing by design (MD)

Can I determine the missing value exactly by looking at the other columns? 🤔
$$\downarrow$$

Not missing at random (NMAR)

Is there a good reason why the missingness depends on the values themselves? 🤔
$$\downarrow$$

Missing at random (MAR)

Do other columns tell me anything about the likelihood that a value is missing? 🤔
$$\downarrow$$

Missing completely at random (MCAR)
The missingness must not depend on other columns or the values themselves. 😄

Discussion Question

In each of the following examples, decide whether the missing data are MD, NMAR, MAR, or MCAR:

Why do we care again?

Formal definition: MCAR

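Suppose we have a dataset $Y$ with observed values $Y_{obs}$ and missing values $Y_{mis}$, and a parameter $\psi$ that represents any relevant information that is not part of the dataset.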

Data is missing completely at random (MCAR) if

$$\text{P}(\text{data is present} \: | \: Y_{obs}, Y_{mis}, \psi) = \text{P}(\text{data is present} \: | \: \psi)$$

That is, adding information about the dataset doesn't change the likelihood that data is missing!

Formal definition: MAR

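Suppose we have a dataset $Y$ with observed values $Y_{obs}$ and missing values $Y_{mis}$, and a parameter $\psi$ that represents any relevant information that is not part of the dataset.

Data is missing at random (MAR) if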

$$\text{P}(\text{data is present} \: | \: Y_{obs}, Y_{mis}, \psi) = \text{P}(\text{data is present} \: | \: Y_{obs}, \psi)$$

That is, MAR data is actually MCAR, conditional on $Y_{obs}$.

Formal definition: NMAR

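Suppose we have a dataset $Y$ with observed values $Y_{obs}$ and missing values $Y_{mis}$, and a parameter $\psi$ that represents any relevant information that is not part of the dataset.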

Data is not missing at random (NMAR) if

$$\text{P}(\text{data is present} \: | \: Y_{obs}, Y_{mis}, \psi)$$

cannot be simplified. That is, in NMAR data, missingness is dependent on the missing value itself.
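To make the three definitions concrete, here is a minimal simulation sketch. The DataFrame, the column names 'group' and 'income', and all of the missingness probabilities are hypothetical and chosen only for illustration; they are not part of the lecture's data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000

# Hypothetical data: 'group' is always observed; 'income' will lose values.
df = pd.DataFrame({
    "group": rng.choice(["A", "B"], size=n),
    "income": rng.normal(loc=50, scale=10, size=n),
})

# MCAR: every value has the same chance (30%) of being missing.
mcar_mask = rng.random(n) < 0.3

# MAR: the chance of being missing depends only on the observed column 'group'.
mar_mask = rng.random(n) < np.where(df["group"] == "A", 0.1, 0.5)

# NMAR: the chance of being missing depends on the value of 'income' itself.
nmar_mask = rng.random(n) < np.where(df["income"] > 50, 0.5, 0.1)

# Proportion of missing 'income' values within each group, per mechanism.
missing_rates = pd.DataFrame({
    "MCAR": pd.Series(mcar_mask).groupby(df["group"]).mean(),
    "MAR": pd.Series(mar_mask).groupby(df["group"]).mean(),
    "NMAR": pd.Series(nmar_mask).groupby(df["group"]).mean(),
})
print(missing_rates)

# Under NMAR, the values that remain are a biased sample of the originals.
observed_mean_nmar = df.loc[~nmar_mask, "income"].mean()
print(df["income"].mean(), observed_mean_nmar)
```

In this sketch, only the MAR column of `missing_rates` should differ noticeably between groups. The NMAR mechanism leaves the group-wise rates similar, which is why NMAR generally cannot be detected from the observed columns alone.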

Assessing missingness through data

Assessing NMAR

Assessing MAR

Assessing MCAR

| Phone | Screen Size | Price |
| --- | --- | --- |
| iPhone 13 | 6.06 | 999 |
| Galaxy Z Fold 3 | 7.6 | NaN |
| OnePlus 9 Pro | 6.7 | 799 |
| iPhone 12 Pro Max | 6.68 | NaN |

Assessing MCAR

Suppose you have a DataFrame with columns named 'col_1', 'col_2', ..., 'col_k', and want to test whether values in 'col_X' are MCAR.

The following pseudocode describes an algorithm for testing whether 'col_X''s missingness is independent of all other columns in the DataFrame:

for i = 1, 2, ..., k, where i != X:
    look at the distribution of col_i when col_X is missing
    look at the distribution of col_i when col_X is not missing
    check if these two distributions are the same
    if so, then col_X's missingness doesn't depend on col_i
    if not, then col_X is MAR dependent on col_i
if all pairs of distributions were the same,
    then col_X is MCAR

We need to make precise what we mean by "the same"!
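The first two steps of each loop iteration (splitting a column by whether 'col_X' is missing) can be sketched in pandas as follows. The helper name `split_by_missingness` and the toy DataFrame are hypothetical; the comparison step is deliberately left out, since it depends on how "the same" is defined.

```python
import numpy as np
import pandas as pd

def split_by_missingness(df, col_x):
    """Split every other column by whether col_x is missing.

    Returns {column name: (values where col_x is missing,
                           values where col_x is not missing)}.
    """
    is_missing = df[col_x].isna()
    return {
        col: (df.loc[is_missing, col], df.loc[~is_missing, col])
        for col in df.columns
        if col != col_x
    }

# Hypothetical example data.
toy = pd.DataFrame({
    "col_1": ["a", "b", "a", "b", "a"],
    "col_2": [1.0, 2.0, 3.0, 4.0, 5.0],
    "col_X": [np.nan, 10.0, np.nan, 30.0, 40.0],
})

pairs = split_by_missingness(toy, "col_X")
when_missing, when_present = pairs["col_2"]
print(when_missing.tolist(), when_present.tolist())
```

Each pair of Series can then be handed to whatever two-sample comparison is appropriate for that column's type.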

Example: Heights

Note that there currently aren't any missing values in heights.

We have three numerical columns – 'father', 'mother', and 'child'. Let's visualize them simultaneously.

Simulating MCAR data

Aside: Why is the value for 'child' in the above Series not exactly 0.3?
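One plausible explanation, assuming the values were removed by sampling a fixed fraction of rows (the code that produced the Series is not shown here): the number of rows to drop must be a whole number, so the realized proportion can only approximate 0.3. A toy sketch with a hypothetical 11-row DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the heights data: 11 rows, no missing values.
toy = pd.DataFrame({"child": np.linspace(60, 70, 11)})

# Simulate MCAR: pick 30% of the rows uniformly at random and blank them out.
toy_mcar = toy.copy()
idx = toy_mcar.sample(frac=0.3, random_state=0).index
toy_mcar.loc[idx, "child"] = np.nan

# 30% of 11 rows is 3.3 rows, which gets rounded to 3 rows.
n_missing = toy_mcar["child"].isna().sum()
prop_missing = toy_mcar["child"].isna().mean()
print(n_missing, prop_missing)
```

Here 3 of 11 values are missing, so the proportion is 3/11 rather than 0.3.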

Verifying that child heights are MCAR in heights_mcar

Comparing null and non-null 'child' distributions for 'gender'

Answer:

Simulation

The code to run our simulation largely looks the same as in previous permutation tests.
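As a rough sketch of what such a simulation might look like, the code below runs a permutation test on a categorical column. It assumes the total variation distance (TVD) as the test statistic, which is a common choice for categorical distributions but is an assumption here. The toy data and the helper names `tvd_between_groups` and `permutation_pvalue` are hypothetical.

```python
import numpy as np
import pandas as pd

def tvd_between_groups(df, cat_col, group_col):
    """TVD between the distributions of cat_col in the two groups of group_col."""
    # Rows: categories. Columns: the two groups. Each column sums to 1.
    dist = pd.crosstab(df[cat_col], df[group_col], normalize="columns")
    return dist.diff(axis=1).iloc[:, -1].abs().sum() / 2

def permutation_pvalue(df, cat_col, missing_col, n_repetitions=500, seed=0):
    rng = np.random.default_rng(seed)
    with_flag = df.assign(is_missing=df[missing_col].isna())
    observed = tvd_between_groups(with_flag, cat_col, "is_missing")

    simulated = np.empty(n_repetitions)
    for i in range(n_repetitions):
        # Shuffle the missingness labels, breaking any link to cat_col.
        shuffled = with_flag.assign(
            is_missing=rng.permutation(with_flag["is_missing"].to_numpy())
        )
        simulated[i] = tvd_between_groups(shuffled, cat_col, "is_missing")

    return observed, (simulated >= observed).mean()

# Hypothetical data in which 'child' is MCAR by construction.
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "gender": rng.choice(["male", "female"], size=n),
    "child": rng.normal(66.7, 3.6, size=n),
})
toy.loc[rng.random(n) < 0.3, "child"] = np.nan

observed, p_value = permutation_pvalue(toy, "gender", "child")
print(observed, p_value)
```

Shuffling the missingness indicator rather than the 'gender' column is a design choice; either one simulates the null hypothesis that the two are unrelated.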

Results

Comparing null and non-null 'child' distributions for 'father'

Concluding that 'child' is MCAR

Simulating MAR data

Now, we will make 'child' heights MAR by deleting 'child' heights according to a random procedure that depends on other columns.

Comparing null and non-null 'child' distributions for 'gender', again

This time, the distribution of 'gender' in the two groups is very different.

Comparing null and non-null 'child' distributions for 'father', again

Observation:

The Kolmogorov-Smirnov test statistic

Recap: permutation tests

Difference in means

The difference in means works well in some cases. Let's look at one such case.

Below, we artificially generate two numerical datasets.

Discussion Question

Different distributions with the same mean

Let's generate two distributions that look very different but have the same mean.

In this case, if we use the difference in means as our test statistic in a permutation test, we will fail to reject the null hypothesis that the two distributions are the same, even though they look very different.
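A small sketch of this failure mode, using hypothetical synthetic samples (not the lecture's): one sample is unimodal, the other is bimodal, and the second is recentered so that the two sample means match exactly. That recentering is an assumption made to keep the illustration deterministic.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Sample a: one bump centered at 0. Sample b: two bumps, at -3 and 3.
a = rng.normal(0, 1, size=n)
b = np.concatenate([rng.normal(-3, 1, size=n // 2),
                    rng.normal(3, 1, size=n // 2)])
# Recenter b so both samples have exactly the same mean (illustration only).
b = b - b.mean() + a.mean()

def abs_diff_in_means(x, y):
    return abs(x.mean() - y.mean())

observed = abs_diff_in_means(a, b)

# Permutation test: shuffle the pooled values and re-split into two groups.
pooled = np.concatenate([a, b])
n_repetitions = 1000
simulated = np.empty(n_repetitions)
for i in range(n_repetitions):
    shuffled = rng.permutation(pooled)
    simulated[i] = abs_diff_in_means(shuffled[:n], shuffled[n:])

p_value = (simulated >= observed).mean()
print(observed, p_value)
```

Because the observed difference is essentially zero, nearly every shuffled difference is at least as large, and the test has no power to tell the two shapes apart.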

Telling quantitative distributions apart

The Kolmogorov-Smirnov test statistic

Aside: cumulative distribution functions

Let's look at the CDFs of our two synthetic distributions.
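For reference, the empirical CDF of a sample at a point x is the proportion of values less than or equal to x, and the K-S statistic is the largest vertical gap between two such CDFs. Below is a minimal hand-rolled sketch; the function names `ecdf` and `ks_statistic` are hypothetical, and the samples are made up.

```python
import numpy as np

def ecdf(sample, x):
    """Proportion of values in `sample` that are <= each entry of x."""
    sample = np.sort(np.asarray(sample))
    return np.searchsorted(sample, x, side="right") / len(sample)

def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of a and b."""
    a, b = np.asarray(a), np.asarray(b)
    # The gap can only change at observed values, so checking those suffices.
    grid = np.concatenate([a, b])
    return np.max(np.abs(ecdf(a, grid) - ecdf(b, grid)))

rng = np.random.default_rng(0)
unimodal = rng.normal(0, 1, size=500)
bimodal = np.concatenate([rng.normal(-3, 1, size=250),
                          rng.normal(3, 1, size=250)])
print(ks_statistic(unimodal, bimodal))
```

Identical samples give a statistic of 0, and samples that do not overlap at all give a statistic of 1.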

The K-S statistic in Python

Fortunately, we don't need to calculate the K-S statistic ourselves! Python can do it for us (and you can use this pre-built version in all assignments).

We don't know if this number is big or small. We need to run a permutation test!
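One way such a permutation test could look, assuming `scipy.stats.ks_2samp` is the pre-built function being referred to and using hypothetical synthetic samples:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical samples with similar means but very different shapes.
a = rng.normal(0, 1, size=300)
b = np.concatenate([rng.normal(-3, 1, size=150),
                    rng.normal(3, 1, size=150)])

observed = ks_2samp(a, b).statistic

# Shuffle the pooled values, re-split, and recompute the K-S statistic.
pooled = np.concatenate([a, b])
n_repetitions = 500
simulated = np.empty(n_repetitions)
for i in range(n_repetitions):
    shuffled = rng.permutation(pooled)
    simulated[i] = ks_2samp(shuffled[:len(a)], shuffled[len(a):]).statistic

# Large K-S values are evidence against the null, so count the upper tail.
p_value = (simulated >= observed).mean()
print(observed, p_value)
```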

We were able to differentiate between the two distributions using the K-S test statistic!

ks_2samp
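`scipy.stats.ks_2samp` returns a p-value alongside the statistic, computed from the distribution of the K-S statistic under the null rather than by shuffling. A minimal usage sketch on hypothetical samples:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical samples: one unimodal, one bimodal.
a = rng.normal(0, 1, size=300)
b = np.concatenate([rng.normal(-3, 1, size=150),
                    rng.normal(3, 1, size=150)])

result = ks_2samp(a, b)
print(result.statistic, result.pvalue)
```

This p-value plays the same role as one obtained from a permutation test that uses the K-S statistic.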

Difference in means vs. K-S statistic

More examples

Note: We are not going to get to these slides in class. They're just here to provide more examples of missingness mechanisms.

Summary: NMAR

Summary: MAR

Summary: MCAR

Example: Cars

Let's use a permutation test!

Missingness of 'car_color' on 'car_make'

Let's test whether the missingness of 'car_color' is dependent on 'car_make'.

Here, we fail to reject the null that the distribution of 'car_make' is the same whether or not 'car_color' is missing.

Example: Assessing missingness in payments data

Summary, next time
