Lecture 12 – Identifying Missingness Mechanisms

DSC 80, Spring 2023

Announcements

Agenda

Missingness mechanisms

Flowchart

A good strategy is to assess missingness in the following order.

Missing by design (MD)

Can I determine the missing value exactly by looking at the other columns? 🤔
$$\downarrow$$

Not missing at random (NMAR)

Is there a good reason why the missingness depends on the values themselves? 🤔
$$\downarrow$$

Missing at random (MAR)

Do other columns tell me anything about the likelihood that a value is missing? 🤔
$$\downarrow$$

Missing completely at random (MCAR)
The missingness must not depend on other columns or the values themselves. 😄

Discussion Question

In each of the following examples, decide whether the missing data are likely to be MD, NMAR, MAR, or MCAR:

Why do we care again?

Formal definitions

We won't spend much time on these in lecture, but you may find them helpful.

Formal definition: MCAR

Suppose we have:

Data is missing completely at random (MCAR) if

$$\text{P}(\text{data is present} \: | \: Y_{obs}, Y_{mis}, \psi) = \text{P}(\text{data is present} \: | \: \psi)$$

That is, adding information about the dataset doesn't change the likelihood that data is missing!

Formal definition: MAR

Suppose we have:

Data is missing at random (MAR) if

$$\text{P}(\text{data is present} \: | \: Y_{obs}, Y_{mis}, \psi) = \text{P}(\text{data is present} \: | \: Y_{obs}, \psi)$$

That is, MAR data is actually MCAR, conditional on $Y_{obs}$.

Formal definition: NMAR

Suppose we have:

Data is not missing at random (NMAR) if

$$\text{P}(\text{data is present} \: | \: Y_{obs}, Y_{mis}, \psi)$$

cannot be simplified. That is, in NMAR data, missingness is dependent on the missing value itself.

Assessing missingness through data

Assessing NMAR

Assessing MAR

Deciding between MCAR and MAR

| Phone | Screen Size | Price |
| --- | --- | --- |
| iPhone 14 | 6.06 | 999 |
| Galaxy Z Fold 4 | 7.6 | NaN |
| OnePlus 9 Pro | 6.7 | 799 |
| iPhone 13 Pro Max | 6.68 | NaN |

Deciding between MCAR and MAR

Suppose you have a DataFrame with columns named $\text{col}_1$, $\text{col}_2$, ..., $\text{col}_k$, and want to test whether values in $\text{col}_X$ are MCAR. To test whether $\text{col}_X$'s missingness is independent of all other columns in the DataFrame:

For $i = 1, 2, ..., k$, where $i \neq X$:

If all pairs of distributions are the same, then $\text{col}_X$ is MCAR.
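The procedure above can be sketched as a loop of permutation tests, one per other column. This is a minimal sketch under stated assumptions, not the lecture's own code: the helper names `group_distance` and `missingness_p_values` and the toy DataFrame are hypothetical, and the choice of test statistic (absolute difference in means for numerical columns, total variation distance for categorical ones) is just one reasonable option.

```python
import numpy as np
import pandas as pd


def group_distance(values, is_missing):
    """Distance between the distributions of `values` where col_X is
    missing and where it is not (hypothetical helper)."""
    values = pd.Series(values)
    if pd.api.types.is_numeric_dtype(values):
        # Numerical column: absolute difference in group means.
        return abs(values[is_missing].mean() - values[~is_missing].mean())
    # Categorical column: total variation distance.
    p = values[is_missing].value_counts(normalize=True)
    q = values[~is_missing].value_counts(normalize=True)
    return p.sub(q, fill_value=0).abs().sum() / 2


def missingness_p_values(df, col_x, n_repetitions=200, seed=0):
    """Permutation-test p-value for each column other than `col_x`."""
    rng = np.random.default_rng(seed)
    is_missing = df[col_x].isna().to_numpy()
    p_values = {}
    for col in df.columns.drop(col_x):
        values = df[col].to_numpy()
        observed = group_distance(values, is_missing)
        # Shuffle the missingness labels to simulate the null hypothesis.
        simulated = np.array([
            group_distance(values, rng.permutation(is_missing))
            for _ in range(n_repetitions)
        ])
        p_values[col] = (simulated >= observed).mean()
    return p_values


# Toy data (hypothetical): 'child' is missing far more often for one gender.
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    'gender': rng.choice(['male', 'female'], size=n),
    'father': rng.normal(69, 2.5, size=n),
    'child': rng.normal(67, 3.5, size=n),
})
drop = rng.random(n) < np.where(toy['gender'] == 'female', 0.6, 0.1)
toy.loc[drop, 'child'] = np.nan

p_values = missingness_p_values(toy, 'child')
print(p_values)
```

A small p-value for any column is evidence that the missingness depends on that column, which points toward MAR rather than MCAR.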

Example: Heights

Proof that there aren't currently any missing values in heights:

We have three numerical columns – 'father', 'mother', and 'child'. Let's visualize them simultaneously.

Simulating MCAR data

Aside: Why is the value for 'child' in the above Series not exactly 0.3?
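One way such MCAR data could be simulated is sketched below. It assumes each 'child' value is removed independently with probability 0.3, which may differ from how the notebook does it; `heights_toy` and `heights_mcar_toy` are hypothetical stand-ins for the real DataFrames. Under that assumption the realized proportion of missing values is itself random, so it will usually be close to, but not exactly, 0.3.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical stand-in for the `heights` DataFrame.
heights_toy = pd.DataFrame({
    'father': rng.normal(69, 2.5, size=1000),
    'child': rng.normal(67, 3.5, size=1000),
})

heights_mcar_toy = heights_toy.copy()
# Flag each row independently with probability 0.3, ignoring all columns.
drop = rng.random(len(heights_mcar_toy)) < 0.3
heights_mcar_toy.loc[drop, 'child'] = np.nan

# Proportion of missing values in each column.
print(heights_mcar_toy.isna().mean())
```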

Verifying that child heights are MCAR in heights_mcar

Comparing null and non-null 'child' distributions for 'gender'

To measure the "distance" between two categorical distributions, we use the total variation distance.

Note that with only two categories, the TVD is the same as the absolute difference in proportions for either category.
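As a small illustration, the total variation distance can be computed as half the sum of absolute differences between the two distributions. The function name `tvd` and the proportions below are made up for the example.

```python
import numpy as np


def tvd(dist1, dist2):
    """Total variation distance between two categorical distributions,
    given as proportions over the same categories in the same order."""
    return np.abs(np.asarray(dist1) - np.asarray(dist2)).sum() / 2


# Hypothetical 'gender' proportions when 'child' is missing vs. not missing.
gender_when_missing = [0.48, 0.52]
gender_when_present = [0.53, 0.47]

print(tvd(gender_when_missing, gender_when_present))
# With two categories, this matches the absolute difference for either one.
print(abs(gender_when_missing[0] - gender_when_present[0]))
```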

Simulation

The code to run our simulation largely looks the same as in previous permutation tests.

Results

Comparing null and non-null 'child' distributions for 'father'

We can visualize numerical distributions with histograms, or with kernel density estimates. (See the definition of create_kde_plotly at the top of the notebook if you're curious as to how these are created.)

Concluding that 'child' is MCAR

Simulating MAR data

Now, we will make 'child' heights MAR by deleting 'child' heights according to a random procedure that depends on other columns.
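A minimal sketch of one such procedure follows. It assumes the probability of deletion depends only on 'gender'; the probabilities 0.5 and 0.1 and the DataFrames `toy` and `toy_mar` are hypothetical, not the values or names used in the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000

# Hypothetical stand-in for the `heights` DataFrame.
toy = pd.DataFrame({
    'gender': rng.choice(['male', 'female'], size=n),
    'father': rng.normal(69, 2.5, size=n),
    'child': rng.normal(67, 3.5, size=n),
})

# The chance of deletion depends on an observed column ('gender'),
# not on the 'child' value itself, so the result is MAR.
prob_missing = np.where(toy['gender'] == 'female', 0.5, 0.1)
is_dropped = rng.random(n) < prob_missing

toy_mar = toy.copy()
toy_mar.loc[is_dropped, 'child'] = np.nan

# Proportion of missing 'child' values within each gender.
missing_rate_by_gender = toy_mar['child'].isna().groupby(toy_mar['gender']).mean()
print(missing_rate_by_gender)
```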

Comparing null and non-null 'child' distributions for 'gender', again

This time, the distribution of 'gender' in the two groups is very different.

Comparing null and non-null 'child' distributions for 'father', again

The Kolmogorov-Smirnov test statistic

Recap: Permutation tests

Difference in means

The difference in means works well in some cases. Let's look at one such case.

Below, we artificially generate two numerical datasets.

Discussion Question

Different distributions with the same mean

Let's generate two distributions that look very different but have the same mean.

In this case, if we use the difference in means as our test statistic in a permutation test, we will fail to reject the null hypothesis that the two distributions are the same, even though they are clearly different.
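The failure can be reproduced with a short sketch. The two samples below are hypothetical (one unimodal, one bimodal), and the second is shifted so that the sample means match exactly, which makes the effect easy to see; the function name `diff_in_means_p_value` is also an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic samples (hypothetical data): one unimodal, one bimodal.
a = rng.normal(0, 1, size=500)
b = np.concatenate([rng.normal(-3, 1, size=250), rng.normal(3, 1, size=250)])
# Shift b so that both sample means match exactly.
b = b - b.mean() + a.mean()


def diff_in_means_p_value(x, y, n_repetitions=1000, seed=1):
    """Permutation test using the absolute difference in means."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y])
    observed = abs(x.mean() - y.mean())
    simulated = np.empty(n_repetitions)
    for i in range(n_repetitions):
        # Shuffle the pooled values and split them back into two groups.
        shuffled = rng.permutation(pooled)
        simulated[i] = abs(shuffled[:len(x)].mean() - shuffled[len(x):].mean())
    return (simulated >= observed).mean()


p_value = diff_in_means_p_value(a, b)
print(p_value)
```

The p-value is large because the statistic only looks at the centers of the two samples, which agree, and ignores their very different shapes.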

Telling quantitative distributions apart

The Kolmogorov-Smirnov test statistic

Aside: cumulative distribution functions

Let's look at the CDFs of our two synthetic distributions.
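For reference, an empirical CDF can be computed directly with NumPy, as in the sketch below. The helper names `ecdf` and `max_cdf_gap` are hypothetical. The second helper returns the largest vertical gap between two empirical CDFs, which is the quantity the K-S statistic measures.

```python
import numpy as np


def ecdf(sample, x):
    """Proportion of `sample` that is <= each value in `x`."""
    sample = np.sort(np.asarray(sample))
    return np.searchsorted(sample, x, side='right') / len(sample)


def max_cdf_gap(sample1, sample2):
    """Largest absolute gap between the two empirical CDFs."""
    # The gap can only change at observed data points, so check those.
    grid = np.concatenate([sample1, sample2])
    return np.abs(ecdf(sample1, grid) - ecdf(sample2, grid)).max()


print(ecdf([1, 2, 3, 4], [0, 2, 2.5, 4]))  # proportions 0, 0.5, 0.5, 1
print(max_cdf_gap([1, 2, 3], [4, 5, 6]))   # 1.0, since the samples do not overlap
```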

The K-S statistic in Python

Fortunately, we don't need to calculate the K-S statistic ourselves! Python can do it for us (and you can use this pre-built version in all assignments).
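Assuming the pre-built version referred to here is `scipy.stats.ks_2samp`, a basic call looks like the sketch below. The samples `x` and `y` are hypothetical: both are centered at 0 but have different spreads.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical samples with the same center but different spreads.
x = rng.normal(0, 1, size=300)
y = rng.normal(0, 3, size=300)

result = ks_2samp(x, y)
print(result.statistic)  # the K-S statistic, a number between 0 and 1
print(result.pvalue)     # p-value computed by scipy
```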

We don't know if this number is big or small. We need to run a permutation test!
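A permutation test with the K-S statistic might look like the following sketch. The data are hypothetical (a unimodal sample and a bimodal sample with similar centers), and `scipy.stats.ks_2samp` is used only to compute the statistic, not its built-in p-value.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical samples: similar centers, very different shapes.
a = rng.normal(0, 1, size=200)
b = np.concatenate([rng.normal(-3, 1, size=100), rng.normal(3, 1, size=100)])

observed_ks = ks_2samp(a, b).statistic

pooled = np.concatenate([a, b])
n_repetitions = 200
simulated_ks = np.empty(n_repetitions)
for i in range(n_repetitions):
    # Shuffle the pooled values and split them back into two groups.
    shuffled = rng.permutation(pooled)
    simulated_ks[i] = ks_2samp(shuffled[:len(a)], shuffled[len(a):]).statistic

# Proportion of simulated statistics at least as large as the observed one.
p_value = (simulated_ks >= observed_ks).mean()
print(observed_ks, p_value)
```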

We were able to differentiate between the two distributions using the K-S test statistic!

ks_2samp

Summary, next time

Summary