Lecture 11 – Missing Values

DSC 80, Spring 2022

Announcements

Agenda

Speeding things up 🏃

Recap: permutation tests

Speeding up permutation tests

Example: Birth weight and smoking 🚬

Recall our permutation test from last class:

Timing the birth weights example ⏰

We'll use 3000 repetitions instead of 500.

Minor improvements

Improvement 1: Use np.random.permutation instead of df.sample.

Why? This way, we don't need to shuffle the index as well. This is how you ran permutation tests in DSC 10.

Improvement 2: Don't use assign; instead, add the new column in-place.

Why? This way, we don't create a new copy of our DataFrame on each iteration.

Let's try out both of these improvements, again with 3000 repetitions.
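A minimal sketch of both improvements together. The DataFrame here is synthetic stand-in data, the column names `'Maternal Smoker'`, `'Birth Weight'`, and `'Shuffled Smoker'` are assumptions, and the test statistic is assumed to be a difference in group means; substitute whatever the original test used.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the birth weights data (hypothetical values).
rng = np.random.default_rng(80)
n_smokers, n_nonsmokers = 459, 715
baby = pd.DataFrame({
    'Maternal Smoker': np.array([True] * n_smokers + [False] * n_nonsmokers),
    'Birth Weight': rng.normal(120, 18, size=n_smokers + n_nonsmokers),
})

def diff_in_means(df, group_col):
    # groupby sorts the keys (False, True), so the last difference is
    # mean(smokers) - mean(non-smokers). Assumed test statistic.
    return df.groupby(group_col)['Birth Weight'].mean().diff().iloc[-1]

n_repetitions = 3000
differences = np.empty(n_repetitions)
for i in range(n_repetitions):
    # Improvement 1: shuffle the raw values, so there is no index to shuffle.
    shuffled = np.random.permutation(baby['Maternal Smoker'].to_numpy())
    # Improvement 2: overwrite one column in place instead of calling assign,
    # which would build a new DataFrame on every iteration.
    baby['Shuffled Smoker'] = shuffled
    differences[i] = diff_in_means(baby, 'Shuffled Smoker')
```

Wrapping the loop in `%%timeit` (in a notebook) before and after each change is one way to measure how much each improvement helps.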

The distribution of test statistics generated by the faster approach resembles the distribution of test statistics generated by the original approach. Both are doing the same thing!

An even faster approach

In is_smoker_permutations, each row is a new simulation.

Note that each row has 459 Trues and 715 Falses – it's just the order of them that's different.

The first row of is_smoker_permutations tells us that in this permutation, we'll assign baby 1 to "smoker", baby 2 to "smoker", baby 3 to "non-smoker", and so on.
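One way such an array could be built is sketched below. The construction is an assumption (the original cell is not shown here); only the name `is_smoker_permutations` and the 459/715 split come from the text above.

```python
import numpy as np

rng = np.random.default_rng(80)
n_repetitions = 3000

# 459 "smoker" labels and 715 "non-smoker" labels, as described above.
is_smoker = np.array([True] * 459 + [False] * 715)

# Each call to rng.permutation returns a shuffled copy of the labels;
# stacking the copies gives one permutation per row.
is_smoker_permutations = np.array(
    [rng.permutation(is_smoker) for _ in range(n_repetitions)]
)
```

The result has shape `(3000, 1174)`: one row per simulation, one column per baby.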

Broadcasting

First, let's try this on just the first permutation (i.e. the first row of is_smoker_permutations).

Now, on all of is_smoker_permutations:

The mean of the non-zero entries in a row is the mean of the weights of "smoker" babies in that permutation.

Why can't we use .mean(axis=1)?

We also need to get the weights of the non-smokers in our permutations. We can do this by "inverting" the is_smoker_permutations mask and performing the same calculations.
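The broadcasting idea can be sketched on a toy example. The six weights and the four permutations below are hypothetical, chosen small enough to inspect by hand.

```python
import numpy as np

# Toy data: 6 babies, 2 of them labeled "smoker" (hypothetical weights).
weights = np.array([100.0, 110.0, 120.0, 130.0, 140.0, 150.0])
is_smoker = np.array([True, True, False, False, False, False])

rng = np.random.default_rng(0)
is_smoker_permutations = np.array([rng.permutation(is_smoker) for _ in range(4)])

# Broadcasting: the (6,) weights array is multiplied against every row of the
# (4, 6) mask, so "non-smoker" positions become 0.
smoker_weights = weights * is_smoker_permutations

# .mean(axis=1) would divide by 6 (zeros included), so divide each row sum
# by the number of Trues in that row instead.
smoker_means = smoker_weights.sum(axis=1) / is_smoker_permutations.sum(axis=1)

# Invert the mask to repeat the calculation for the "non-smoker" group.
non_smoker_mask = ~is_smoker_permutations
non_smoker_means = (weights * non_smoker_mask).sum(axis=1) / non_smoker_mask.sum(axis=1)

# One test statistic per permutation, with no Python-level loop over rows.
differences = smoker_means - non_smoker_means
```

The direction of the difference (smokers minus non-smokers) is an arbitrary choice in this sketch; it only needs to match the direction used for the observed statistic.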

Putting it all together

Again, the distribution of test statistics with the "ultra-fast" simulation is similar to the original distribution of test statistics.

Missingness mechanisms

_Good resources: course notes, Wikipedia, this textbook page_

Imperfect data

We will focus on the second problem.

Types of missingness

There are four key ways in which values can be missing. It is important to distinguish between these types so that we can correctly handle missing data (Lecture 13).

Missing by design (MD)

Missing by design

Example: 'Car Type' and 'Car Colour' are missing if and only if 'Own a car?' is 'No'.
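The "if and only if" condition can be checked directly in the data. A small sketch on a hypothetical survey table (the rows are made up; the column names follow the example above):

```python
import numpy as np
import pandas as pd

# Hypothetical survey responses.
survey = pd.DataFrame({
    'Own a car?': ['Yes', 'No', 'Yes', 'No'],
    'Car Type': ['SUV', np.nan, 'Sedan', np.nan],
    'Car Colour': ['Red', np.nan, 'Blue', np.nan],
})

no_car = survey['Own a car?'] == 'No'

# "If and only if": the missingness mask must match the no-car mask exactly,
# row by row.
type_is_md = (survey['Car Type'].isna() == no_car).all()
colour_is_md = (survey['Car Colour'].isna() == no_car).all()
```

If either check is `False`, some value is missing for a car owner, or present for a non-owner, and missing by design alone does not explain the pattern.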

Other types of missingness

Mom... the dog ate my data! 🐶

Consider the following (contrived) example:

Discussion Question

We are now missing birth months for the first 10 people we surveyed. What is the missingness mechanism for birth months if:

  1. Cards were sorted by favorite color?
  2. Cards were sorted by birth month?
  3. Cards were shuffled?

Remember:

Discussion Question, solved

The real world is messy! 🌎

Not missing at random (NMAR)

Missing at random (MAR)

Missing completely at random (MCAR)

Isn't everything NMAR? 🤔

Flowchart

A good strategy is to assess missingness in the following order.

Missing by design (MD)

Can I determine the missing value exactly by looking at the other columns? 🤔
$$\downarrow$$

Not missing at random (NMAR)

Is there a good reason why the missingness depends on the values themselves? 🤔
$$\downarrow$$

Missing at random (MAR)

Do other columns tell me anything about the likelihood that a value is missing? 🤔
$$\downarrow$$

Missing completely at random (MCAR)
The missingness must not depend on other columns or the values themselves. 😄

Discussion Question

In each of the following examples, decide whether the missing data are MD, NMAR, MAR, or MCAR:

Why do we care again?

Formal definition: MCAR

Suppose we have:

Data is missing completely at random (MCAR) if

$$\text{P}(\text{data is present} \: | \: Y_{obs}, Y_{mis}, \psi) = \text{P}(\text{data is present} \: | \: \psi)$$

That is, adding information about the dataset doesn't change the likelihood that data is missing!

Formal definition: MAR

Suppose we have:

Data is missing at random (MAR) if

$$\text{P}(\text{data is present} \: | \: Y_{obs}, Y_{mis}, \psi) = \text{P}(\text{data is present} \: | \: Y_{obs}, \psi)$$

That is, MAR data is actually MCAR, conditional on $Y_{obs}$.

Formal definition: NMAR

Suppose we have:

Data is not missing at random (NMAR) if

$$\text{P}(\text{data is present} \: | \: Y_{obs}, Y_{mis}, \psi)$$

cannot be simplified. That is, in NMAR data, missingness is dependent on the missing value itself.
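To make the three definitions concrete, here is a simulation sketch. The columns `'age'` and `'income'`, the thresholds, and the probabilities are all hypothetical; only the structure of each mechanism matters.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(80)
n = 10_000

# Hypothetical complete data: 'age' is always observed, 'income' may go missing.
df = pd.DataFrame({
    'age': rng.integers(18, 80, size=n),
    'income': rng.normal(60_000, 15_000, size=n),
})
age = df['age'].to_numpy()
income = df['income'].to_numpy()

# MCAR: every income has the same 20% chance of being missing.
mcar_missing = rng.random(n) < 0.2

# MAR: the chance of a missing income depends only on the observed 'age'.
mar_missing = rng.random(n) < np.where(age < 40, 0.4, 0.1)

# NMAR: the chance of a missing income depends on the income itself.
nmar_missing = rng.random(n) < np.where(income > 70_000, 0.5, 0.1)

for name, mask in [('MCAR', mcar_missing), ('MAR', mar_missing), ('NMAR', nmar_missing)]:
    print(name, 'mean of observed incomes:', round(income[~mask].mean()))
```

In the NMAR case, high incomes are dropped more often, so the mean of the observed incomes understates the true mean.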

Assessing missingness through data

Assessing NMAR

Assessing MAR

Assessing MCAR

| Phone | Screen Size | Price |
| --- | --- | --- |
| iPhone 13 | 6.06 | 999 |
| Galaxy Z Fold 3 | 7.6 | NaN |
| OnePlus 9 Pro | 6.7 | 799 |
| iPhone 12 Pro Max | 6.68 | NaN |
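One way to look at whether the missingness of Price is related to Screen Size is to split the screen sizes by whether the price is missing. A sketch using the four phones above (the variable names are assumptions):

```python
import numpy as np
import pandas as pd

phones = pd.DataFrame({
    'Phone': ['iPhone 13', 'Galaxy Z Fold 3', 'OnePlus 9 Pro', 'iPhone 12 Pro Max'],
    'Screen Size': [6.06, 7.6, 6.7, 6.68],
    'Price': [999, np.nan, 799, np.nan],
})

# Label each row by whether its price is missing, then compare the screen
# sizes across the two groups (index False = price present, True = missing).
price_missing = phones['Price'].isna()
mean_size_by_missingness = phones.groupby(price_missing)['Screen Size'].mean()
```

With only four rows this shows the mechanics, not evidence either way; on a real dataset the two distributions could be compared with a permutation test.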

Summary, next time
