Lecture 13 – Imputation

DSC 80, Spring 2022

Announcements

Agenda

Review: Missingness mechanisms

Deciding between MAR and MCAR

Example: Does the missingness of 'child' heights depend on 'father' heights? (MCAR)
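One common way to decide between MAR and MCAR is a permutation test: compare the 'father' heights of rows where 'child' is missing against rows where it is not. The sketch below assumes a synthetic stand-in for the heights data (not the lecture's dataset), a hypothetical helper named missingness_perm_test, and the absolute difference in group means as the test statistic. Other statistics, such as the Kolmogorov-Smirnov statistic, are equally reasonable choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic stand-in for the heights data (an assumption, not the lecture's dataset).
n = 1000
father = rng.normal(69, 2.5, size=n)
child = 0.5 * father + rng.normal(33, 2.5, size=n)
heights = pd.DataFrame({'father': father, 'child': child})

# Make 'child' MCAR: every row has the same 30% chance of being missing.
heights.loc[rng.random(n) < 0.3, 'child'] = np.nan


def missingness_perm_test(df, col_missing, col_other, n_repetitions=1000, rng=rng):
    """Permutation test of whether missingness of col_missing depends on col_other.

    Hypothetical helper; the test statistic is the absolute difference in
    the mean of col_other between the 'missing' and 'not missing' groups.
    """
    is_missing = df[col_missing].isna().to_numpy()
    values = df[col_other].to_numpy()

    def stat(labels):
        return abs(values[labels].mean() - values[~labels].mean())

    observed = stat(is_missing)
    # Shuffle the missingness labels to simulate the null hypothesis
    # that missingness is unrelated to col_other.
    simulated = np.array([
        stat(rng.permutation(is_missing)) for _ in range(n_repetitions)
    ])
    p_value = (simulated >= observed).mean()
    return observed, p_value


observed_stat, p_value = missingness_perm_test(heights, 'child', 'father')
print(f'observed statistic: {observed_stat:.3f}, p-value: {p_value:.3f}')
```

A large p-value means the data are consistent with missingness that does not depend on 'father'. A small one suggests the missingness is related to 'father', which points toward MAR rather than MCAR.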

Discussion Question

In this MCAR example, if we were to take the mean of the 'child' column that contains missing values, is the result likely to:

  1. Overestimate the true mean?
  2. Underestimate the true mean?
  3. Be accurate?

Example: Does the missingness of 'child' heights depend on 'father' heights? (MAR)

Discussion Question

In this MAR example, if we were to take the mean of the 'child' column that contains missing values, is the result likely to:

  1. Overestimate the true mean?
  2. Underestimate the true mean?
  3. Be accurate?

Handling missing values

What do we do with missing data?

Example: Charity

Solution 1: Dropping missing values

Listwise deletion

To illustrate, let's generate another dataset with missing values.

The true 'child' mean with all of the data is as follows.

The 'child' mean in the MCAR dataset is very close to the true 'child' mean:

The 'child' mean in the MAR dataset is quite biased. Note that this is not the same example as before.
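The comparison above can be reproduced in miniature. The sketch below assumes a synthetic heights table (not the lecture's dataset), a 30% MCAR missingness rate, and a MAR mechanism in which children of taller fathers are much more likely to be missing. All of these numbers are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the heights data (an assumption).
n = 5000
father = rng.normal(69, 2.5, size=n)
child = 0.5 * father + rng.normal(33, 2.5, size=n)
full = pd.DataFrame({'father': father, 'child': child})

# MCAR: each 'child' value is missing with the same probability.
mcar = full.copy()
mcar.loc[rng.random(n) < 0.3, 'child'] = np.nan

# MAR: 'child' is far more likely to be missing when 'father' is tall.
mar = full.copy()
p_missing = np.where(full['father'] > 69, 0.6, 0.05)
mar.loc[rng.random(n) < p_missing, 'child'] = np.nan

# Listwise deletion: drop every row that has a missing value.
true_mean = full['child'].mean()
mcar_mean = mcar.dropna()['child'].mean()
mar_mean = mar.dropna()['child'].mean()

print(f'true mean:            {true_mean:.2f}')
print(f'MCAR, after dropping: {mcar_mean:.2f}')
print(f'MAR, after dropping:  {mar_mean:.2f}')
```

Because tall fathers tend to have tall children in this simulation, dropping rows in the MAR case removes mostly tall children and pulls the mean down.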

Solution 2: Imputation

Imputation is the act of filling in missing data with plausible values. Ideally, imputation:

These are hard to satisfy!

Kinds of imputation

Mean imputation

Example: Mean imputation in the MCAR heights dataset

Let's look at two distributions:

Mean imputation of MCAR data

Let's take a look at all three distributions: the original, the MCAR heights with missing values, and the imputed MCAR heights.

Takeaway: When data are MCAR and you impute with the mean:
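A small simulation can make the effect of mean imputation on MCAR data concrete. The sketch below assumes a synthetic 'child' column (not the lecture's dataset) with 40% of values removed completely at random. Filling with the observed mean leaves the mean of the column unchanged, but it shrinks the standard deviation because every filled value sits exactly at the center.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic 'child' heights (an assumption, not the lecture's data).
n = 2000
child = pd.Series(rng.normal(67, 3.5, size=n), name='child')

# Remove 40% of the values completely at random.
child_mcar = child.copy()
child_mcar[rng.random(n) < 0.4] = np.nan

# Mean imputation: fill every missing value with the observed mean.
child_imputed = child_mcar.fillna(child_mcar.mean())

summary = pd.DataFrame(
    {
        'mean': [child.mean(), child_mcar.mean(), child_imputed.mean()],
        'std': [child.std(), child_mcar.std(), child_imputed.std()],
    },
    index=['original', 'MCAR (observed only)', 'mean-imputed'],
)
print(summary)
```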

Example: Mean imputation in the MAR heights dataset

Again, let's look at two distributions:

Within-group (conditional) mean imputation

Note that with our single mean imputation strategy, the resulting male mean height is biased quite low.
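The difference between single mean imputation and within-group mean imputation can be sketched with groupby and transform. The example assumes a synthetic table with a 'gender' column and MAR missingness in which male heights are much more likely to be missing. The column names and rates are illustrative assumptions, not the lecture's data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Synthetic heights with a grouping column (an assumption).
n = 2000
gender = rng.choice(['male', 'female'], size=n)
child = np.where(gender == 'male',
                 rng.normal(69, 2.5, size=n),
                 rng.normal(64, 2.5, size=n))
heights = pd.DataFrame({'gender': gender, 'child': child})

# MAR: missingness of 'child' depends on 'gender'.
heights_mar = heights.copy()
p_missing = np.where(heights['gender'] == 'male', 0.6, 0.1)
heights_mar.loc[rng.random(n) < p_missing, 'child'] = np.nan

# Single mean imputation: one fill value for the whole column.
single = heights_mar['child'].fillna(heights_mar['child'].mean())

# Within-group mean imputation: each group is filled with its own mean.
conditional = (
    heights_mar
    .groupby('gender')['child']
    .transform(lambda s: s.fillna(s.mean()))
)

male = heights['gender'] == 'male'
print('true male mean:              ', round(heights.loc[male, 'child'].mean(), 2))
print('single mean imputation:      ', round(single[male].mean(), 2))
print('within-group mean imputation:', round(conditional[male].mean(), 2))
```

The single fill value is dominated by the mostly observed female heights, so it drags the male mean down. Filling within each group avoids that, provided missingness depends only on the grouping column.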

Discussion Question

Conclusion: Imputation with single values

Discussion Question

Hint: What does the distribution of incomes look like? Where is the mean/median?

Probabilistic imputation

Imputing missing values using distributions

Example: Probabilistic imputation in the MCAR heights dataset

Steps:

  1. Figure out the number of missing values.
  2. Sample that number of values from the observed dataset.
  3. Fill in the missing values with the sample from Step 2.

Step 1: Figure out the number of missing values.

Step 2: Sample that number of values from the observed dataset.

Step 3: Fill in the missing values with the sample from Step 2.
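The three steps above can be sketched as follows. The example assumes a synthetic 'child' column with MCAR missingness (not the lecture's dataset) and samples with replacement from the observed values. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Synthetic 'child' heights with 40% MCAR missingness (an assumption).
n = 2000
child = pd.Series(rng.normal(67, 3.5, size=n), name='child')
child_mcar = child.copy()
child_mcar[rng.random(n) < 0.4] = np.nan

# Step 1: Figure out the number of missing values.
num_null = child_mcar.isna().sum()

# Step 2: Sample that number of values from the observed data.
fill_values = rng.choice(child_mcar.dropna().to_numpy(), size=num_null, replace=True)

# Step 3: Fill in the missing values with the sample from Step 2.
child_prob = child_mcar.copy()
child_prob.loc[child_mcar.isna()] = fill_values

print('observed std:', round(child_mcar.std(), 3))
print('imputed std: ', round(child_prob.std(), 3))
```

Because the fill values are drawn from the observed distribution rather than set to a single number, the spread of the imputed column stays close to the spread of the observed values.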

Let's look at the results.

Variance is preserved!

No spikes!

Observations

Randomness

Multiple imputation

Steps:

  1. Start with observed and incomplete data.
  2. Create several (say, m) imputed versions of the data through a probabilistic procedure.
    • The imputed datasets are identical for the observed data entries.
    • They differ in the imputed values.
    • The differences reflect our uncertainty about what value to impute.
  3. Then, estimate the parameters of interest for each imputed dataset.
    • For instance, the mean, standard deviation, median, etc.
  4. Finally, pool the m parameter estimates into one estimate.

Let's try this procedure out on the heights_mcar dataset.

Each time we run the following cell, it generates a new imputed version of the 'child' column.

Let's run the above procedure 100 times.
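A minimal sketch of this procedure is shown below. It assumes a synthetic stand-in named heights_mcar_demo (not the lecture's heights_mcar), a hypothetical helper impute_once that performs one round of probabilistic imputation, the mean as the parameter of interest, and a simple average as the pooling rule.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)

# Synthetic stand-in for an MCAR heights dataset (an assumption).
n = 2000
heights_mcar_demo = pd.DataFrame({'child': rng.normal(67, 3.5, size=n)})
heights_mcar_demo.loc[rng.random(n) < 0.4, 'child'] = np.nan


def impute_once(col, rng):
    """Return a copy of col with missing values filled by sampling,
    with replacement, from the observed values (hypothetical helper)."""
    col = col.copy()
    missing = col.isna()
    col.loc[missing] = rng.choice(col.dropna().to_numpy(), size=missing.sum())
    return col


# Create m imputed versions and estimate the mean on each one.
m = 100
imputed_means = np.array([
    impute_once(heights_mcar_demo['child'], rng).mean() for _ in range(m)
])

# Pool the m estimates into a single estimate by averaging them.
pooled_mean = imputed_means.mean()

print('observed-only mean:      ', round(heights_mcar_demo['child'].mean(), 3))
print('pooled mean:             ', round(pooled_mean, 3))
print('spread of imputed means: ', round(imputed_means.std(), 4))
```

The spread of imputed_means reflects the uncertainty introduced by imputation. A histogram of that array is one way to view the distribution of means across the imputed columns.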

Let's plot some of the imputed columns above.

Let's look at the distribution of means across the imputed columns.

Summary

Summary of imputation techniques

Summary: listwise deletion

Summary: mean imputation

Summary: conditional mean imputation

Summary: probabilistic imputation

Summary: multiple imputation