Lecture 13 – Imputation

DSC 80, Winter 2023

📣 Announcements

Midterm Exam Logistics

Agenda

Recap: Identifying missingness mechanisms

Review: Missingness mechanisms

Deciding between MAR and MCAR

Recall, the "missing value flowchart" says that we should:

To decide between MAR and MCAR, we can look at the data itself.

Deciding between MAR and MCAR

Example: Heights

Today, we'll use the same heights dataset as we did last time.

Example: Missingness of 'child' heights on 'father''s heights (MCAR)

Aside: In util.py, there are several functions that we've created to help us with this lecture.

Example: Missingness of 'child' heights on 'father''s heights (MCAR)

Difference in means vs. K-S statistic

Example: Missingness of 'child' heights on 'father''s heights (MCAR)

The ks_2samp function from scipy.stats can do the entire permutation test for us, if we want to use the K-S statistic!

(If we want to use the difference of means, we'd have to run a for-loop.)
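As a sketch of what `ks_2samp` does for us here: below, we simulate an MCAR version of a heights dataset and compare the distribution of `'father'` heights where `'child'` is missing vs. not missing. The dataset itself is synthetic (the column names and parameters are assumptions for illustration); in lecture, we'd use the real heights data.

```python
# Sketch: K-S test comparing 'father' heights in rows where 'child' is
# missing vs. present. Synthetic data; column names are assumptions.
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
heights = pd.DataFrame({
    'father': rng.normal(69, 3, 500),
    'child': rng.normal(67, 3, 500),
})

# Make 'child' MCAR: every row is equally likely to be missing.
heights.loc[rng.random(500) < 0.3, 'child'] = np.nan

missing = heights.loc[heights['child'].isna(), 'father']
present = heights.loc[heights['child'].notna(), 'father']

stat, p_value = ks_2samp(missing, present)
# Under MCAR, the two 'father' distributions should look similar,
# so we expect a small K-S statistic and a large p-value.
```

Note that `ks_2samp` computes the p-value analytically (or via permutation, if you pass `method='exact'` style options in newer SciPy versions), so no explicit `for`-loop is needed.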

Discussion Question

In this MCAR example, if we were to take the mean of the 'child' column that contains missing values, is the result likely to:

  1. Overestimate the true mean?
  2. Underestimate the true mean?
  3. Be accurate?

Example: Missingness of 'child' heights on 'father''s heights (MAR)

Example: Missingness of 'child' heights on 'father''s heights (MAR)

Discussion Question

In this MAR example, if we were to take the mean of the 'child' column that contains missing values, is the result likely to:

  1. Overestimate the true mean?
  2. Underestimate the true mean?
  3. Be accurate?

Handling missing values

What do we do with missing data?

Solution 1: Dropping missing values

Listwise deletion

To illustrate, let's generate two datasets with missing 'child' heights – one in which the heights are MCAR, and one in which they are MAR dependent on 'gender' (not 'father', as in our previous example).

In practice, you'll have to run permutation tests to determine the likely missingness mechanism first!
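One way to generate such datasets is sketched below. This is a synthetic stand-in for the lecture's real heights data (column names, means, and missingness rates are assumptions): `heights_mcar` deletes values uniformly at random, while `heights_mar` deletes `'child'` heights at a much higher rate for one `'gender'` group.

```python
# Sketch: simulating MCAR and MAR missingness in 'child' heights.
# All names and parameters here are illustrative assumptions.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
heights = pd.DataFrame({
    'gender': rng.choice(['male', 'female'], n),
    'child': rng.normal(67, 3, n),
})
# Make female heights shorter on average, so MAR deletion biases the mean.
heights.loc[heights['gender'] == 'female', 'child'] -= 5

# MCAR: every row is missing with the same probability.
heights_mcar = heights.copy()
heights_mcar.loc[rng.random(n) < 0.3, 'child'] = np.nan

# MAR dependent on 'gender': females are far more likely to be missing.
heights_mar = heights.copy()
p_miss = np.where(heights['gender'] == 'female', 0.5, 0.05)
heights_mar.loc[rng.random(n) < p_miss, 'child'] = np.nan
```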

Listwise deletion

Below, we compute the means and standard deviations of the 'child' column in all three datasets. Remember, .mean() and .std() ignore missing values.
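As a minimal illustration of that skipping behavior (toy numbers, not the heights data):

```python
# pandas aggregation methods skip NaN by default, so computing the mean
# of a column with missing values is equivalent to listwise deletion
# on that single column.
import numpy as np
import pandas as pd

s = pd.Series([60.0, np.nan, 70.0, np.nan, 65.0])
s.mean()  # 65.0 — the two NaNs are ignored, same as s.dropna().mean()
```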

Observations:

Solution 2: Imputation

Imputation is the act of filling in missing data with plausible values. Ideally, imputation:

These are hard to do at the same time!

Kinds of imputation

Mean imputation

Mean imputation

Example: Mean imputation in the MCAR heights dataset

Let's look at two distributions:

Mean imputation of MCAR data

Let's fill in missing values in heights_mcar['child'] with the mean of the observed 'child' heights in heights_mcar['child'].
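A minimal sketch of mean imputation (toy values, standing in for `heights_mcar['child']`):

```python
# Mean imputation: fill every missing value with the observed mean.
import numpy as np
import pandas as pd

child = pd.Series([60.0, np.nan, 70.0, np.nan, 65.0])
imputed = child.fillna(child.mean())
# The mean is unchanged, but the variance shrinks, since we added
# several copies of the same value at the center of the distribution.
```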

Observations:

Mean imputation of MCAR data

Let's visualize all three distributions: the original, the MCAR heights with missing values, and the mean-imputed MCAR heights.

Takeaway: When data are MCAR and you impute with the mean:

Example: Mean imputation in the MAR heights dataset

The distributions are not very similar!

Remember that in reality, you won't get to see the turquoise distribution, which has no missing values – instead, you'll try to recreate it, using your sample with missing values.

Mean imputation of MAR data

Let's fill in missing values in heights_mar['child'] with the mean of the observed 'child' heights in heights_mar['child'] and see what happens.

Note that the latter two means are biased high.

Mean imputation of MAR data

Let's visualize all three distributions: the original, the MAR heights with missing values, and the mean-imputed MAR heights.

Since the sample with MAR values was already biased high, mean imputation kept the sample biased – it did not bring the data closer to the data generating process.

With our single mean imputation strategy, the resulting female mean height is biased quite high.

Within-group (conditional) mean imputation

transform returns!

The pink distribution does a better job of approximating the turquoise distribution than the purple distribution.
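A sketch of within-group mean imputation using `transform` (toy data; the column names mirror the MAR example, where missingness depends on `'gender'`):

```python
# Conditional mean imputation: fill each missing 'child' height with
# the mean of the observed heights in that row's 'gender' group.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'gender': ['female', 'female', 'female', 'male', 'male', 'male'],
    'child': [62.0, np.nan, 64.0, 70.0, np.nan, 72.0],
})

df['child'] = (
    df.groupby('gender')['child']
      .transform(lambda s: s.fillna(s.mean()))
)
# Missing female height -> 63.0 (female mean); missing male -> 71.0.
```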

Conclusion: Imputation with single values

Probabilistic imputation

Imputing missing values using distributions

Example: Probabilistic imputation in the MCAR heights dataset

Step 1: Determine the number of missing values in the column of interest.

Step 2: Sample that number of values from the observed values in the column of interest.

Step 3: Fill in the missing values with the sample from Step 2.
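The three steps above can be sketched as follows (synthetic data standing in for `heights_mcar['child']`):

```python
# Probabilistic imputation: fill missing values with a random sample
# drawn from the observed values. Data here is synthetic.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
child = pd.Series(rng.normal(67, 3, 200))
child[rng.random(200) < 0.25] = np.nan

num_missing = child.isna().sum()                       # Step 1
fill = rng.choice(child.dropna(), num_missing)         # Step 2
filled = child.copy()
filled[filled.isna()] = fill                           # Step 3
# Because we sample from the observed distribution, the imputed column
# preserves the spread of the data — no spike at the mean.
```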

Let's look at the results.

Variance is preserved!

No spikes!

Observations

Randomness

Multiple imputation

Steps:

  1. Start with observed and incomplete data.
  2. Create $m$ imputed versions of the data through a probabilistic procedure.
    • The imputed datasets are identical for the observed data entries.
    • They differ in the imputed values.
    • The differences reflect our uncertainty about what value to impute.
  3. Then, compute parameter estimates on each imputed dataset.
    • For instance, the mean, standard deviation, median, etc.
  4. Finally, pool the $m$ parameter estimates into one estimate.
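The steps above can be sketched as follows, using probabilistic imputation as the underlying procedure and the mean as the parameter of interest (synthetic data; in lecture we'd use `heights_mcar`):

```python
# Multiple imputation: impute m times probabilistically, estimate the
# mean on each imputed dataset, then pool the estimates. Synthetic data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
child = pd.Series(rng.normal(67, 3, 300))
child[rng.random(300) < 0.3] = np.nan

def probabilistic_impute(s, rng):
    """Fill NaNs in s with values sampled from s's observed values."""
    s = s.copy()
    observed = s.dropna()
    s[s.isna()] = rng.choice(observed, s.isna().sum())
    return s

m = 100
estimates = [probabilistic_impute(child, rng).mean() for _ in range(m)]
pooled = np.mean(estimates)
# The spread of `estimates` reflects our uncertainty about the
# missing values; `pooled` is the final estimate of the mean.
```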

Multiple imputation

Let's try this procedure out on the heights_mcar dataset.

Each time we run the following cell, it generates a new imputed version of the 'child' column.

Let's run the above procedure 100 times.

Let's plot some of the imputed columns on the previous slide.

Let's look at the distribution of means across the imputed columns.

Summary, next time

Summary of imputation techniques

Summary: Listwise deletion

Summary: Mean imputation

Summary: Conditional mean imputation

# Fill missing values in 'c1' with the mean of 'c1' within each 'c2' group.
means = df.groupby('c2')['c1'].transform('mean')
imputed = df['c1'].fillna(means)

Summary: Probabilistic imputation

Summary: Multiple imputation

Next time