import pandas as pd
import numpy as np
import os
import util
import plotly.express as px
import plotly.figure_factory as ff
pd.options.plotting.backend = 'plotly'
Recall, the "missing value flowchart" says that we should:
To decide between MAR and MCAR, we can look at the data itself.
Today, we'll use the same heights dataset as we did last time.
heights = pd.read_csv(os.path.join('data', 'midparent.csv'))
heights = (
heights
.rename(columns={'childHeight': 'child', 'childNum': 'number'})
.drop('midparentHeight', axis=1)
)
heights.head()
|  | family | father | mother | children | number | gender | child |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 78.5 | 67.0 | 4 | 1 | male | 73.2 |
| 1 | 1 | 78.5 | 67.0 | 4 | 2 | female | 69.2 |
| 2 | 1 | 78.5 | 67.0 | 4 | 3 | female | 69.0 |
| 3 | 1 | 78.5 | 67.0 | 4 | 4 | female | 69.0 |
| 4 | 2 | 75.5 | 66.5 | 4 | 1 | male | 73.5 |
### The dependency of missing 'child' heights on 'father''s heights (MCAR)

Is the missingness of 'child' heights dependent on the 'father' column? To decide, we can compare two distributions:

- 'father' when 'child' is missing.
- 'father' when 'child' is not missing.

If these two distributions look similar, the missingness of 'child' looks to be independent of 'father'.

Aside: In util.py, there are several functions that we've created to help us with this lecture.

- make_mcar takes in a dataset and intentionally drops values from a column such that they are MCAR.
- make_mar does the same for MAR.

# Generating MCAR data.
np.random.seed(42) # So that we get the same results each time (for lecture).
heights_mcar = util.make_mcar(heights, 'child', pct=0.5)
heights_mcar.isna().mean()
family      0.0
father      0.0
mother      0.0
children    0.0
number      0.0
gender      0.0
child       0.5
dtype: float64
### The dependency of missing 'child' heights on 'father''s heights (MCAR)

heights_mcar['child_missing'] = heights_mcar['child'].isna()
util.create_kde_plotly(heights_mcar[['child_missing', 'father']], 'child_missing', True, False, 'father',
                       "Father's Height by Missingness of Child Height (MCAR example)")
The ks_2samp function from scipy.stats computes the K-S statistic and its p-value for us directly, if we want to use the K-S statistic! (If we want to use the difference of means as our test statistic, we'd have to run a for-loop.)
# 'father' when 'child' is missing.
father_ch_mis = heights_mcar.loc[heights_mcar['child_missing'], 'father']
# 'father' when 'child' is not missing.
father_ch_not_mis = heights_mcar.loc[~heights_mcar['child_missing'], 'father']
from scipy.stats import ks_2samp
ks_2samp(father_ch_mis, father_ch_not_mis)
KstestResult(statistic=0.0728051391862955, pvalue=0.16824323619176823)
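(If we instead wanted the difference of group means as our test statistic, the for-loop permutation test might look like the following sketch. The data here are made-up stand-ins for heights_mcar['father'] and heights_mcar['child_missing'], not the actual heights dataset.)

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical stand-ins: father heights, and a flag for whether the
# corresponding child height is missing.
father = pd.Series(rng.normal(69, 2.5, size=500))
child_missing = pd.Series(rng.random(500) < 0.5)

# Observed difference in group means.
observed = father[child_missing].mean() - father[~child_missing].mean()

diffs = []
for _ in range(1000):
    # Shuffle the missingness labels, breaking any real association.
    shuffled = child_missing.sample(frac=1).reset_index(drop=True)
    diffs.append(father[shuffled].mean() - father[~shuffled].mean())

# Two-sided p-value: proportion of shuffled differences at least as extreme.
p_value = (np.abs(np.array(diffs)) >= abs(observed)).mean()
```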
If the missingness of 'child' is truly unrelated to the distribution of 'father', then the chance of seeing two 'father' distributions that are as or more different than our two observed 'father' distributions is 16.8%. Since this p-value is large, we fail to reject the null: the missingness of 'child' is likely unrelated to the distribution of 'father'.

In this MCAR example, if we were to take the mean of the 'child' column that contains missing values, is the result likely to overestimate, underestimate, or accurately estimate the true mean?
### The dependency of missing 'child' heights on 'father''s heights (MAR)

Is the missingness of 'child' heights dependent on the 'father' column? Below, we drop 'child' heights such that their missingness depends on 'father'.

# Generating MAR data.
heights_mar = util.make_mar_on_num(heights, 'child', 'father', pct=0.75)
heights_mar.isna().mean()
family      0.000000
father      0.000000
mother      0.000000
children    0.000000
number      0.000000
gender      0.000000
child       0.749465
dtype: float64
### The dependency of missing 'child' heights on 'father''s heights (MAR)

heights_mar['child_missing'] = heights_mar['child'].isna()
util.create_kde_plotly(heights_mar[['child_missing', 'father']], 'child_missing', True, False, 'father',
                       "Father's Height by Missingness of Child Height (MAR example)")
Missing 'child' heights tend to come from rows with taller 'father' heights.

In this MAR example, if we were to take the mean of the 'child' column that contains missing values, is the result likely to overestimate, underestimate, or accurately estimate the true mean?
To illustrate, let's generate two datasets with missing 'child' heights – one in which the heights are MCAR, and one in which they are MAR dependent on 'gender' (not 'father', as in our previous example).
In practice, you'll have to run permutation tests to determine the likely missingness mechanism first!
np.random.seed(42) # So that we get the same results each time (for lecture).
heights_mcar = util.make_mcar(heights, 'child', pct=0.5)
heights_mar = util.make_mar_on_cat(heights, 'child', 'gender', pct=0.5)
Below, we compute the means and standard deviations of the 'child' column in all three datasets. Remember, .mean() and .std() ignore missing values.
util.multiple_describe({
'Original': heights,
'MCAR': heights_mcar,
'MAR': heights_mar
})
| Dataset | Mean | Standard Deviation |
| --- | --- | --- |
| Original | 66.745931 | 3.579251 |
| MCAR | 66.640685 | 3.563299 |
| MAR | 68.518844 | 3.115300 |
Observations:

- The 'child' mean (and SD) in the MCAR dataset is very close to the true 'child' mean (and SD).
- The 'child' mean in the MAR dataset is biased high.

Imputation is the act of filling in missing data with plausible values. Ideally, imputation:

These are hard to do at the same time!

There are three main types of imputation, two of which we will focus on today:

Each has upsides and downsides, and each works differently with different types of missingness.
### Mean imputation in the heights dataset

Let's look at two distributions:

- The 'child' column in heights, where we have all the data.
- The 'child' column in heights_mcar, where some values are MCAR.

# Look in util.py to see how multiple_kdes is defined.
util.multiple_kdes({'Original': heights, 'MCAR, Unfilled': heights_mcar})
Since the missing 'child' heights are MCAR, the orange distribution, in which some values are missing, has roughly the same shape as the turquoise distribution, which has no missing values.

Let's fill in missing values in heights_mcar['child'] with the mean of the observed 'child' heights in heights_mcar['child'].
heights_mcar['child'].head()
0    73.2
1    69.2
2     NaN
3     NaN
4    73.5
Name: child, dtype: float64
heights_mcar_mfilled = heights_mcar.fillna(heights_mcar['child'].mean())
heights_mcar_mfilled['child'].head()
0    73.200000
1    69.200000
2    66.640685
3    66.640685
4    73.500000
Name: child, dtype: float64
df_map = {'Original': heights, 'MCAR, Unfilled': heights_mcar, 'MCAR, Mean Imputed': heights_mcar_mfilled}
util.multiple_describe(df_map)
| Dataset | Mean | Standard Deviation |
| --- | --- | --- |
| Original | 66.745931 | 3.579251 |
| MCAR, Unfilled | 66.640685 | 3.563299 |
| MCAR, Mean Imputed | 66.640685 | 2.518282 |
Observations:

- The mean of the mean-imputed column is unchanged from the unfilled mean, and is close to the true mean.
- The standard deviation of the mean-imputed column (2.518282) is noticeably smaller than the true standard deviation (3.579251).

Let's visualize all three distributions: the original, the MCAR heights with missing values, and the mean-imputed MCAR heights.
util.multiple_kdes(df_map)
Takeaway: When data are MCAR and you impute with the mean, the mean of the imputed column is an unbiased estimate of the true mean, but the variance of the imputed column is too small.
### Mean imputation in the heights dataset

Let's look at two distributions:

- The 'child' column in heights, where we have all the data.
- The 'child' column in heights_mar, where some values are MAR.

util.multiple_kdes({'Original': heights, 'MAR, Unfilled': heights_mar})
The distributions are not very similar!
Remember that in reality, you won't get to see the turquoise distribution, which has no missing values – instead, you'll try to recreate it, using your sample with missing values.
Let's fill in missing values in heights_mar['child'] with the mean of the observed 'child' heights in heights_mar['child'] and see what happens.
heights_mar['child'].head()
0    73.2
1    69.2
2     NaN
3     NaN
4    73.5
Name: child, dtype: float64
heights_mar_mfilled = heights_mar.fillna(heights_mar['child'].mean())
heights_mar_mfilled['child'].head()
0    73.200000
1    69.200000
2    68.518844
3    68.518844
4    73.500000
Name: child, dtype: float64
df_map = {'Original': heights, 'MAR, Unfilled': heights_mar, 'MAR, Mean Imputed': heights_mar_mfilled}
util.multiple_describe(df_map)
| Dataset | Mean | Standard Deviation |
| --- | --- | --- |
| Original | 66.745931 | 3.579251 |
| MAR, Unfilled | 68.518844 | 3.115300 |
| MAR, Mean Imputed | 68.518844 | 2.201669 |
Note that the latter two means are biased high.
Let's visualize all three distributions: the original, the MAR heights with missing values, and the mean-imputed MAR heights.
util.multiple_kdes(df_map)
Since the sample with MAR values was already biased high, mean imputation kept the sample biased – it did not bring the data closer to the data generating process.
With our single mean imputation strategy, the resulting female mean height is biased quite high.
pd.concat([
heights.groupby('gender')['child'].mean().rename('Original'),
heights_mar.groupby('gender')['child'].mean().rename('MAR, Unfilled'),
heights_mar_mfilled.groupby('gender')['child'].mean().rename('MAR, Mean Imputed')
], axis=1).T
| gender | female | male |
| --- | --- | --- |
| Original | 64.103974 | 69.234096 |
| MAR, Unfilled | 64.218571 | 69.277078 |
| MAR, Mean Imputed | 67.854342 | 69.144663 |
Since the missingness of 'child' depends on 'gender', we can impute separately for each 'gender': to fill in a missing 'child' height for a 'female' child, impute their height with the mean of the observed 'female' heights.

To do this, we groupby('gender') and use the transform method. (Remember what transform returns!)

def mean_impute(ser):
    return ser.fillna(ser.mean())
heights_mar_cond = heights_mar.groupby('gender')['child'].transform(mean_impute).to_frame()
heights_mar_cond['child'].head()
0    73.200000
1    69.200000
2    64.218571
3    64.218571
4    73.500000
Name: child, dtype: float64
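(As a reminder of what transform returns, here's a tiny made-up example: the result is a Series aligned with the original index, one value per row, with each row filled using its own group's mean.)

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) illustrating groupby(...).transform.
df = pd.DataFrame({'gender': ['female', 'male', 'female', 'male'],
                   'child':  [64.0,     np.nan, 62.0,     70.0]})

# transform returns a Series aligned with df's index; the missing 'male'
# height is filled with the mean of the observed 'male' heights (70.0).
out = df.groupby('gender')['child'].transform(lambda s: s.fillna(s.mean()))
```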
df_map['MAR, Conditional Mean Imputed'] = heights_mar_cond
util.multiple_kdes(df_map)
The pink distribution does a better job of approximating the turquoise distribution than the purple distribution.
Imputing missing data in a column with the mean of the column:

The same is true with other statistics (e.g. median and mode).

An alternative is probabilistic imputation: fill in each missing value by sampling from the observed values, using np.random.choice or .sample.

### Probabilistic imputation in the heights dataset

Step 1: Determine the number of missing values in the column of interest.
num_null = heights_mcar['child'].isna().sum()
num_null
467
Step 2: Sample that number of values from the observed values in the column of interest.
fill_values = np.random.choice(heights_mcar['child'].dropna(), num_null)
Step 3: Fill in the missing values with the sample from Step 2.
heights_mcar_pfilled = heights_mcar.copy()
heights_mcar_pfilled.loc[heights_mcar_pfilled['child'].isna(), 'child'] = fill_values
Let's look at the results.
df_map = {'Original': heights,
'MCAR, Unfilled': heights_mcar,
'MCAR, Probabilistically Imputed': heights_mcar_pfilled}
util.multiple_describe(df_map)
| Dataset | Mean | Standard Deviation |
| --- | --- | --- |
| Original | 66.745931 | 3.579251 |
| MCAR, Unfilled | 66.640685 | 3.563299 |
| MCAR, Probabilistically Imputed | 66.668308 | 3.474865 |
Variance is preserved!
util.multiple_kdes(df_map)
No spikes!

Another option: use a histogram (e.g. np.histogram) to bin the observed data, then sample from the histogram.

Steps:
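(The histogram-binning idea mentioned above might be sketched like this, with made-up stand-in data: pick a bin weighted by its frequency, then draw uniformly within it.)

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-ins: the observed (non-null) values, and the number of
# missing values we need to fill.
observed = rng.normal(67, 3.5, size=400)
num_null = 100

# Bin the observed data.
counts, edges = np.histogram(observed, bins=15)

# Pick a bin for each missing value, weighted by bin frequency...
bins = rng.choice(len(counts), size=num_null, p=counts / counts.sum())

# ...then draw uniformly within each chosen bin.
fill_values = rng.uniform(edges[bins], edges[bins + 1])
```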
Let's try this procedure out on the heights_mcar
dataset.
heights_mcar.head()
|  | family | father | mother | children | number | gender | child |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 78.5 | 67.0 | 4 | 1 | male | 73.2 |
| 1 | 1 | 78.5 | 67.0 | 4 | 2 | female | 69.2 |
| 2 | 1 | 78.5 | 67.0 | 4 | 3 | female | NaN |
| 3 | 1 | 78.5 | 67.0 | 4 | 4 | female | NaN |
| 4 | 2 | 75.5 | 66.5 | 4 | 1 | male | 73.5 |
# This function implements the 3-step process we studied earlier.
def create_imputed(col):
col = col.copy()
num_null = col.isna().sum()
fill_values = np.random.choice(col.dropna(), num_null)
col[col.isna()] = fill_values
return col
Each time we run the following cell, it generates a new imputed version of the 'child'
column.
create_imputed(heights_mcar['child']).head()
0    73.2
1    69.2
2    72.0
3    65.0
4    73.5
Name: child, dtype: float64
Let's run the above procedure 100 times.
mult_imp = pd.concat([create_imputed(heights_mcar['child']).rename(k) for k in range(100)], axis=1)
mult_imp.head()
|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 | ... | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 |
| 1 | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 | ... | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 |
| 2 | 73.0 | 62.5 | 72.0 | 72.0 | 64.5 | 67.0 | 75.0 | 69.0 | 70.0 | 71.0 | ... | 70.5 | 68.0 | 65.5 | 69.2 | 67.5 | 70.0 | 63.0 | 62.0 | 68.5 | 63.5 |
| 3 | 66.0 | 67.0 | 64.0 | 72.0 | 68.0 | 64.0 | 65.0 | 63.0 | 65.0 | 68.5 | ... | 70.0 | 66.0 | 66.0 | 70.0 | 67.0 | 64.7 | 72.0 | 62.0 | 69.0 | 67.0 |
| 4 | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 | ... | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 |

5 rows × 100 columns
Let's plot some of the imputed columns on the previous slide.
# Random sample of 15 imputed columns.
mult_imp_sample = mult_imp.sample(15, axis=1)
fig = ff.create_distplot(mult_imp_sample.to_numpy().T, list(mult_imp_sample.columns), show_hist=False, show_rug=False)
fig.update_xaxes(title='child')
Let's look at the distribution of means across the imputed columns.
px.histogram(pd.DataFrame(mult_imp.mean()), nbins=15, histnorm='probability',
title='Distribution of Imputed Sample Means')
- Dropping rows with missing values: df = df.dropna().
- Mean imputation of a column col: df[col] = df[col].fillna(df[col].mean()).
- Mean imputation of a numerical column c1, conditional on a second categorical column c2:

means = df.groupby('c2')['c1'].transform('mean')
df['c1'] = df['c1'].fillna(means)

- Probabilistic imputation, conditional on a categorical column c2: apply the MCAR probabilistic imputation procedure to the groups of df.groupby(c2).
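(The last bullet might be sketched as follows, with hypothetical column names c1 and c2 and made-up data; the per-group fill is the same 3-step probabilistic procedure from earlier.)

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical frame: numerical 'c1' with missing values, categorical 'c2'.
df = pd.DataFrame({
    'c2': ['a'] * 50 + ['b'] * 50,
    'c1': np.concatenate([rng.normal(60, 2, 50), rng.normal(70, 2, 50)]),
})
df.loc[df.sample(20, random_state=0).index, 'c1'] = np.nan

def prob_impute(col):
    # The MCAR probabilistic procedure, applied within a single group.
    col = col.copy()
    num_null = col.isna().sum()
    col[col.isna()] = np.random.choice(col.dropna(), num_null)
    return col

# Impute within each 'c2' group separately.
df['c1'] = df.groupby('c2')['c1'].transform(prob_impute)
```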