In [1]:

```
import pandas as pd
import numpy as np
from scipy import stats
import os
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-white')
plt.rc('figure', dpi=100, figsize=(10, 5))
plt.rc('font', size=12)
import util
```

- Lab 4 is due **tonight at 11:59PM**.
- Solutions will be released tomorrow.

- The Midterm Exam is **on Wednesday in-person during lecture**.
- See this post for all the details.
- Check the Resources tab for past exams. There's now a walkthrough video of the Spring 2021 Midterm Exam.
- **Important:** Bring official photo ID (UCSD ID preferred); we will be checking.

- Project 2 is due on **Saturday, April 30th at 11:59PM**.
- 🚨 If 80% of the class fills out the **Mid-Quarter Survey**, then everyone will receive an extra point on the Midterm Exam. 🚨
- We've surpassed 80% – but still fill it out if you haven't!

- Review: Missingness mechanisms.
- Overview of imputation.
- Mean imputation.
- Probabilistic imputation.

- **Missing by design (MD):** Whether or not a value is missing depends entirely on the data in other columns. In other words, if we can always predict if a value will be missing given the other columns, the data is MD.
- **Not missing at random (NMAR, also called NI):** The chance that a value is missing **depends on the actual missing value**!
- **Missing at random (MAR):** The chance that a value is missing **depends on other columns**, but **not** the actual missing value itself.
- **Missing completely at random (MCAR):** The chance that a value is missing is **completely independent** of other columns and the actual missing value.

- Recall, the "missing value flowchart" says that we should:
  - First, determine whether values are **missing by design (MD)**.
  - Then, reason about whether values are **not missing at random (NMAR)**.
  - Finally, decide whether values are **missing at random (MAR)** or **missing completely at random (MCAR)**.
- To decide between MAR and MCAR, we can look at the data itself.

In [2]:

```
heights = pd.read_csv(os.path.join('data', 'heights.csv'))
heights = (
heights
.rename(columns={'childHeight': 'child', 'childNum': 'number'})
.drop('midparentHeight', axis=1)
)
heights.head()
```

Out[2]:

|   | family | father | mother | children | number | gender | child |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 78.5 | 67.0 | 4 | 1 | male | 73.2 |
| 1 | 1 | 78.5 | 67.0 | 4 | 2 | female | 69.2 |
| 2 | 1 | 78.5 | 67.0 | 4 | 3 | female | 69.0 |
| 3 | 1 | 78.5 | 67.0 | 4 | 4 | female | 69.0 |
| 4 | 2 | 75.5 | 66.5 | 4 | 1 | male | 73.5 |

### Missingness of `'child'` heights on `'father'`'s heights (MCAR)

**Question:** Is the missingness of `'child'` heights dependent on the `'father'` column?

- To answer, we can look at two distributions:
  - The distribution of `'father'` when `'child'` is missing.
  - The distribution of `'father'` when `'child'` is not missing.
- If the two distributions look similar, then the missingness of `'child'` is not dependent on `'father'`.
- To test whether two distributions look similar, we use a permutation test.

- In `util.py`, there are several functions that we've created to help us with this lecture.
- `make_mcar` takes in a dataset and intentionally drops values from a column such that they are MCAR.
  - You wouldn't actually do this in practice!
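The internals of `util.make_mcar` aren't shown here, but a minimal sketch of such a helper might look like the following (the real implementation may differ; `make_mcar_sketch` is a hypothetical name):

```python
import numpy as np
import pandas as pd

def make_mcar_sketch(df, col, pct=0.5, seed=None):
    """Return a copy of df with pct of `col` set to NaN, completely at random."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    # Choose rows to blank out uniformly at random, independent of all values
    drop_idx = rng.choice(out.index, size=int(pct * len(out)), replace=False)
    out.loc[drop_idx, col] = np.nan
    return out
```

Because the rows are chosen independently of every column, the resulting missingness is MCAR by construction.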

In [3]:

```
# Generating MCAR data
heights_mcar = util.make_mcar(heights, 'child', pct=0.5)
heights_mcar.isna().mean()
```

Out[3]:

family 0.0 father 0.0 mother 0.0 children 0.0 number 0.0 gender 0.0 child 0.5 dtype: float64

In [4]:

```
heights_mcar['child_missing'] = heights_mcar['child'].isna()
(
heights_mcar
.groupby('child_missing')['father']
.plot(kind='kde', legend=True, title="Father's Height by Missingness of Child Height (MCAR example)")
);
```

- To test whether the two distributions are similar, we can use a permutation test.
- The `ks_2samp` function from `scipy.stats` can do the entire permutation test for us, if we want to use the Kolmogorov-Smirnov test statistic!
- If we want to use the difference of means, we'd have to run a `for`-loop (see Lectures 10 and 12 for examples).
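Such a `for`-loop might look like the following sketch, run here on synthetic stand-ins for the two `'father'` distributions (the names `group_mis` and `group_not_mis` are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for 'father' heights, split by whether 'child' is missing
group_mis = rng.normal(69, 2.5, size=200)
group_not_mis = rng.normal(69, 2.5, size=300)

observed = abs(group_mis.mean() - group_not_mis.mean())

pooled = np.concatenate([group_mis, group_not_mis])
n_mis = len(group_mis)
diffs = []
for _ in range(1_000):
    # Shuffle the pooled data and re-split into groups of the original sizes
    shuffled = rng.permutation(pooled)
    diffs.append(abs(shuffled[:n_mis].mean() - shuffled[n_mis:].mean()))

# p-value: fraction of shuffled differences at least as large as the observed one
p_value = np.mean(np.array(diffs) >= observed)
```

Since both synthetic groups come from the same distribution here, we'd expect a large p-value.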

In [5]:

```
# 'father' when 'child' is missing
father_ch_mis = heights_mcar.loc[heights_mcar['child_missing'], 'father']
# 'father' when 'child' is not missing
father_ch_not_mis = heights_mcar.loc[~heights_mcar['child_missing'], 'father']
```

In [6]:

```
stats.ks_2samp(father_ch_mis, father_ch_not_mis)
```

Out[6]:

KstestResult(statistic=0.03640256959314775, pvalue=0.9168468032193087)

- This states that if the missingness of `'child'` is truly unrelated to the distribution of `'father'`, then the chance of seeing two `'father'` distributions that are as or more different than our two observed `'father'` distributions is about 92%.
- We fail to reject the null – it looks like the missingness of `'child'` is likely unrelated to the distribution of `'father'`.

In this MCAR example, if we were to take the mean of the `'child'` column that contains missing values, is the result likely to:

- Overestimate the true mean?
- Underestimate the true mean?
- Be accurate?

In [7]:

```
(
heights_mcar
.groupby('child_missing')['father']
.plot(kind='kde', legend=True, title="Father's Height by Missingness of Child Height (MCAR example)")
);
```

### Missingness of `'child'` heights on `'father'`'s heights (MAR)

**Question:** Is the missingness of `'child'` heights dependent on the `'father'` column?

- We will follow the same procedure as before. The only difference is that the missing values in our simulated data are MAR.

In [8]:

```
# Generating MAR data
heights_mar = util.make_mar_on_num(heights, 'child', 'father', pct=0.75)
heights_mar.isna().mean()
```

Out[8]:

family 0.000000 father 0.000000 mother 0.000000 children 0.000000 number 0.000000 gender 0.000000 child 0.749465 dtype: float64

In [9]:

```
heights_mar['child_missing'] = heights_mar['child'].isna()
(
heights_mar
.groupby('child_missing')['father']
.plot(kind='kde', legend=True, title="Father's Height by Missingness of Child Height (MAR example)")
);
```

- The above picture shows us that missing `'child'` heights tend to correspond to taller `'father'` heights.
- Let's again use a permutation test.

In [10]:

```
# 'father' when 'child' is missing
father_ch_mis = heights_mar.loc[heights_mar['child_missing'], 'father']
# 'father' when 'child' is not missing
father_ch_not_mis = heights_mar.loc[~heights_mar['child_missing'], 'father']
```

In [11]:

```
stats.ks_2samp(father_ch_mis, father_ch_not_mis)
```

Out[11]:

KstestResult(statistic=0.49496947496947497, pvalue=5.551115123125783e-16)

- The p-value of our permutation test is essentially 0.
- We reject the null that the missingness of the `'child'` column is independent of the `'father'` column, and we conclude that `'child'` is MAR dependent on `'father'`.

In this MAR example, if we were to take the mean of the `'child'` column that contains missing values, is the result likely to:

- Overestimate the true mean?
- Underestimate the true mean?
- Be accurate?

In [12]:

```
(
heights_mar
.groupby('child_missing')['father']
.plot(kind='kde', legend=True, title="Father's Height by Missingness of Child Height (MAR example)")
);
```

- Suppose we are interested in a dataset $Y$.
- We get to **observe** $Y_{obs}$, while the rest of the dataset, $Y_{mis}$, is **missing**.
- Issue: $Y_{obs}$ may look quite different than $Y$.
  - The mean and other measures of central tendency may be different.
  - The variance may be different.
  - Correlations between variables may be different.

- Consider a survey with an optional question: "How much do you give to charity?"
- People who give little are less likely to respond.
- Therefore, the average response is **biased high**.

- If the data are MCAR (missing completely at random), then dropping the missing values entirely doesn't significantly change the data.
  - For instance, the mean of the dataset post-dropping is an unbiased estimate of the true mean.
- This is because MCAR data is a **random sample** of the full dataset.
  - From DSC 10, we know that random samples tend to resemble the larger populations they are drawn from.
- **If the data are not MCAR, then dropping the missing values will introduce bias.**
  - MCAR is rare!

- *Listwise deletion* is the act of dropping entire rows that contain missing values.
- Issue: This can delete perfectly good data in other columns for a given row.
- Improvement: Drop missing data only when working with the column that contains missing data.
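A tiny illustration of the difference, on hypothetical data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, np.nan]})

# Listwise deletion: any row with a missing value anywhere is dropped
listwise = df.dropna()  # only row 0 survives

# Improvement: drop missing values only in the column under analysis
a_mean = df['a'].dropna().mean()  # uses rows 0 and 2
```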

To illustrate, let's generate another dataset with missing values.

In [13]:

```
np.random.seed(42) # So that we get the same results each time (for lecture)
heights_mcar = util.make_mcar(heights, 'child', pct=0.50)
heights_mar = util.make_mar_on_cat(heights, 'child', 'gender', pct=0.50)
```

The true `'child'` mean with all of the data is as follows.

In [14]:

```
heights['child'].mean()
```

Out[14]:

66.74593147751597

The `'child'` mean in the MCAR dataset is very close to the true `'child'` mean:

In [15]:

```
heights_mcar['child'].dropna().mean()
```

Out[15]:

66.64068522483944

In [16]:

```
# Note that .mean() automatically drops nulls, so this expression is the same as the one above
heights_mcar['child'].mean()
```

Out[16]:

66.64068522483944

The `'child'` mean in the MAR dataset is quite biased. Note that this is not the same example as before.

In [17]:

```
heights_mar['child'].mean()
```

Out[17]:

68.51884368308356

**Imputation** is the act of filling in missing data with plausible values. Ideally, imputation:

- is quick and easy to do.
- shouldn't introduce bias into the dataset.

These are hard to satisfy!

There are three main types of imputation, two of which we will focus on today:

- **Imputation with a single value: mean, median, mode.**
- Imputation with a single value, using a model: regression, kNN.
- **Probabilistic imputation by drawing from a distribution.**

Each has upsides and downsides, and **each works differently with different types of missingness**.

- Mean imputation is the act of filling in missing values in a column with the mean of the observed values in that column.
- This strategy:
- 👍 Preserves the mean of the observed data, for all types of missingness.
- 👎 Decreases the variance of the data, for all types of missingness.
- 👎 Creates a biased estimate of the true mean when the data are not MCAR.

### Mean imputation in the `heights` dataset (MCAR)

Let's look at two distributions:

- The distribution of **all** `'child'` heights.
- The distribution of the `'child'` heights that have MCAR values.

In [18]:

```
heights_mcar['child'].head()
```

Out[18]:

0 73.2 1 69.2 2 NaN 3 NaN 4 73.5 Name: child, dtype: float64

In [19]:

```
# Note that this is **not** a density histogram!
plt.hist([heights['child'], heights_mcar['child'].dropna()])
plt.legend(['full data', 'missing (mcar)']);
```

- Since the data is MCAR, the blue distribution has the same shape as the orange distribution.
- Let's fill in missing values with the mean of the observed `'child'` heights.

In [20]:

```
heights_mcar['child'].head()
```

Out[20]:

0 73.2 1 69.2 2 NaN 3 NaN 4 73.5 Name: child, dtype: float64

In [21]:

```
heights_mcar_mfilled = heights_mcar.fillna(heights_mcar['child'].mean())
heights_mcar_mfilled['child'].head()
```

Out[21]:

0 73.200000 1 69.200000 2 66.640685 3 66.640685 4 73.500000 Name: child, dtype: float64

- Note that the mean of the full set of heights is very close to the mean of the subset of heights that weren't missing.
- Also note that the mean of the imputed dataset is the same as the mean of the subset of heights that weren't missing.

In [22]:

```
print(
'mean (original): %f' % heights['child'].mean(),
'mean (missing): %f' % heights_mcar['child'].mean(),
'mean (mean imp): %f' % heights_mcar_mfilled['child'].mean(),
sep='\n'
)
```

mean (original): 66.745931 mean (missing): 66.640685 mean (mean imp): 66.640685

- Why is the standard deviation of the imputed dataset smaller than either of the other two?

In [23]:

```
print(
'std (original): %f' % heights['child'].std(),
'std (missing): %f' % heights_mcar['child'].std(),
'std (mean imp): %f' % heights_mcar_mfilled['child'].std(),
sep='\n'
)
```

std (original): 3.579251 std (missing): 3.563299 std (mean imp): 2.518282

Let's take a look at all three distributions: the original, the MCAR heights with missing values, and the imputed MCAR heights.

In [24]:

```
plt.hist([heights['child'], heights_mcar['child'].dropna(), heights_mcar_mfilled['child']])
plt.legend(['full data', 'missing (mcar)', 'imputed']);
```

**Takeaway:** When data are MCAR and you impute with the mean:

- The mean of the imputed dataset is an unbiased estimator of the true mean.
- The variance of the imputed dataset is smaller than the variance of the full dataset.
- Mean imputation tricks you into thinking your data are more reliable than they are!

### Mean imputation in the `heights` dataset (MAR)

- When data are MAR, mean imputation leads to biased estimates of the mean across groups.
- The bias may be different in different groups.
  - For example: If the missingness depends on gender, then different genders will have differently-biased means.
  - The overall mean will be biased towards one group.

In [25]:

```
np.random.seed(42) # So that we get the same results each time (for lecture)
heights_mar_cat = util.make_mar_on_cat(heights, 'child', 'gender', pct=0.50)
heights_mar_cat['child'].head()
```

Out[25]:

0 NaN 1 69.2 2 69.0 3 69.0 4 NaN Name: child, dtype: float64

Again, let's look at two distributions:

- The distribution of **all** `'child'` heights.
- The distribution of the `'child'` heights that have MAR values.

In [26]:

```
# The observed vs true distribution
plt.hist([heights['child'], heights_mar_cat['child']])
plt.legend(['full data', 'missing (mar)']);
```

- Let's impute with the mean.

In [27]:

```
heights_mar_cat_mfilled = heights_mar_cat.fillna(heights_mar_cat['child'].mean())
heights_mar_cat_mfilled
```

Out[27]:

|   | family | father | mother | children | number | gender | child |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 78.5 | 67.0 | 4 | 1 | male | 64.999358 |
| 1 | 1 | 78.5 | 67.0 | 4 | 2 | female | 69.200000 |
| 2 | 1 | 78.5 | 67.0 | 4 | 3 | female | 69.000000 |
| 3 | 1 | 78.5 | 67.0 | 4 | 4 | female | 69.000000 |
| 4 | 2 | 75.5 | 66.5 | 4 | 1 | male | 64.999358 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 929 | 203 | 62.0 | 66.0 | 3 | 1 | male | 64.999358 |
| 930 | 203 | 62.0 | 66.0 | 3 | 2 | female | 62.000000 |
| 931 | 203 | 62.0 | 66.0 | 3 | 3 | female | 64.999358 |
| 932 | 204 | 62.5 | 63.0 | 2 | 1 | male | 64.999358 |
| 933 | 204 | 62.5 | 63.0 | 2 | 2 | female | 57.000000 |

934 rows × 7 columns

- Note that the latter two means are biased **low**.

In [28]:

```
print(
'mean (original): %f' % heights['child'].mean(),
'mean (missing): %f' % heights_mar_cat['child'].mean(),
'mean (mean imp): %f' % heights_mar_cat_mfilled['child'].mean(),
sep='\n'
)
```

mean (original): 66.745931 mean (missing): 64.999358 mean (mean imp): 64.999358

In [29]:

```
print(
'std (original): %f' % heights.child.std(),
'std (missing): %f' % heights_mar_cat.child.std(),
'std (mean imp): %f' % heights_mar_cat_mfilled.child.std(),
sep='\n'
)
```

std (original): 3.579251 std (missing): 3.166655 std (mean imp): 2.237963

In [30]:

```
plt.hist([heights['child'], heights_mar_cat['child'], heights_mar_cat_mfilled['child']]);
plt.legend(['full data', 'missing (mar)', 'imputed']);
```

- Since the sample with MAR values was already biased low, mean imputation kept the sample biased – it did not bring the data **closer to the data generating process**.

- **Improvement:** Since MAR data are MCAR within each group, we can perform group-wise mean imputation.
  - In our case, since the missingness of `'child'` is dependent on `'gender'`, we can impute separately for each `'gender'`.
  - For instance, if there is a missing `'child'` height for a `'female'` child, impute their height with the mean observed `'female'` height.
- With this technique, the overall mean remains unbiased, as do the within-group means.
- Like single mean imputation, the variance of the dataset is reduced.

In [31]:

```
pd.concat([
heights.groupby('gender')['child'].mean().rename('full'),
heights_mar_cat.groupby('gender')['child'].mean().rename('missing (mar)'),
heights_mar_cat_mfilled.groupby('gender')['child'].mean().rename('naively imputed')
], axis=1)
```

Out[31]:

| gender | full | missing (mar) | naively imputed |
|---|---|---|---|
| female | 64.103974 | 64.011024 | 64.168110 |
| male | 69.234096 | 69.377907 | 65.782217 |

Note that with our single mean imputation strategy, the resulting male mean height is biased quite low.

- In MAR data, imputation by the overall mean gives a biased estimate of the mean of each group.
- To obtain an unbiased estimate of the mean within each group, impute using the mean within each group.
- How do we implement this?
  - Remember, our setting is that `'child'` heights are MAR dependent on `'gender'`.
  - Remember that unconditional mean imputation is implemented with `heights['child'].fillna(heights['child'].mean())`.

In [32]:

```
def mean_impute(ser):
return ser.fillna(ser.mean())
heights_mar_cat.groupby('gender')['child'].transform(mean_impute)
```

Out[32]:

0 69.377907 1 69.200000 2 69.000000 3 69.000000 4 69.377907 ... 929 69.377907 930 62.000000 931 64.011024 932 69.377907 933 57.000000 Name: child, Length: 934, dtype: float64

Imputing missing data in a column with the mean of the column:

- faithfully reproduces the mean of the observed dataset,
- reduces the variance, and
- biases relationships between the column and other columns if the data are not MCAR.

The same is true with other statistics (e.g. median and mode).
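For instance, median and mode imputation are the same one-liner with a different statistic (a sketch on toy data):

```python
import numpy as np
import pandas as pd

ser = pd.Series([60.0, 62.0, np.nan, 70.0, np.nan])

# Median imputation: fill every missing value with the observed median
median_filled = ser.fillna(ser.median())

# Mode imputation (also works for categorical columns)
mode_filled = ser.fillna(ser.mode().iloc[0])
```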

- The US Census asks individuals for their salaries, and some individuals don't respond.
- Suppose we impute missing salaries with the mean overall salary.
- Is there more bias in:
- (low-paying) service jobs or
- (high-paying) executive jobs?

**Hint:** What does the distribution of incomes look like? Where is the mean/median?

- So far, each missing value in a column has been filled in with a constant value.
  - This creates "spikes" in the imputed distributions.
- **Idea:** We can **probabilistically** impute missing data from a distribution.
  - We can fill in missing data by drawing from the distribution of the **non-missing** data.
  - There are 5 missing values? Pick 5 values from the data that aren't missing.
  - How? Using `.sample`.

### Probabilistic imputation in the `heights` dataset

Steps:

- Figure out the number of missing values.
- Sample that number of values from the observed dataset.
- Fill in the missing values with the sample from Step 2.

Step 1: Figure out the number of missing values.

In [33]:

```
num_null = heights_mcar['child'].isna().sum()
num_null
```

Out[33]:

467

Step 2: Sample that number of values from the observed dataset.

In [34]:

```
fill_values = heights_mcar.child.dropna().sample(num_null, replace=True)
fill_values
```

Out[34]:

253 73.0 640 66.0 50 70.5 288 67.0 27 64.0 ... 655 73.5 855 61.0 930 62.0 16 69.0 691 64.0 Name: child, Length: 467, dtype: float64

Step 3: Fill in the missing values with the sample from Step 2.

In [35]:

```
# Find the positions where values in heights_mcar are missing
fill_values.index = heights_mcar.loc[heights_mcar['child'].isna()].index
# Fill in the missing values
heights_mcar_dfilled = heights_mcar.fillna({'child': fill_values.to_dict()}) # fill the vals
```

Let's look at the results.

In [36]:

```
print(
'mean (original): %f' % heights['child'].mean(),
'mean (missing): %f' % heights_mcar['child'].mean(),
'mean (distr imp): %f' % heights_mcar_dfilled['child'].mean(),
sep='\n'
)
```

mean (original): 66.745931 mean (missing): 66.640685 mean (distr imp): 66.659529

In [37]:

```
print(
'std (original): %f' % heights['child'].std(),
'std (missing): %f' % heights_mcar['child'].std(),
'std (distr imp): %f' % heights_mcar_dfilled['child'].std(),
sep='\n'
)
```

std (original): 3.579251 std (missing): 3.563299 std (distr imp): 3.501084

Variance is preserved!

In [38]:

```
plt.hist([heights['child'], heights_mcar['child'], heights_mcar_dfilled['child']], density=True);
plt.legend(['full data','missing (mcar)', 'distr imputed']);
```

No spikes!

- With this technique, the missing values were filled in with observed values in the dataset.
- If a value was never observed in the dataset, it will never be used to fill in a missing value.
  - For instance, if the observed heights were 68, 69, and 69.5 inches, we will never fill a missing value with 68.5 inches even though it's a perfectly reasonable height.
- Solution? Create a histogram (with `np.histogram`) to bin the data, then sample from the histogram.
  - See Lab 5, Question 4.

**Question:** How would we generalize this process for MAR data?
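One answer, sketched on hypothetical data: since MAR data are MCAR within each group, sample from the observed values **within each group** using `groupby` and `transform` (the helper name `prob_impute` is illustrative):

```python
import numpy as np
import pandas as pd

def prob_impute(ser, seed=0):
    """Fill missing values in ser by sampling from its observed values."""
    rng = np.random.default_rng(seed)
    out = ser.copy()
    observed = out.dropna().to_numpy()
    # Draw one observed value per missing entry, with replacement
    out[out.isna()] = rng.choice(observed, size=out.isna().sum())
    return out

# Hypothetical MAR-on-'gender' data: impute 'child' within each 'gender' group
df = pd.DataFrame({
    'gender': ['f', 'f', 'f', 'm', 'm', 'm'],
    'child':  [62.0, np.nan, 63.0, 70.0, np.nan, 71.0],
})
filled = df.groupby('gender')['child'].transform(prob_impute)
```

Each missing `'female'` height is drawn only from observed `'female'` heights, and likewise for `'male'`.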

- Unlike mean imputation, probabilistic imputation is **random** – each time you run the cell in which imputation is performed, the results could be different.
- **Multiple imputation:** Generate multiple imputed datasets and aggregate the results!
  - Similar to bootstrapping.

Steps:

- Start with observed and incomplete data.
- Create several **imputed** versions of the data through a probabilistic procedure.
  - The imputed datasets are identical for the observed data entries.
  - They differ in the imputed values.
  - The differences reflect our **uncertainty** about what value to impute.
- Then, estimate the parameters of interest for **each** imputed dataset.
  - For instance, the mean, standard deviation, median, etc.
- Finally, pool the $m$ parameter estimates into one estimate.

Let's try this procedure out on the `heights_mcar` dataset.

In [39]:

```
heights_mcar.head()
```

Out[39]:

|   | family | father | mother | children | number | gender | child |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 78.5 | 67.0 | 4 | 1 | male | 73.2 |
| 1 | 1 | 78.5 | 67.0 | 4 | 2 | female | 69.2 |
| 2 | 1 | 78.5 | 67.0 | 4 | 3 | female | NaN |
| 3 | 1 | 78.5 | 67.0 | 4 | 4 | female | NaN |
| 4 | 2 | 75.5 | 66.5 | 4 | 1 | male | 73.5 |

In [40]:

```
# This function implements the 3-step process we studied earlier
def create_imputed(col):
num_null = col.isna().sum()
fill_values = col.dropna().sample(num_null, replace=True)
fill_values.index = col.loc[col.isna()].index
return col.fillna(fill_values.to_dict())
```

Each time we run the following cell, it generates a new imputed version of the `'child'` column.

In [41]:

```
create_imputed(heights_mcar['child']).head()
```

Out[41]:

0 73.2 1 69.2 2 62.0 3 65.0 4 73.5 Name: child, dtype: float64

Let's run the above procedure 100 times.

In [42]:

```
mult_imp = pd.concat([create_imputed(heights_mcar['child']).rename(k) for k in range(100)], axis=1)
mult_imp.head()
```

Out[42]:

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 | ... | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 |
| 1 | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 | ... | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 |
| 2 | 66.0 | 68.0 | 65.0 | 63.0 | 70.0 | 68.5 | 74.0 | 70.0 | 71.0 | 69.0 | ... | 68.7 | 68.0 | 70.0 | 62.0 | 65.0 | 67.0 | 65.5 | 71.0 | 70.2 | 72.7 |
| 3 | 70.5 | 64.0 | 62.0 | 70.0 | 69.0 | 63.0 | 64.5 | 63.0 | 64.5 | 66.5 | ... | 72.0 | 67.0 | 69.0 | 60.0 | 65.0 | 66.0 | 73.0 | 72.0 | 60.0 | 64.0 |
| 4 | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 | ... | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 |

5 rows × 100 columns

Let's plot some of the imputed columns above.

In [43]:

```
# Random sample of 15 imputed columns
mult_imp.sample(15, axis=1).plot(kind='kde', alpha=0.5, legend=False);
```
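The final, pooling step isn't shown above; on a synthetic stand-in for `mult_imp`, it might look like this:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for mult_imp: 100 imputed versions of a 934-row column
mult_imp = pd.DataFrame(rng.normal(66.7, 3.6, size=(934, 100)))

# Estimate the parameter of interest (here, the mean) on each imputed dataset...
per_dataset_means = mult_imp.mean(axis=0)

# ...then pool the per-dataset estimates into a single estimate
pooled_mean = per_dataset_means.mean()
```

The spread of `per_dataset_means` also gives a sense of our uncertainty due to imputation.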