In [1]:

```
from dsc80_utils import *
```

In [2]:

```
import lec08_utils as util
```

## 📣 Announcements 📣¶

- Good job on Project 2 checkpoint!
- Lab 4 due Monday.
- The midterm exam is next week, on Thursday, Nov 2.

## 📝 Midterm Exam¶

- Thurs, Nov 2 from 3:30-4:50pm in WLH 2005.
- Pen and paper only. No calculators, phones, or watches allowed.
- You are allowed to bring one double-sided 8.5" x 11" sheet of handwritten notes.
- No reference sheet given, unlike DSC 10!

- We will display clarifications and the time remaining during the exam.
- Covers Lectures 1-8, Labs 1-4, and Projects 1-2.
- To review problems from old exams, go to practice.dsc80.com.
- Also look at the Resources tab on the course website.

## 📆 Agenda¶

- Review of missingness mechanisms
- Deciding between MCAR and MAR using a hypothesis test
- The Kolmogorov-Smirnov test statistic

- Mean Imputation
- Probabilistic Imputation
- Multiple Imputation

## Review: Missingness mechanisms¶

A good strategy is to assess missingness in the following order.

**Missing by design (MD)**

*Can I determine the missing value exactly by looking at the other columns?* 🤔

**Not missing at random (NMAR)**

*Is there a good reason why the missingness depends on the values themselves?* 🤔

**Missing at random (MAR)**

*Do other columns tell me anything about the likelihood that a value is missing?* 🤔

**Missing completely at random (MCAR)**

*The missingness must not depend on other columns or the values themselves.* 😄

## Review: Assessing missingness through data¶

### Example: Heights¶

- Let's load in Galton's dataset containing the heights of adult children and their parents (which you may have seen in DSC 10).
- The dataset does not contain any missing values – we will **artificially introduce missing values** such that the values are MCAR, for illustration.

In [3]:

```
heights = pd.read_csv('data/midparent.csv')
heights = heights.rename(columns={'childHeight': 'child'})
heights = heights[['father', 'mother', 'gender', 'child']]
heights.head()
```

Out[3]:

|   | father | mother | gender | child |
|---|--------|--------|--------|-------|
| 0 | 78.5 | 67.0 | male | 73.2 |
| 1 | 78.5 | 67.0 | female | 69.2 |
| 2 | 78.5 | 67.0 | female | 69.0 |
| 3 | 78.5 | 67.0 | female | 69.0 |
| 4 | 75.5 | 66.5 | male | 73.5 |

### Simulating MCAR data¶

- We will make `'child'` MCAR by taking a random subset of `heights` and setting the corresponding `'child'` heights to `np.nan`.
- This is equivalent to flipping a (biased) coin for each row. If heads, we delete the `'child'` height.
- **You will not do this in practice!**

In [4]:

```
np.random.seed(42) # So that we get the same results each time (for lecture).
heights_mcar = heights.copy()
idx = heights_mcar.sample(frac=0.3).index
heights_mcar.loc[idx, 'child'] = np.nan  # np.NaN was removed in NumPy 2.0.
```
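The coin-flip framing above can also be written directly. A minimal sketch on a hypothetical five-row frame (the data values are made up for illustration); note that unlike `sample(frac=0.3)`, flipping independent coins makes the *number* of missing values itself random:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical five-row frame standing in for heights.
df = pd.DataFrame({'child': [73.2, 69.2, 69.0, 69.0, 73.5]})

# Flip a biased coin (heads with probability 0.3) for each row,
# and delete the 'child' height wherever it lands heads.
heads = rng.random(len(df)) < 0.3
df.loc[heads, 'child'] = np.nan
print(df)
```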

In [5]:

```
heights_mcar.head(10)
```

Out[5]:

|   | father | mother | gender | child |
|---|--------|--------|--------|-------|
| 0 | 78.5 | 67.0 | male | 73.2 |
| 1 | 78.5 | 67.0 | female | 69.2 |
| 2 | 78.5 | 67.0 | female | NaN |
| ... | ... | ... | ... | ... |
| 7 | 75.5 | 66.5 | female | NaN |
| 8 | 75.0 | 64.0 | male | 71.0 |
| 9 | 75.0 | 64.0 | female | 68.0 |

10 rows × 4 columns

In [6]:

```
heights_mcar.isna().mean()
```

Out[6]:

```
father    0.0
mother    0.0
gender    0.0
child     0.3
dtype: float64
```

Aside: Why is the value for `'child'` in the above Series not exactly 0.3?
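A small sketch of why: `sample(frac=0.3)` selects `round(0.3 * n)` rows, which is rarely exactly 30% of `n`. On a hypothetical 7-row frame:

```python
import numpy as np
import pandas as pd

# Hypothetical 7-row frame: 0.3 * 7 = 2.1, which pandas rounds to 2 rows.
df = pd.DataFrame({'x': np.arange(7.0)})
idx = df.sample(frac=0.3, random_state=0).index
df.loc[idx, 'x'] = np.nan

# The realized missing fraction is 2/7 ≈ 0.286, not 0.3.
print(df['x'].isna().mean())
```

The displayed `0.3` above is just rounding in the printed output.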

### Verifying that child heights are MCAR in `heights_mcar`¶

- Each row of `heights_mcar` belongs to one of two **groups**:
  - Group 1: `'child'` is missing.
  - Group 2: `'child'` is not missing.

In [7]:

```
heights_mcar['child_missing'] = heights_mcar['child'].isna()
heights_mcar.head()
```

Out[7]:

|   | father | mother | gender | child | child_missing |
|---|--------|--------|--------|-------|---------------|
| 0 | 78.5 | 67.0 | male | 73.2 | False |
| 1 | 78.5 | 67.0 | female | 69.2 | False |
| 2 | 78.5 | 67.0 | female | NaN | True |
| 3 | 78.5 | 67.0 | female | 69.0 | False |
| 4 | 75.5 | 66.5 | male | 73.5 | False |

- We need to look at the distributions of every other column – `'gender'`, `'mother'`, and `'father'` – separately for these two groups, and check to see if they are similar.

### Comparing null and non-null `'child'` distributions for `'gender'`¶

In [8]:

```
gender_dist = (
heights_mcar
.assign(child_missing=heights_mcar['child'].isna())
.pivot_table(index='gender', columns='child_missing', aggfunc='size')
)
# Added just to make the resulting pivot table easier to read.
gender_dist.columns = ['child_missing = False', 'child_missing = True']
gender_dist = gender_dist / gender_dist.sum()
gender_dist
```

Out[8]:

| gender | child_missing = False | child_missing = True |
|--------|-----------------------|----------------------|
| female | 0.49 | 0.48 |
| male | 0.51 | 0.52 |

Note that here, each column is a separate distribution that adds to 1.

- The two columns look similar, which is evidence that `'child'`'s missingness does not depend on `'gender'`.
  - Knowing that the child is `'female'` doesn't make it any more or less likely that their height is missing than knowing if the child is `'male'`.

### Comparing null and non-null `'child'` distributions for `'gender'`¶

- In the previous slide, we saw that the distribution of `'gender'` is similar whether or not `'child'` is missing.
- To make precise what we mean by "similar", we can run a **permutation test**. We are comparing two distributions:
  - The distribution of `'gender'` when `'child'` is missing.
  - The distribution of `'gender'` when `'child'` is not missing.
- What test statistic do we use to compare categorical distributions?

In [9]:

```
gender_dist.plot(kind='barh', title='Gender by Missingness of Child Height', barmode='group')
```
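One natural choice for comparing two categorical distributions (as in DSC 10) is the **total variation distance (TVD)**: half the sum of the absolute differences between the two distributions. A minimal permutation-test sketch, using synthetic stand-in data (the real analysis would use `heights_mcar` itself):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic stand-in for heights_mcar: a gender column and a missingness flag,
# drawn independently so the missingness is MCAR by construction.
df = pd.DataFrame({
    'gender': rng.choice(['female', 'male'], size=500),
    'child_missing': rng.choice([True, False], size=500, p=[0.3, 0.7]),
})

def tvd_of_groups(df):
    # Distribution of 'gender' within each missingness group (columns sum to 1).
    dist = df.pivot_table(index='gender', columns='child_missing',
                          aggfunc='size', fill_value=0)
    dist = dist / dist.sum()
    # TVD: half the sum of absolute differences between the two columns.
    return dist.diff(axis=1).iloc[:, -1].abs().sum() / 2

observed = tvd_of_groups(df)

# Permutation test: shuffle the missingness labels, recompute the TVD.
shuffled_tvds = []
for _ in range(500):
    shuffled = df.assign(child_missing=rng.permutation(df['child_missing']))
    shuffled_tvds.append(tvd_of_groups(shuffled))

# Large p-value => no evidence against MCAR (w.r.t. this column).
p_value = (np.array(shuffled_tvds) >= observed).mean()
print(observed, p_value)
```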