from dsc80_utils import *
Announcements 📣¶
- Project 2 checkpoint due tomorrow. No extensions on project checkpoints!
- Lab 5 is due on Wed, May 1.
- Scores for Lab 1, 2, 3, and Project 1 are available on Gradescope.
Midterm Exam 📝¶
Thursday, May 2nd during (your official) lecture time in Peterson Hall 103.
- Pen and paper only. No calculators, phones, or watches allowed.
- You will be assigned a seat!
- You are allowed to bring one double-sided 8.5" x 11" sheet of handwritten notes.
- No reference sheet given, unlike DSC 10!
- We will display clarifications and the time remaining during the exam.
- Covers Lectures 1-8 and all related assignments.
- To review problems from old exams, go to practice.dsc80.com.
- Also look at the Resources tab on the course website.
Agenda 📆¶
- Review: Missingness mechanisms.
- Identifying missingness mechanisms in data.
- How do we decide between MCAR and MAR using a permutation test?
- The Kolmogorov-Smirnov test statistic.
- Imputation.
- Mean imputation.
- Probabilistic imputation.
Review: Missingness mechanisms¶
Flowchart¶
A good strategy is to assess missingness in the following order.
Question 🤔 (Answer at q.dsc80.com)
Taken from the Winter 2023 DSC 80 Midterm Exam.
The DataFrame tv_excl contains all of the information we have for TV shows that are only available for streaming on a single streaming service.
Given no other information other than a TV show’s "Title" and "IMDb" rating, what is the most likely missingness mechanism of the "IMDb" column?
A. Missing by design
B. Not missing at random
C. Missing at random
D. Missing completely at random
Question 🤔 (Answer at q.dsc80.com)
Taken from the Winter 2023 DSC 80 Midterm Exam.
Now, suppose we discover that the median "Rotten Tomatoes" rating among TV shows with a missing "IMDb" rating is a 13, while the median "Rotten Tomatoes" rating among TV shows with a present "IMDb" rating is a 52.
Given this information, what is the most likely missingness mechanism of the "IMDb" column?
A. Missing by design
B. Not missing at random
C. Missing at random
D. Missing completely at random
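A comparison like the one in this question can be computed by grouping on whether "IMDb" is missing. The sketch below uses a small made-up DataFrame named tv_toy (a hypothetical stand-in, not the exam's tv_excl), so only the pattern matters, not the numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT the exam's tv_excl DataFrame.
tv_toy = pd.DataFrame({
    'Title': ['A', 'B', 'C', 'D', 'E', 'F'],
    'IMDb': [np.nan, 7.1, np.nan, 8.3, 6.5, np.nan],
    'Rotten Tomatoes': [10, 60, 15, 80, 45, 20],
})

# Median 'Rotten Tomatoes' rating, computed separately for rows where
# 'IMDb' is present (False) and rows where 'IMDb' is missing (True).
medians = tv_toy.groupby(tv_toy['IMDb'].isna())['Rotten Tomatoes'].median()
print(medians)
```

The two groups here are defined by the missingness of one column, while the statistic is computed on a different column. A large gap between the groups is evidence that missingness is related to that other column.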
Question 🤔 (Answer at q.dsc80.com)
Suppose Sam collects the blood pressures of 30 people in January. Then, in February, he asks a subset of the people to come back for a second reading (which means that there are missing blood pressures for February). What are the missingness mechanisms for the blood pressures in February in the following situations?
- Sam uses np.random.choice() to select 7 individuals from January.
- Sam asks the individuals who had hypertension (blood pressure > 140) in January to come back in February.
- Sam asks everyone to come back for a second reading in February, but only records the data for people who had hypertension (blood pressure > 140).
Identifying missingness mechanisms in data¶
Example: Heights¶
- Let's load in Galton's dataset containing the heights of adult children and their parents (which you may have seen in DSC 10).
- The dataset does not contain any missing values – we will artificially introduce missing values such that the values are MCAR, for illustration.
heights_path = Path('data') / 'midparent.csv'
heights = pd.read_csv(heights_path).rename(columns={'childHeight': 'child'})[['father', 'mother', 'gender', 'child']]
heights.head()
| | father | mother | gender | child |
|---|---|---|---|---|
| 0 | 78.5 | 67.0 | male | 73.2 |
| 1 | 78.5 | 67.0 | female | 69.2 |
| 2 | 78.5 | 67.0 | female | 69.0 |
| 3 | 78.5 | 67.0 | female | 69.0 |
| 4 | 75.5 | 66.5 | male | 73.5 |
Simulating MCAR data¶
- We will make 'child' MCAR by taking a random subset of heights and setting the corresponding 'child' heights to np.nan.
- This is equivalent to flipping a (biased) coin for each row.
  - If heads, we delete the 'child' height.
- You will not do this in practice!
np.random.seed(42) # So that we get the same results each time (for lecture).
heights_mcar = heights.copy()
idx = heights_mcar.sample(frac=0.3).index
heights_mcar.loc[idx, 'child'] = np.nan # np.nan works in all NumPy versions; the np.NaN alias was removed in NumPy 2.0.
heights_mcar.head(10)
| | father | mother | gender | child |
|---|---|---|---|---|
| 0 | 78.5 | 67.0 | male | 73.2 |
| 1 | 78.5 | 67.0 | female | 69.2 |
| 2 | 78.5 | 67.0 | female | NaN |
| ... | ... | ... | ... | ... |
| 7 | 75.5 | 66.5 | female | NaN |
| 8 | 75.0 | 64.0 | male | 71.0 |
| 9 | 75.0 | 64.0 | female | 68.0 |

10 rows × 4 columns
heights_mcar.isna().mean()
father    0.0
mother    0.0
gender    0.0
child     0.3
dtype: float64
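The coin-flip description of MCAR can also be sketched directly. The example below is a minimal illustration on made-up data (the toy DataFrame and its values are hypothetical, not Galton's dataset). Unlike sample(frac=0.3), which removes exactly 30% of the values, independent coin flips remove roughly 30%.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # Seeded only so that the sketch is reproducible.

# Hypothetical stand-in for the heights DataFrame (made-up values).
toy = pd.DataFrame({
    'father': rng.normal(69, 2.5, size=1000).round(1),
    'child': rng.normal(67, 3.5, size=1000).round(1),
})

# Flip a biased coin for each row: "heads" with probability 0.3.
heads = rng.random(len(toy)) < 0.3

# Delete 'child' wherever the coin landed heads.
toy_mcar = toy.copy()
toy_mcar.loc[heads, 'child'] = np.nan

# Proportion of missing 'child' values: close to 0.3, but usually not exactly 0.3.
print(toy_mcar['child'].isna().mean())
```

Because the coin flips never look at any column of the data, the chance that a value is missing cannot depend on 'father', 'child', or anything else, which is what makes the result MCAR.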
Verifying that child heights are MCAR in heights_mcar¶
- Each row of heights_mcar belongs to one of two groups:
  - Group 1: 'child' is missing.
  - Group 2: 'child' is not missing.
heights_mcar['child_missing'] = heights_mcar['child'].isna()
heights_mcar.head()
| | father | mother | gender | child | child_missing |
|---|---|---|---|---|---|
| 0 | 78.5 | 67.0 | male | 73.2 | False |
| 1 | 78.5 | 67.0 | female | 69.2 | False |
| 2 | 78.5 | 67.0 | female | NaN | True |
| 3 | 78.5 | 67.0 | female | 69.0 | False |
| 4 | 75.5 | 66.5 | male | 73.5 | False |
- We need to look at the distributions of every other column – 'gender', 'mother', and 'father' – separately for these two groups, and check to see if they are similar.
gender_dist = (
    heights_mcar
    .assign(child_missing=heights_mcar['child'].isna())
    .pivot_table(index='gender', columns='child_missing', aggfunc='size')
)
# Added just to make the resulting pivot table easier to read.
gender_dist.columns = ['child_missing = False', 'child_missing = True']
gender_dist = gender_dist / gender_dist.sum()
gender_dist
| gender | child_missing = False | child_missing = True |
|---|---|---|
| female | 0.49 | 0.48 |
| male | 0.51 | 0.52 |
gender_dist.plot(kind='barh', title='Gender by Missingness of Child Height (MCAR Example)', barmode='group')
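Eyeballing the two gender distributions can be made more formal with a permutation test, as the agenda suggests. The sketch below is one possible version, under stated assumptions: it uses made-up data in place of heights_mcar, the total variation distance (TVD) as the test statistic, and a hypothetical helper named tvd_of_groups. It is an illustration of the idea, not necessarily the exact procedure used in lecture.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # Seeded only so that the sketch is reproducible.

# Hypothetical stand-in for heights_mcar: missingness is assigned
# independently of 'gender', so the toy data is MCAR by construction.
n = 900
toy = pd.DataFrame({
    'gender': rng.choice(['female', 'male'], size=n),
    'child_missing': rng.random(n) < 0.3,
})

def tvd_of_groups(df):
    """TVD between the 'gender' distributions of the two missingness groups."""
    # Each column (child_missing = False / True) is normalized to sum to 1.
    dists = pd.crosstab(df['gender'], df['child_missing'], normalize='columns')
    return (dists.iloc[:, 0] - dists.iloc[:, 1]).abs().sum() / 2

observed = tvd_of_groups(toy)

# Under the null hypothesis, missingness is unrelated to 'gender', so
# shuffling the missingness labels should produce TVDs like the observed one.
labels = toy['child_missing'].to_numpy()
simulated = np.array([
    tvd_of_groups(toy.assign(child_missing=rng.permutation(labels)))
    for _ in range(200)
])

# Proportion of shuffled TVDs that are at least as large as the observed TVD.
p_value = (simulated >= observed).mean()
print(observed, p_value)
```

A large p-value means the observed difference between the two gender distributions is consistent with chance, so we fail to reject the hypothesis that missingness of 'child' is unrelated to 'gender'. A small p-value would suggest that missingness depends on 'gender'.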