In [1]:

```
from dsc80_utils import *
```

In [2]:

```
import lec08_utils as util
```

## 📣 Announcements 📣¶

- Good job on Project 2 checkpoint!
- Lab 4 due Monday.
- The midterm exam is next week, on Thursday, Nov 2.

## 📝 Midterm Exam¶

- Thurs, Nov 2 from 3:30-4:50pm in WLH 2005.
- Pen and paper only. No calculators, phones, or watches allowed.
- You are allowed to bring one double-sided 8.5" x 11" sheet of handwritten notes.
- No reference sheet given, unlike DSC 10!

- We will display clarifications and the time remaining during the exam.
- Covers Lectures 1-8, Labs 1-4, and Projects 1-2.
- To review problems from old exams, go to practice.dsc80.com.
- Also look at the Resources tab on the course website.

## 📆 Agenda¶

- Review of missingness mechanisms
- Deciding between MCAR and MAR using a hypothesis test
- The Kolmogorov-Smirnov test statistic

- Mean Imputation
- Probabilistic Imputation
- Multiple Imputation

## Review: Missingness mechanisms¶

A good strategy is to assess missingness in the following order.

**Missing by design (MD)**

*Can I determine the missing value exactly by looking at the other columns?* 🤔

**Not missing at random (NMAR)**

*Is there a good reason why the missingness depends on the values themselves?* 🤔

**Missing at random (MAR)**

*Do other columns tell me anything about the likelihood that a value is missing?* 🤔

**Missing completely at random (MCAR)**

*The missingness must not depend on other columns or the values themselves.* 😄

## Review: Assessing missingness through data¶

### Example: Heights¶

- Let's load in Galton's dataset containing the heights of adult children and their parents (which you may have seen in DSC 10).
- The dataset does not contain any missing values – we will **artificially introduce missing values** such that the values are MCAR, for illustration.

In [3]:

```
heights = pd.read_csv('data/midparent.csv')
heights = heights.rename(columns={'childHeight': 'child'})
heights = heights[['father', 'mother', 'gender', 'child']]
heights.head()
```

Out[3]:

|   | father | mother | gender | child |
|---|--------|--------|--------|-------|
| 0 | 78.5 | 67.0 | male | 73.2 |
| 1 | 78.5 | 67.0 | female | 69.2 |
| 2 | 78.5 | 67.0 | female | 69.0 |
| 3 | 78.5 | 67.0 | female | 69.0 |
| 4 | 75.5 | 66.5 | male | 73.5 |

### Simulating MCAR data¶

- We will make `'child'` MCAR by taking a random subset of `heights` and setting the corresponding `'child'` heights to `np.nan`.
- This is equivalent to flipping a (biased) coin for each row. If heads, we delete the `'child'` height.
- **You will not do this in practice!**

In [4]:

```
np.random.seed(42) # So that we get the same results each time (for lecture).
heights_mcar = heights.copy()
idx = heights_mcar.sample(frac=0.3).index
heights_mcar.loc[idx, 'child'] = np.nan  # np.NaN was removed in NumPy 2.0.
```
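The coin-flip framing above can also be written directly. A minimal sketch on a hypothetical five-row frame (the data values are made up for illustration); note that unlike `sample(frac=0.3)`, flipping independent coins makes the *number* of missing values itself random:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical five-row frame standing in for heights.
df = pd.DataFrame({'child': [73.2, 69.2, 69.0, 69.0, 73.5]})

# Flip a biased coin (heads with probability 0.3) for each row,
# and delete the 'child' height wherever it lands heads.
heads = rng.random(len(df)) < 0.3
df.loc[heads, 'child'] = np.nan
print(df)
```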

In [5]:

```
heights_mcar.head(10)
```

Out[5]:

|   | father | mother | gender | child |
|---|--------|--------|--------|-------|
| 0 | 78.5 | 67.0 | male | 73.2 |
| 1 | 78.5 | 67.0 | female | 69.2 |
| 2 | 78.5 | 67.0 | female | NaN |
| ... | ... | ... | ... | ... |
| 7 | 75.5 | 66.5 | female | NaN |
| 8 | 75.0 | 64.0 | male | 71.0 |
| 9 | 75.0 | 64.0 | female | 68.0 |

10 rows × 4 columns

In [6]:

```
heights_mcar.isna().mean()
```

Out[6]:

```
father    0.0
mother    0.0
gender    0.0
child     0.3
dtype: float64
```

Aside: Why is the value for `'child'` in the above Series not exactly 0.3?
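A small sketch of why: `sample(frac=0.3)` selects `round(0.3 * n)` rows, which is rarely exactly 30% of `n`. On a hypothetical 7-row frame:

```python
import numpy as np
import pandas as pd

# Hypothetical 7-row frame: 0.3 * 7 = 2.1, which pandas rounds to 2 rows.
df = pd.DataFrame({'x': np.arange(7.0)})
idx = df.sample(frac=0.3, random_state=0).index
df.loc[idx, 'x'] = np.nan

# The realized missing fraction is 2/7 ≈ 0.286, not 0.3.
print(df['x'].isna().mean())
```

The displayed `0.3` above is just rounding in the printed output.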

### Verifying that child heights are MCAR in `heights_mcar`¶

- Each row of `heights_mcar` belongs to one of two **groups**:
  - Group 1: `'child'` is missing.
  - Group 2: `'child'` is not missing.

In [7]:

```
heights_mcar['child_missing'] = heights_mcar['child'].isna()
heights_mcar.head()
```

Out[7]:

|   | father | mother | gender | child | child_missing |
|---|--------|--------|--------|-------|---------------|
| 0 | 78.5 | 67.0 | male | 73.2 | False |
| 1 | 78.5 | 67.0 | female | 69.2 | False |
| 2 | 78.5 | 67.0 | female | NaN | True |
| 3 | 78.5 | 67.0 | female | 69.0 | False |
| 4 | 75.5 | 66.5 | male | 73.5 | False |

- We need to look at the distributions of every other column – `'gender'`, `'mother'`, and `'father'` – separately for these two groups, and check to see if they are similar.

### Comparing null and non-null `'child'` distributions for `'gender'`¶

In [8]:

```
gender_dist = (
heights_mcar
.assign(child_missing=heights_mcar['child'].isna())
.pivot_table(index='gender', columns='child_missing', aggfunc='size')
)
# Added just to make the resulting pivot table easier to read.
gender_dist.columns = ['child_missing = False', 'child_missing = True']
gender_dist = gender_dist / gender_dist.sum()
gender_dist
```

Out[8]:

| gender | child_missing = False | child_missing = True |
|--------|-----------------------|----------------------|
| female | 0.49 | 0.48 |
| male | 0.51 | 0.52 |

Note that here, each column is a separate distribution that adds to 1.

- The two columns look similar, which is evidence that `'child'`'s missingness does not depend on `'gender'`.
  - Knowing that the child is `'female'` doesn't make it any more or less likely that their height is missing than knowing if the child is `'male'`.

### Comparing null and non-null `'child'` distributions for `'gender'`¶

- In the previous slide, we saw that the distribution of `'gender'` is similar whether or not `'child'` is missing.
- To make precise what we mean by "similar", we can run a **permutation test**. We are comparing two distributions:
  - The distribution of `'gender'` when `'child'` is missing.
  - The distribution of `'gender'` when `'child'` is not missing.
- What test statistic do we use to compare categorical distributions?

In [9]:

```
gender_dist.plot(kind='barh', title='Gender by Missingness of Child Height', barmode='group')
```
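One natural choice for comparing two categorical distributions (as in DSC 10) is the **total variation distance (TVD)**: half the sum of the absolute differences between the two distributions. A minimal permutation-test sketch, using synthetic stand-in data (the real analysis would use `heights_mcar` itself):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic stand-in for heights_mcar: a gender column and a missingness flag,
# drawn independently so the missingness is MCAR by construction.
df = pd.DataFrame({
    'gender': rng.choice(['female', 'male'], size=500),
    'child_missing': rng.choice([True, False], size=500, p=[0.3, 0.7]),
})

def tvd_of_groups(df):
    # Distribution of 'gender' within each missingness group (columns sum to 1).
    dist = df.pivot_table(index='gender', columns='child_missing',
                          aggfunc='size', fill_value=0)
    dist = dist / dist.sum()
    # TVD: half the sum of absolute differences between the two columns.
    return dist.diff(axis=1).iloc[:, -1].abs().sum() / 2

observed = tvd_of_groups(df)

# Permutation test: shuffle the missingness labels, recompute the TVD.
shuffled_tvds = []
for _ in range(500):
    shuffled = df.assign(child_missing=rng.permutation(df['child_missing']))
    shuffled_tvds.append(tvd_of_groups(shuffled))

# Large p-value => no evidence against MCAR (w.r.t. this column).
p_value = (np.array(shuffled_tvds) >= observed).mean()
print(observed, p_value)
```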