In [1]:

```
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
plt.style.use('seaborn-white')
plt.rc('figure', dpi=100, figsize=(10, 5))
plt.rc('font', size=12)
```

- Project 2's checkpoint is due **tomorrow at 11:59PM**, and the full project is due on **Saturday, April 30th at 11:59PM**.
- Discussion section 4 (all about visualization 📊) is today, and the assignment is due for extra credit on **Saturday, April 23rd at 11:59PM**.
- Lab 4 is due on **Monday, April 25th at 11:59PM**.
  - No hidden tests, just for this lab! Public notebook tests = Gradescope tests.
  - Check here for clarifications.

- Grades are released for Lab 2 and Project 1! Look here for details.
- The Midterm Exam is **in-class on Wednesday, April 27th.** More details to come.
  - It will cover Lectures 1-11, Labs 1-4, Projects 1-2, and Discussions 1-4.
  - Unless you email Suraj beforehand, you **must take the exam during the section you are enrolled in**.
  - A single, two-sided cheat sheet will be allowed.

- 🚨 If 80% of the class fills out the **Mid-Quarter Survey**, then everyone will receive an extra point on the Midterm Exam. 🚨

- Speeding up permutation tests.
- Missingness mechanisms.

- Permutation tests help decide whether **two samples came from the same distribution**.
- In a permutation test, we simulate data under the null by **shuffling** either group labels or numerical features.
  - In effect, this **randomly assigns individuals to groups**.
- If the two distributions are numeric, we use as our test statistic the difference in group means or medians.
- If the two distributions are categorical, we use as our test statistic the total variation distance (TVD).
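
Both statistics can be sketched quickly on toy data. The tiny DataFrame and the two categorical distributions below are hypothetical, for illustration only:

```
import pandas as pd

# Hypothetical numeric data with two groups.
df = pd.DataFrame({
    'group': ['A', 'A', 'A', 'B', 'B'],
    'value': [1.0, 2.0, 3.0, 4.0, 6.0],
})

# Numeric feature: (absolute) difference in group means.
means = df.groupby('group')['value'].mean()
diff_in_means = abs(means.diff().iloc[-1])
print(diff_in_means)  # 3.0

# Categorical feature: total variation distance between two
# categorical distributions (half the sum of absolute differences).
dist1 = pd.Series([0.4, 0.4, 0.2])
dist2 = pd.Series([0.3, 0.5, 0.2])
tvd = (dist1 - dist2).abs().sum() / 2
print(tvd)  # ≈ 0.1
```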

- A permutation test, like all simulation-based hypothesis tests, generates an **approximation** of the distribution of the test statistic.
  - If we found **all** permutations, the distribution would be exact!
- If there are $a$ elements in one group and $b$ in the other, the total number of permutations is ${a + b \choose a}$.
  - If $a = 100$ and $b = 150$, there are ${250 \choose 100} \approx 6 \cdot 10^{71}$ permutations!
- The more repetitions we use, the better our approximation will be.
- Unfortunately, our code is pretty slow, so we can't use many repetitions.
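
We can sanity-check that count directly with Python's built-in `math.comb`:

```
import math

# Number of distinct ways to choose which 100 of the 250
# individuals end up in the first group.
n_permutations = math.comb(250, 100)
print(f"{n_permutations:.2e}")  # roughly 6e+71
```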

In [2]:

```
baby_fp = os.path.join('data', 'baby.csv')
baby = pd.read_csv(baby_fp)
smoking_and_birthweight = baby[['Maternal Smoker', 'Birth Weight']]
```

In [3]:

```
smoking_and_birthweight.head()
```

Out[3]:

| | Maternal Smoker | Birth Weight |
|---|---|---|
| 0 | False | 120 |
| 1 | False | 113 |
| 2 | True | 128 |
| 3 | True | 108 |
| 4 | False | 136 |

Recall our permutation test from last class:

**Null hypothesis**: In the population, birth weights of smokers and non-smokers have the same distribution. The difference we saw was due to random chance.

**Alternative hypothesis**: In the population, babies born to smokers have lower birth weights, on average.

We'll use 3000 repetitions instead of 500.

In [4]:

```
%%time
n_repetitions = 3000

differences = []
for _ in range(n_repetitions):
    # Step 1: Shuffle the weights.
    shuffled_weights = (
        smoking_and_birthweight['Birth Weight']
        .sample(frac=1)
        .reset_index(drop=True)  # Be sure to reset the index! (Why?)
    )

    # Step 2: Put them in a DataFrame.
    shuffled = (
        smoking_and_birthweight
        .assign(**{'Shuffled Birth Weight': shuffled_weights})
    )

    # Step 3: Compute the test statistic.
    group_means = (
        shuffled
        .groupby('Maternal Smoker')
        .mean()
        .loc[:, 'Shuffled Birth Weight']
    )
    difference = group_means.diff().iloc[-1]

    # Step 4: Store the result.
    differences.append(difference)

differences[:10]
```

CPU times: user 2.34 s, sys: 17.4 ms, total: 2.36 s Wall time: 2.36 s

Out[4]:

[1.9127656657068428, -1.2674101497630943, 0.031131831131830268, -0.8953760836113815, 0.9969895028718554, -0.2657799716623117, -0.015372427137137379, -0.08334018922255382, -0.31228422993129357, -1.6895257248198448]

**Improvement 1:** Use `np.random.permutation` instead of `df.sample`.

**Why?** This way, we don't need to shuffle the index as well. This is how you ran permutation tests in DSC 10.

In [5]:

```
to_shuffle = smoking_and_birthweight.copy()
weights = to_shuffle['Birth Weight']
```

In [6]:

```
%%timeit
np.random.permutation(weights.values)
```

20 µs ± 161 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [7]:

```
%%timeit
weights.sample(frac=1)
```

56.6 µs ± 401 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

**Improvement 2:** Don't use `assign`; instead, add the new column in-place.

**Why?** This way, we don't create a new copy of our DataFrame on each iteration.

In [8]:

```
%%timeit
to_shuffle['Shuffled Birth Weight'] = np.random.permutation(weights.values)
```

35.3 µs ± 49.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [9]:

```
%%timeit
to_shuffle.assign(**{'Shuffled Birth Weight': np.random.permutation(weights.values)})
```

117 µs ± 3.05 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Let's try out both of these improvements, again with 3000 repetitions.

In [10]:

```
%%time
n_repetitions = 3000

faster_differences = []
to_shuffle = smoking_and_birthweight.copy()
weights = to_shuffle['Birth Weight'].values

for _ in range(n_repetitions):
    # Step 1: Shuffle the weights.
    shuffled_weights = np.random.permutation(weights)

    # Step 2: Put them in a DataFrame.
    to_shuffle['Shuffled Birth Weight'] = shuffled_weights

    # Step 3: Compute the test statistic.
    group_means = (
        to_shuffle
        .groupby('Maternal Smoker')
        .mean()
        .loc[:, 'Shuffled Birth Weight']
    )
    difference = group_means.diff().iloc[-1]

    # Step 4: Store the result.
    faster_differences.append(difference)

faster_differences[:10]
```

CPU times: user 1.75 s, sys: 3.85 ms, total: 1.75 s Wall time: 1.75 s

Out[10]:

[0.44967015555249645, -0.2049667108490496, 0.1205630970336955, -1.7718024894495414, -0.9347258406081949, 0.14560385148621435, 1.0578027636851175, -0.4196017490135233, 1.5407315995551158, 0.38885689473924856]

In [11]:

```
bins = np.linspace(-4, 4, 20)
plt.hist(differences, density=True, ec='w', bins=bins, alpha=0.65, label='Original Test Statistics');
plt.hist(faster_differences, density=True, ec='w', bins=bins, alpha=0.65, label='Faster Test Statistics')
plt.legend();
```

The distribution of test statistics generated by the faster approach resembles the distribution of test statistics generated by the original approach. Both are doing the same thing!

- Both of our previous approaches involved calling `groupby` inside of a loop.
- We can avoid `groupby` entirely!
- Let's start by generating a Boolean array of size `(3000, 1174)`.
  - Each row will correspond to a single permutation of the `'Maternal Smoker'` (`bool`) column.

In [12]:

```
is_smoker = smoking_and_birthweight['Maternal Smoker'].values
weights = smoking_and_birthweight['Birth Weight'].values
```

In [13]:

```
is_smoker
```

Out[13]:

array([False, False, True, ..., True, False, False])

In [14]:

```
%%time
np.random.seed(24)  # So that we get the same results each time (for lecture)

# We are still using a for-loop!
is_smoker_permutations = np.column_stack([
    np.random.permutation(is_smoker)
    for _ in range(3000)
]).T
```

CPU times: user 88 ms, sys: 3.51 ms, total: 91.5 ms Wall time: 90.4 ms

In `is_smoker_permutations`, each row is a new simulation.

- `False` means that baby comes from a non-smoking mother.
- `True` means that baby comes from a smoking mother.

In [15]:

```
is_smoker_permutations
```

Out[15]:

array([[ True, True, False, ..., True, True, False], [ True, False, True, ..., True, False, False], [False, True, False, ..., True, False, True], ..., [False, False, True, ..., False, False, True], [ True, True, False, ..., True, False, True], [ True, True, True, ..., False, True, True]])

In [16]:

```
is_smoker_permutations.shape
```

Out[16]:

(3000, 1174)

Note that each row has 459 `True`s and 715 `False`s – it's just the order of them that's different.

In [17]:

```
is_smoker_permutations.sum(axis=1)
```

Out[17]:

array([459, 459, 459, ..., 459, 459, 459])

The first row of `is_smoker_permutations` tells us that in this permutation, we'll assign baby 1 to "smoker", baby 2 to "smoker", baby 3 to "non-smoker", and so on.

In [18]:

```
is_smoker_permutations[0]
```

Out[18]:

array([ True, True, False, ..., True, True, False])

- If we multiply `is_smoker_permutations` by `weights`, we will create a **new** (3000, 1174) array, in which:
  - the weights of babies assigned to "smoker" are present, and
  - the weights of babies assigned to "non-smoker" are 0.
- `is_smoker_permutations` is a "mask".

First, let's try this on just the first permutation (i.e. the first row of `is_smoker_permutations`).

In [19]:

```
weights * is_smoker_permutations[0]
```

Out[19]:

array([120, 113, 0, ..., 130, 125, 0])

Now, on all of `is_smoker_permutations`:

In [20]:

```
weights * is_smoker_permutations
```

Out[20]:

array([[120, 113, 0, ..., 130, 125, 0], [120, 0, 128, ..., 130, 0, 0], [ 0, 113, 0, ..., 130, 0, 117], ..., [ 0, 0, 128, ..., 0, 0, 117], [120, 113, 0, ..., 130, 0, 117], [120, 113, 128, ..., 0, 125, 117]])

The mean of the **non-zero** entries in a row is the mean of the weights of "smoker" babies in that permutation.

*Why can't we use .mean(axis=1)?*
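
As a hint, here's a minimal sketch (with hypothetical numbers): after masking, the zeros left behind for "non-smoker" babies still count toward the denominator of `.mean()`:

```
import numpy as np

weights = np.array([120, 113, 128])   # hypothetical weights
mask = np.array([True, False, True])  # one "smoker" assignment

masked = weights * mask               # array([120, 0, 128])

# .mean() divides by ALL entries (including the zeros for
# non-smokers), not just the entries we kept.
wrong = masked.mean()                 # (120 + 0 + 128) / 3
right = masked.sum() / mask.sum()     # (120 + 128) / 2

print(wrong, right)  # 82.66... 124.0
```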

In [21]:

```
n_smokers = is_smoker.sum()
mean_smokers = (weights * is_smoker_permutations).sum(axis=1) / n_smokers
mean_smokers
```

Out[21]:

array([118.94117647, 118.07625272, 120.38562092, ..., 119.76688453, 119.2745098 , 120.22222222])

In [22]:

```
mean_smokers.shape
```

Out[22]:

(3000,)

We also need to get the weights of the non-smokers in our permutations. We can do this by "inverting" the `is_smoker_permutations` mask and performing the same calculations.

In [23]:

```
n_non_smokers = 1174 - n_smokers
mean_non_smokers = (weights * ~is_smoker_permutations).sum(axis=1) / n_non_smokers
mean_non_smokers
```

Out[23]:

array([119.7972028 , 120.35244755, 118.86993007, ..., 119.26713287, 119.58321678, 118.97482517])

In [24]:

```
test_statistics = mean_smokers - mean_non_smokers
test_statistics
```

Out[24]:

array([-0.85602633, -2.27619483, 1.51569085, ..., 0.49975166, -0.30870698, 1.24739705])

In [25]:

```
%%time
is_smoker = smoking_and_birthweight['Maternal Smoker'].values
weights = smoking_and_birthweight['Birth Weight'].values
n_smokers = is_smoker.sum()
n_non_smokers = 1174 - n_smokers
is_smoker_permutations = np.column_stack([
    np.random.permutation(is_smoker)
    for _ in range(3000)
]).T
mean_smokers = (weights * is_smoker_permutations).sum(axis=1) / n_smokers
mean_non_smokers = (weights * ~is_smoker_permutations).sum(axis=1) / n_non_smokers
ultra_fast_differences = mean_smokers - mean_non_smokers
```

CPU times: user 96.5 ms, sys: 4.7 ms, total: 101 ms Wall time: 99.1 ms

In [26]:

```
plt.hist(differences, density=True, ec='w', bins=bins, alpha=0.65, label='Original Test Statistics');
plt.hist(ultra_fast_differences, density=True, ec='w', bins=bins, alpha=0.65, label='Ultra Fast Test Statistics')
plt.legend();
```

Again, the distribution of test statistics with the "ultra-fast" simulation is similar to the original distribution of test statistics.

_Good resources: course notes, Wikipedia, this textbook page_

- When studying a problem, we are interested in understanding the **true model** in nature.
- The data generating process is the "real-world" version of the model, which generates the data that we observe.
- The recorded data is **supposed** to "well-represent" the data generating process, and subsequently the true model.

Example: Consider the upcoming Midterm Exam.

- The exam is meant to be a **model** of your **true** knowledge of DSC 80 concepts.
- The data generating process should give us a sense of your true knowledge, but is influenced by the specific questions on the exam, your preparation for the exam, whether or not you are sick on the day of the exam, etc.
- The recorded data consists of the final answers you write on the exam page.

**Problem 1:** Your data is not representative, i.e. you collected a poor sample.

- If the exam only asked questions about `pivot_table`, that would not give us an accurate picture of your understanding of DSC 80!

**Problem 2:** Some of the entries are missing.

- If you left some questions blank, why?

We will focus on the second problem.

There are four key ways in which values can be missing. It is important to distinguish between these types so that we can correctly **handle** missing data (Lecture 13).

- **Missing by design (MD)**.
- **Not missing at random (NMAR)**.
  - Also called "non-ignorable" (NI), especially in Lab 4.
- **Missing at random (MAR)**.
  - Also called "conditionally ignorable".
- **Missing completely at random (MCAR)**.
  - Also called "unconditionally ignorable".

Values in a column are missing by design if:

- the designers of the data collection process **intentionally decided to not collect data in that column**,
- because it can be recovered from other columns.

If you can determine whether a value is missing solely using other columns, then the data is missing by design.

- For example: `'Age4'` is missing if and only if `'Number of People'` is less than 4.

Refer to this StackExchange link for more examples.
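
A quick way to check the MD condition in code – using a hypothetical household dataset with the `'Age4'` / `'Number of People'` rule above – is to verify that missingness is perfectly predictable from the other column:

```
import numpy as np
import pandas as pd

# Hypothetical data: 'Age4' (age of the 4th household member) is
# recorded only when the household has at least 4 people.
households = pd.DataFrame({
    'Number of People': [2, 5, 3, 4],
    'Age4': [np.nan, 31.0, np.nan, 8.0],
})

# If 'Age4' is missing by design, its missingness should be
# perfectly predictable from 'Number of People'.
predicted_missing = households['Number of People'] < 4
actually_missing = households['Age4'].isna()
print((predicted_missing == actually_missing).all())  # True
```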

**Example:** `'Car Type'` and `'Car Colour'` are missing if and only if `'Own a car?'` is `'No'`.

Not missing at random (NMAR).

- The chance that a value is missing **depends on the actual missing value**!
- Weird name, because it's still random...

Missing at random (MAR).

- The chance that a value is missing **depends on other columns**, but **not** the actual missing value itself.

Missing completely at random (MCAR).

- The chance that a value is missing is **completely independent** of
  - other columns, and
  - the actual missing value.

Consider the following (contrived) example:

- We survey 100 people for their favorite color and birth month.
- We write their answers on index cards.
  - On the left side, we write colors.
  - On the right side, we write birth months.
- A (bad) dog takes the top 10 cards from the stack and chews off the right side (birth months).
  - Now ten people are missing birth months!

We are now missing birth months for the first 10 people we surveyed. What is the missingness mechanism for birth months if:

- Cards were sorted by favorite color?
- Cards were sorted by birth month?
- Cards were shuffled?

Remember:

- **Not missing at random (NMAR):** The chance that a value is missing **depends on the actual missing value**!
- **Missing at random (MAR):** The chance that a value is missing **depends on other columns**, but **not** the actual missing value itself.
- **Missing completely at random (MCAR):** The chance that a value is missing is **completely independent** of other columns and the actual missing value.

- If cards were sorted by favorite color, then:
  - The fact that a card is missing a month is **related to the favorite color**.
  - Since the missingness depends on another column, we say values are **missing at random (MAR)**.
    - The missingness doesn't depend on the actual missing values – early months are no more likely to be missing than later months, for instance.
- If cards were sorted by birth month, then:
  - The fact that a card is missing a month is **related to the missing month**.
  - Since the missingness depends on the actual missing values – early months are more likely to be missing than later months – we say values are **not missing at random (NMAR)**.
- If cards were shuffled, then:
  - The fact that a card is missing a month is **related to nothing**.
  - Since the missingness depends on nothing, we say values are **missing completely at random (MCAR)**.

- In our contrived example, the distinction between NMAR, MAR, and MCAR was relatively clear.
- However, in more practical examples, it can be hard to distinguish between types of missingness.
- Domain knowledge is often needed to understand **why** values might be missing.

- Data is NMAR if the chance that a value is missing **depends on the actual missing value**!
  - It could *also* depend on other columns.
- Another term for NMAR is "non-ignorable" – the fact that data is missing is data in and of itself that we cannot ignore.

**Example:** On an employment survey, people with really high incomes may be less likely to report their income.

- If we **ignore** missingness and compute the mean salary, our result will be **biased** low!

**Example:** A person doesn't take a drug test because they took drugs the day before.

- When data is NMAR, we must reason about why the data is missing using domain expertise on the data generating process – the other columns in our data won't help.

- Data is MAR if the chance that a value is missing **depends on other columns**, but **not** the actual missing value itself.
- Another term for MAR is "conditionally ignorable".

**Example:** An elementary school teacher keeps track of the health conditions of each student in their class. One day, a student doesn't show up for a test because they are at the hospital.

- The fact that their test score is missing has nothing to do with the test score itself.
- But the teacher could have predicted that the score would be missing, given the other information they had about the student.

**Example:** People who work in the service industry may be less likely to report their income.

- Data is MCAR if the chance that a value is missing is **completely independent** of other columns and the actual missing value.
- Another term for MCAR is "unconditionally ignorable" – the fact that a value is missing is not noteworthy.

**Example:** There is a bank of 10 exam questions, and each student's Midterm Exam consists of a random sample of 5 questions. Each student will have missing scores for the other 5 questions; those scores will be MCAR.

**Example:** After the Midterm Exam, I accidentally spill boba on the top of the stack. Assuming that the exams are in a random order, the exam scores that are lost due to this spill will be MCAR. (Hopefully this doesn't happen!)

- You can argue that many of these examples are NMAR, by arguing that the missingness depends on the value of the data that is missing.
  - For example, if a student is hospitalized, they may have lots of health problems and may not have spent much time on school, leading to their test scores being worse.
- Fair point, but with that logic, *almost everything is NMAR*.
- What we really care about is **the main reason data is missing**.
- If the other columns **mostly** explain the missing value and missingness, treat it as MAR.

A good strategy is to assess missingness in the following order: first rule out missing by design, then consider NMAR, and finally distinguish MAR from MCAR.

In each of the following examples, decide whether the missing data are MD, NMAR, MAR, or MCAR:

- A table for a medical study has columns for `'gender'` and `'age'`. `'age'` has missing values.
- Measurements from the Hubble Space Telescope are **dropped during transmission**.
- A table has a single column, `'self-reported education level'`, **which contains missing values**.
- A table of grades contains three columns, `'Version 1'`, `'Version 2'`, and `'Version 3'`. **$\frac{2}{3}$ of the entries in the table are `NaN`.**

- If a dataset contains missing values, it is likely not an accurate picture of the data generating process.
- By identifying missingness mechanisms, we can best **fill in** missing values, to gain a better understanding of the DGP.

Suppose we have:

- A dataset $Y$ with observed values $Y_{obs}$ and missing values $Y_{mis}$.
- A parameter $\psi$ that represents all relevant information that is not part of the dataset.

Data is **missing completely at random** (MCAR) if

$$\text{P}(\text{data is present} \mid Y_{obs}, Y_{mis}, \psi) = \text{P}(\text{data is present} \mid \psi)$$

That is, adding information about the dataset doesn't change the likelihood that data is missing!

Suppose we have:

- A dataset $Y$ with observed values $Y_{obs}$ and missing values $Y_{mis}$.
- A parameter $\psi$ that represents all relevant information that is not part of the dataset.

Data is **missing at random** (MAR) if

$$\text{P}(\text{data is present} \mid Y_{obs}, Y_{mis}, \psi) = \text{P}(\text{data is present} \mid Y_{obs}, \psi)$$

That is, MAR data is **actually MCAR**, **conditional** on $Y_{obs}$.

Suppose we have:

- A dataset $Y$ with observed values $Y_{obs}$ and missing values $Y_{mis}$.
- A parameter $\psi$ that represents all relevant information that is not part of the dataset.

Data is **not missing at random** (NMAR) if

$$\text{P}(\text{data is present} \mid Y_{obs}, Y_{mis}, \psi)$$

cannot be simplified. That is, in NMAR data, **missingness is dependent on the missing value** itself.
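
One way to internalize these definitions is to simulate each mechanism. The sketch below (entirely hypothetical data and missingness rates) generates MCAR, MAR, and NMAR missingness, and shows the bias that NMAR introduces:

```
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000
df = pd.DataFrame({
    'income': rng.normal(60_000, 15_000, n),
    'service_industry': rng.random(n) < 0.3,
})

# MCAR: every value is equally likely to be missing.
mcar_missing = rng.random(n) < 0.2

# MAR: missingness depends on another (observed) column.
mar_missing = rng.random(n) < np.where(df['service_industry'], 0.4, 0.1)

# NMAR: missingness depends on the (unobserved) value itself --
# here, higher incomes are more likely to be missing.
nmar_missing = rng.random(n) < np.where(df['income'] > 75_000, 0.5, 0.05)

# Under NMAR, the mean of the observed incomes is biased low.
print(df['income'].mean())                 # ~60,000
print(df['income'][~nmar_missing].mean())  # noticeably lower
```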

- Suppose I believe that the missingness mechanism of a column is NMAR, MAR, or MCAR.
  - I've ruled out missing by design (a good first step).
- Can I check whether this is true by looking at the data?

- We can't determine if data is NMAR just by looking at the data, as whether or not data is NMAR depends on the **unobserved data**.
- To establish if data is NMAR, we must either:
  - **reason about the data generating process**, or
  - collect more data.

**Example:** Consider a dataset of survey data of students' self-reported happiness. The data contains PIDs and happiness scores; nothing else. Some happiness scores are missing. **Are happiness scores likely NMAR?**

- Data are MAR if the missingness only depends on **observed** data.
- After reasoning about the data generating process, if you establish that data is not NMAR, then it must be either MAR or MCAR.
- The more columns we have in our dataset, the "weaker the NMAR effect" is.
  - Adding more columns → controlling for more variables → moving from NMAR to MAR.
  - **Example:** With no other columns, income in a census is NMAR. But once we look at location, education, and occupation, incomes are closer to being MAR.

- For data to be MCAR, the chance that values are missing should not depend on any other column or on the values themselves.

**Example:** Consider a dataset of phones, in which we store the screen size and price of each phone. **Some prices are missing.**

| Phone | Screen Size | Price |
|---|---|---|
| iPhone 13 | 6.06 | 999 |
| Galaxy Z Fold 3 | 7.6 | NaN |
| OnePlus 9 Pro | 6.7 | 799 |
| iPhone 12 Pro Max | 6.68 | NaN |

- If prices are MCAR, then **the distribution of screen size should be the same** for:
  - phones whose prices are missing, and
  - phones whose prices aren't missing.

**We can use a permutation test to decide between MAR and MCAR!** We are asking the question: did these two samples come from the same underlying distribution?
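
As a preview of next time, here is a minimal sketch (with hypothetical phone data) of that permutation test: shuffle the "price is missing" labels, and compare the difference in mean screen sizes between the two groups:

```
import numpy as np
import pandas as pd

# Hypothetical phones; some prices are missing.
phones = pd.DataFrame({
    'Screen Size': [6.06, 7.6, 6.7, 6.68, 6.1, 6.4],
    'Price': [999, np.nan, 799, np.nan, 699, np.nan],
})

sizes = phones['Screen Size'].to_numpy()
missing = phones['Price'].isna().to_numpy()

def mean_diff(missing_mask):
    # Difference in mean screen size: missing-price vs. present-price.
    return abs(sizes[missing_mask].mean() - sizes[~missing_mask].mean())

observed = mean_diff(missing)

# Shuffle the missingness labels to simulate the null ("prices are
# MCAR with respect to screen size").
rng = np.random.default_rng(0)
simulated = np.array([
    mean_diff(rng.permutation(missing)) for _ in range(1000)
])

# Large p-value: consistent with MCAR; small p-value suggests the
# missingness depends on 'Screen Size' (MAR).
p_value = (simulated >= observed).mean()
print(p_value)
```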

- **Missing by design (MD):** Whether or not a value is missing depends entirely on the data in other columns. In other words, if we can always predict if a value will be missing given the other columns, the data is MD.
- **Not missing at random (NMAR, also called NI):** The chance that a value is missing **depends on the actual missing value**!
- **Missing at random (MAR):** The chance that a value is missing **depends on other columns**, but **not** the actual missing value itself.
- **Missing completely at random (MCAR):** The chance that a value is missing is **completely independent** of other columns and the actual missing value.
- **Next time:** How to verify if data are MAR vs. MCAR using a permutation test. A new test statistic.
- **Important:** Refer to the Flowchart when deciding between missingness types.