In [1]:

```
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
plt.style.use('seaborn-white')  # on newer matplotlib, use 'seaborn-v0_8-white'
plt.rc('figure', dpi=100, figsize=(10, 5))
plt.rc('font', size=12)
```

- Discussion 2 is due (for extra credit) on **Saturday, April 9th at 11:59PM**.
- If you're looking for discussion recordings, look under DSC 80 Lecture A00 at podcast.ucsd.edu.
- Lab 2 is due on **Monday, April 11th at 11:59PM**.
- Project 1 is due on **Thursday, April 14th at 11:59PM**.

- The theory of hypothesis testing.
- Total variation distance.
- Another example – penguin bill lengths 🐧.

Throughout this lecture, we will look at how to "speed up" our hypothesis tests by avoiding `for`-loops entirely.

- Recall, a **null hypothesis** is an initial or default belief as to how data were generated.
- The null hypothesis must be a **probability model**, i.e. something that we can simulate under.
- Often, but not always, the null hypothesis states that there is no association or difference between variables or subpopulations, and that any observed differences were due to random chance.

- Examples:
- The coin was fair.
- The music preferences of Americans and Canadians are the same.
- The median number of Instagram followers of DSC majors is equal to the median number of Instagram followers of all students at UCSD.

- An **alternative hypothesis** is a different viewpoint as to how data were generated.
- The alternative hypothesis typically states that the difference between variables or subpopulations exists and is not due to random chance.
- Examples:
- The coin is biased towards heads.
- The coin is biased.
- The music preferences of Americans and Canadians are different.
- The median number of Instagram followers of DSC majors is greater than the median number of Instagram followers of all students at UCSD.

- A **test statistic** is a number that we compute in each repetition of an experiment, to help us make a decision.
- We'd like to choose a test statistic such that large values of the test statistic correspond to one hypothesis, and small values correspond to the other. **In other words, we'll try to avoid "two-tailed tests".**
- For example, suppose we flipped a coin $N$ times and we saw $N_H$ heads.

For the alternative hypothesis "the coin was biased towards heads", we could use:

- $N_H$ (number of heads).
- $\frac{N_H}{N}$ (proportion of heads).
- $N_H - \frac{N}{2}$ (difference from expected number of heads).

**If these test statistics are large, it means there were many heads. If they are small, it means there were few heads.**

For the alternative hypothesis "the coin was biased", we could use:

- $|N_H - \frac{N}{2}|$ (absolute difference from expected number of heads).

**If this test statistic is large, it means that there were many more heads than expected, or many fewer heads than expected. If this test statistic is small, it means that the number of heads was close to expected.**

- For instance, suppose we flip 100 coins, and I tell you that the absolute difference from the expected number of heads is 20.
  - Then, we flipped either 70 heads or 30 heads.
  - If our alternative hypothesis is that the coin was biased, then it doesn't matter in which direction it was biased, and this test statistic works.
  - But if our alternative hypothesis is that the coin was biased towards heads, then this statistic is not helpful, because we don't know whether there were 70 heads (evidence for the alternative) or 30 heads (evidence against it).
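To make this concrete, here is a minimal sketch (variable names are mine, not from the lecture) computing each candidate statistic for a hypothetical run of 100 flips:

```python
import numpy as np

rng = np.random.default_rng(0)
flips = rng.choice(['H', 'T'], size=100)  # hypothetical run of 100 flips

N = len(flips)
N_H = (flips == 'H').sum()

stat_one_sided = N_H                # large only when there are many heads
stat_proportion = N_H / N           # same information, on a 0-1 scale
stat_two_sided = abs(N_H - N / 2)   # large when biased in EITHER direction
```

For "biased towards heads", either of the first two works; for plain "biased", only the absolute difference does.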

After choosing a test statistic, we need to compute the **distribution of the test statistic, under the assumption that the null hypothesis is true** ("under the null").

- In DSC 10 and 80, we do this through simulation, which means our null distributions are **empirical** and our calculations are approximate.
- In other courses, you may do this by hand (e.g. in a coin-flipping example, you could use the binomial distribution).
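As a sanity check on the simulation that follows, here is a sketch of the by-hand binomial calculation for this lecture's coin example (114 flips, testing for 68 or more heads):

```python
from math import comb

# Exact P(N_H >= 68) when flipping a fair coin N = 114 times:
# sum the binomial probabilities of the tail {68, 69, ..., 114}.
N, observed = 114, 68
p_value = sum(comb(N, k) for k in range(observed, N + 1)) / 2**N
# Roughly 0.025, matching the simulated p-value computed later in this lecture.
```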

Below, we create a DataFrame of coin flips.

In [2]:

```
np.random.seed(42)
flips = pd.DataFrame(np.random.choice(['H', 'T'], p=[0.55, 0.45], size=(114, 1)), columns=['result'])
flips.head()
```

Out[2]:

|   | result |
|---|--------|
| 0 | H |
| 1 | T |
| 2 | T |
| 3 | T |
| 4 | H |

In [3]:

```
flips.shape
```

Out[3]:

(114, 1)

In [4]:

```
flips['result'].value_counts()
```

Out[4]:

```
H    68
T    46
Name: result, dtype: int64
```

There were 114 flips, 68 of which were heads.

- **Null hypothesis:** The coin was fair, and any deviations were due to random chance.
- **Alternative hypothesis:** The coin was biased towards heads.
- **Test statistic:** Number of heads ($N_H$).

Steps:

- Compute the **observed value** of the test statistic, i.e. the observed number of heads. (We already know this to be 68.)
- Simulate values of the test statistic under the null, i.e. under the assumption that the coin was fair.
- Use the resulting distribution to calculate the (approximate) probability of seeing 68 or more heads in 114 coin flips, under the assumption that the coin was fair.

In [5]:

```
# 100,000 times, we want to flip a coin 114 times
results = []
for _ in range(100000):
    simulation = np.random.choice(['H', 'T'], size=114)
    sim_heads = (simulation == 'H').sum()  # Test statistic
    results.append(sim_heads)
```

Each entry in `results`

is the number of heads in 114 simulated coin flips.

In [6]:

```
results[:10]
```

Out[6]:

[54, 51, 55, 52, 56, 66, 54, 66, 57, 57]

In [7]:

```
pd.Series(results).plot(kind='hist',
                        density=True,
                        bins=np.arange(35, 76, 1),
                        ec='w',
                        title='Number of Heads in 114 Flips of a Fair Coin');
obs = (flips['result'] == 'H').sum()
plt.axvline(x=obs, color='red', linewidth=2);
```

**Question:** Do you think the coin was fair?

In [8]:

```
(np.array(results) >= obs).mean()
```

Out[8]:

0.02481

- Under the assumption that the coin is fair, the probability of seeing 68 or more heads is ~2.5%.
  - This is called a **p-value**.
- So either:
  - The coin is fair and we saw a really rare event, or
  - The coin is not fair.
- We need a **cutoff** to determine whether to reject the null hypothesis, given this probability.

- The **p-value**, or **observed significance level**, is the probability, under the null hypothesis, that the test statistic is equal to the value that was observed in the data or is even further in the direction of the alternative.
- If the p-value is below a pre-determined **cutoff**, or **significance level**, we say that our observation is inconsistent with the null hypothesis and we **reject the null**.
  - 0.05 (significant) and 0.01 (highly significant) are common cutoffs.
- If the p-value is above the cutoff, we **fail to reject the null**.
- Note that the cutoff is an **error probability**.
  - If your cutoff is 0.05, then 5% of the time, you will incorrectly reject a true null hypothesis.

- Note that we are very careful in saying that we either **reject the null** or **fail to reject the null**.
- If our observed data is close to what we'd expect under the null, we fail to reject the null, but we still don't know whether it is true!
- The null hypothesis is very, very specific, e.g. the coin was exactly fair.
- Example:
  - Suppose there is a coin that is truly biased towards heads, with probability 0.55.
  - We flip it 10 times and see 5 heads and 5 tails.
  - If we conduct a hypothesis test where the null is that the coin is fair, we will fail to reject the null.
  - But the null isn't true.

- To estimate the null distribution, we flipped a coin 114 times, 100,000 times. This involved 100,000 iterations of a `for`-loop.
- We can avoid a `for`-loop entirely by using the `size` argument in `np.random.choice` (and in other `np` random functions).
  - Idea: Generate a 2D array of coin flips with 114 columns and 100,000 rows. Each row corresponds to one run of our previous `for`-loop.
  - This will work if each trial (e.g. coin flip) is independent.
- We can then sum each row of the resulting array to get our distribution. However, this requires the entries in the array to be numbers – let's replace `'H'` with 1 and `'T'` with 0.

In [9]:

```
flips.head()
```

Out[9]:

|   | result |
|---|--------|
| 0 | H |
| 1 | T |
| 2 | T |
| 3 | T |
| 4 | H |

In [10]:

```
flips_fast = flips.replace({'H': 1, 'T': 0})
flips_fast.head()
```

Out[10]:

|   | result |
|---|--------|
| 0 | 1 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 1 |

The following function takes in a value of `N` and flips a fair coin `N * 114` times.

In [11]:

```
def flip_114(N):
    return np.random.choice([1, 0], size=(N, 114))
```

In [12]:

```
flip_114(50)
```

Out[12]:

array([[1, 0, 0, ..., 0, 0, 1], [0, 1, 1, ..., 1, 1, 0], [1, 1, 1, ..., 1, 1, 0], ..., [0, 1, 0, ..., 1, 1, 1], [1, 1, 1, ..., 1, 0, 1], [0, 1, 0, ..., 0, 1, 0]])

In [13]:

```
flip_114(50).shape
```

Out[13]:

(50, 114)

In [14]:

```
%%time
# Flips a fair coin 100,000 * 114 times
simulations = pd.DataFrame(flip_114(100000))
# Compute test statistics
# Note that axis=1 will take the sum of each row of 114, which is what we want
results_fast = simulations.sum(axis=1)
```

```
CPU times: user 75.2 ms, sys: 37.2 ms, total: 112 ms
Wall time: 113 ms
```

In [15]:

```
pd.Series(results_fast).plot(kind='hist',
                             density=True,
                             bins=np.arange(35, 76, 1),
                             ec='w',
                             title='Number of Heads in 114 Flips of a Fair Coin');
plt.axvline(x=obs, color='red', linewidth=2);
```

In [16]:

```
eth = pd.DataFrame([['Asian', 0.15, 0.51],
['Black', 0.05, 0.02],
['Latino', 0.39, 0.16],
['White', 0.35, 0.2],
['Other', 0.06, 0.11]],
columns=['Ethnicity', 'California', 'UCSD']).set_index('Ethnicity')
eth
```

Out[16]:

| Ethnicity | California | UCSD |
|-----------|------------|------|
| Asian | 0.15 | 0.51 |
| Black | 0.05 | 0.02 |
| Latino | 0.39 | 0.16 |
| White | 0.35 | 0.20 |
| Other | 0.06 | 0.11 |

- We want to decide whether UCSD students were drawn at random from the state of California.
- The two **categorical distributions** above are clearly different. But how different are they?

Let's establish our hypotheses.

- **Null hypothesis:** UCSD students **were** selected at random from the population of California residents.
- **Alternative hypothesis:** UCSD students **were not** selected at random from the population of California residents.
- Our **observation** is the ethnic distribution of UCSD students.
- We need a way of quantifying **how different** two categorical distributions are.

In [17]:

```
eth.plot(kind='barh', title='Ethnic Distribution of California and UCSD');
```

The total variation distance (TVD) is a test statistic that describes the **distance between two categorical distributions**.

If $A = [a_1, a_2, ..., a_k]$ and $B = [b_1, b_2, ..., b_k]$ are both categorical distributions, then the TVD between $A$ and $B$ is

$$\text{TVD}(A, B) = \frac{1}{2} \sum_{i = 1}^k |a_i - b_i|$$

In [18]:

```
def total_variation_distance(dist1, dist2):
    '''Given two categorical distributions,
    both sorted with the same categories, calculates the TVD.'''
    return np.sum(np.abs(dist1 - dist2)) / 2
```

Below, we can compute the TVD between California's ethnic distribution and UCSD's ethnic distribution.

In [19]:

```
observed_tvd = total_variation_distance(eth['California'], eth['UCSD'])
observed_tvd
```

Out[19]:

0.41000000000000003

The issue is that we don't know whether this is a large value or a small value – that is, whether it's a typical TVD for a random sample drawn from California's distribution.

To conduct our hypothesis test:

- Repeatedly generate samples of size 30,000 (the number of UCSD students) from the ethnic distribution of all of California.
- Each time, compute the TVD between the simulated distribution and California's distribution. **This will generate an empirical distribution of TVDs, under the null.**
- Finally, determine whether the observed TVD is consistent with the empirical distribution of TVDs.

To sample from a categorical distribution, we use `np.random.multinomial`.

In [20]:

```
np.random.multinomial(10, [0.5, 0.5])
```

Out[20]:

array([9, 1])

In [21]:

```
eth['California']
```

Out[21]:

```
Ethnicity
Asian     0.15
Black     0.05
Latino    0.39
White     0.35
Other     0.06
Name: California, dtype: float64
```

In [22]:

```
np.random.multinomial(30000, eth['California'])
```

Out[22]:

array([ 4438, 1555, 11693, 10610, 1704])

In [23]:

```
np.random.multinomial(30000, eth['California']) / 30000
```

Out[23]:

array([0.14996667, 0.04803333, 0.3917 , 0.3493 , 0.061 ])

Now we need to repeat the process of creating samples, many, many times.

In [24]:

```
num_reps = 10000
tvds = []
for i in np.arange(num_reps):
    random_sample = np.random.multinomial(30000, eth['California']) / 30000
    new_tvd = total_variation_distance(eth['California'], random_sample)
    tvds.append(new_tvd)
```

In [25]:

```
pd.Series(tvds).plot(kind='hist',
                     density=True,
                     ec='w',
                     title='Simulated TVDs');
plt.axvline(x=observed_tvd, color='red', linewidth=2);
```

- The chance that the observed TVD came from the distribution of TVDs under the null is essentially 0.
- This matches our intuition from the start – the two distributions looked very different to begin with. But now we're quite sure the difference can't be explained solely due to chance.

Again, we can get rid of the loop by using the `size` argument!

In [26]:

```
eth_draws = np.random.multinomial(30000, eth['California'], size=num_reps) / 30000
eth_draws
```

Out[26]:

array([[0.1509 , 0.0478 , 0.3896 , 0.352 , 0.0597 ], [0.15026667, 0.0525 , 0.38723333, 0.35163333, 0.05836667], [0.15193333, 0.05166667, 0.38633333, 0.3499 , 0.06016667], ..., [0.15193333, 0.0506 , 0.39486667, 0.3439 , 0.0587 ], [0.14936667, 0.0477 , 0.39323333, 0.35003333, 0.05966667], [0.15036667, 0.0492 , 0.38883333, 0.354 , 0.0576 ]])

- Notice that each row of `eth_draws` sums to 1, because each row is a simulated categorical distribution.
- `eth_draws` has `num_reps` (10,000) rows.

In [27]:

```
eth_draws.shape
```

Out[27]:

(10000, 5)
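A quick, self-contained check of these two claims (rebuilding the California distribution from the table above):

```python
import numpy as np

rng = np.random.default_rng(0)
california = np.array([0.15, 0.05, 0.39, 0.35, 0.06])  # from the table above

draws = rng.multinomial(30000, california, size=10_000) / 30000

shape_ok = draws.shape == (10_000, 5)           # num_reps rows, 5 categories
rows_ok = np.allclose(draws.sum(axis=1), 1)     # each row is a distribution
```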

Our previous `total_variation_distance` function won't work with our 2D array `eth_draws`: its `np.sum` would add up every entry of the array, rather than summing within each row. Specifying `axis=1` fixes this.

In [28]:

```
tvds_fast = np.sum(np.abs(eth_draws - eth['California'].to_numpy()), axis=1) / 2
tvds_fast
```

Out[28]:

array([0.0029 , 0.0044 , 0.00376667, ..., 0.0074 , 0.00326667, 0.00436667])

In [29]:

```
pd.Series(tvds_fast).plot(kind='hist',
                          density=True,
                          ec='w',
                          title='Simulated TVDs (without using a loop)');
plt.axvline(x=observed_tvd, color='red', linewidth=2);
```

To assess whether an "observed sample" was drawn randomly from a known categorical distribution:

- Use the TVD as the test statistic because it measures the distance between two categorical distributions.
- Sample at random from the population. Compute the TVD between each random sample and the known distribution to get an idea for what reasonable deviations from the eligible pool look like. Repeat this process many, many times.
- Compare:
- the empirical distribution of TVDs, with
- the observed TVD from the sample.
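The steps above can be packaged into a single helper. This is a sketch of my own (function names are mine, not from the lecture), applied to the California/UCSD distributions from earlier:

```python
import numpy as np

def tvd(dist1, dist2):
    '''Total variation distance between two aligned categorical distributions.'''
    return np.abs(np.asarray(dist1) - np.asarray(dist2)).sum() / 2

def tvd_test(pop_dist, observed_dist, sample_size, num_reps=10_000, seed=None):
    '''Approximate p-value for the null hypothesis that observed_dist
    came from a random sample of sample_size drawn from pop_dist.'''
    rng = np.random.default_rng(seed)
    samples = rng.multinomial(sample_size, pop_dist, size=num_reps) / sample_size
    simulated = np.abs(samples - pop_dist).sum(axis=1) / 2  # one TVD per row
    return (simulated >= tvd(pop_dist, observed_dist)).mean()

# The California vs. UCSD example from this lecture:
california = np.array([0.15, 0.05, 0.39, 0.35, 0.06])
ucsd = np.array([0.51, 0.02, 0.16, 0.20, 0.11])
p = tvd_test(california, ucsd, sample_size=30_000, seed=0)  # essentially 0
```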

In [30]:

```
import seaborn as sns
```

In [31]:

```
penguins = sns.load_dataset('penguins').dropna()
penguins.head()
```

Out[31]:

|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---------|--------|----------------|---------------|-------------------|-------------|-----|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |

We will learn about the `groupby` method next week.

In [32]:

```
penguins.groupby('island')['bill_length_mm'].agg(['mean', 'count'])
```

Out[32]:

| island | mean | count |
|--------|------|-------|
| Biscoe | 45.248466 | 163 |
| Dream | 44.221951 | 123 |
| Torgersen | 39.038298 | 47 |

It appears that penguins on Torgersen Island have shorter bills on average than penguins on other islands. Could this have happened due to random chance?

- **Null hypothesis:** Island and bill length **are not** related – the low average bill length of Torgersen Island penguins is due to chance alone.
  - In other words, if we picked 47 penguins randomly from the population of 333 penguins, it would be reasonable to see an average this low.
- **Alternative hypothesis:** Island and bill length **are** related – the low average bill length of Torgersen Island penguins is not due to chance alone.
- The null hypothesis states that the 47 bill lengths of Torgersen Island penguins were drawn uniformly at random from the 333 bill lengths in the population.
  - That is, if we repeatedly sampled groups of 47 penguins from the population and computed their average bill length, it would not be uncommon to see an average bill length this low.
- **Plan:** Repeatedly sample (without replacement) 47 penguins from the population, **compute their average bill length**, and see where the Torgersen Island average bill length lies in this distribution.
  - Average bill length is our **test statistic**.
  - This is not a test statistic we've used in this lecture yet (and this is what separates this example from the previous examples).

In [33]:

```
num_reps
```

Out[33]:

10000

In [34]:

```
averages = []
for i in np.arange(num_reps):
    random_sample = penguins.sample(47)
    new_average = random_sample['bill_length_mm'].mean()
    averages.append(new_average)
```

In [35]:

```
pd.Series(averages).plot(kind='hist',
                         density=True,
                         bins=30,
                         ec='w',
                         title='Average Bill Lengths in Samples of Size 47');
observed_average = penguins.groupby('island')['bill_length_mm'].mean().loc['Torgersen']
plt.axvline(x=observed_average, color='red', linewidth=2);
```

It doesn't look like the average bill length of Torgersen Island penguins came from the null distribution of average bill lengths.

Again (again), we can get rid of the loop by using the `size` argument! (One caveat: `np.random.choice` samples with replacement by default, whereas `penguins.sample(47)` samples without replacement; with 333 penguins and samples of size 47, the two are close.)

In [36]:

```
np.random.choice(penguins['bill_length_mm'], size=(5, 47))
```

Out[36]:

array([[35.7, 51. , 50.8, 53.5, 43.5, 38.1, 42.3, 32.1, 36.4, 42.5, 49.3, 35. , 42.2, 40.3, 42.9, 51.4, 35.2, 47.5, 43.6, 45.8, 45.7, 48.7, 45.4, 51.3, 40.8, 55.9, 46. , 51.3, 49.5, 51.5, 50. , 49. , 40.8, 45.7, 46.8, 50.5, 49.6, 43.5, 38.6, 46.9, 51.1, 49.5, 41.1, 50.5, 49.5, 42.8, 50.4], [46.5, 49. , 45.8, 48.1, 49. , 40.9, 41.3, 41.1, 38.2, 39. , 38.9, 39.2, 52. , 48.5, 38.1, 52.1, 44.9, 36.9, 40.7, 36. , 46.5, 47.5, 41.4, 37.6, 46.7, 41.8, 38.8, 50.1, 41.5, 50.7, 47.2, 37.7, 37. , 35.9, 46. , 32.1, 42.4, 43.5, 49.9, 39.2, 39. , 41.7, 50.2, 48.2, 43.5, 51.1, 46.4], [51.5, 38.6, 48.7, 36.2, 40.2, 43.5, 46.4, 48.8, 52.8, 41.4, 52.2, 46.1, 38.2, 47. , 41.1, 40.9, 47.6, 46.2, 50.2, 42.2, 39.6, 49.1, 46.7, 49.5, 45.1, 50.2, 49.3, 49.5, 46. , 50.8, 37.6, 43.3, 38.3, 46.5, 50.7, 36.3, 39.6, 54.2, 49. , 46.7, 39.7, 41.1, 45.2, 37. , 44. , 41.1, 54.2], [36.7, 51.3, 41.1, 36. , 42.9, 41.6, 46.8, 47. , 45.5, 39.2, 36.6, 45.8, 46. , 40.6, 43.5, 36.7, 49.5, 43.2, 50.8, 43.8, 46.5, 46. , 51.1, 37.9, 50. , 35.1, 50.1, 47.7, 39.5, 49. , 50.6, 51.4, 42.7, 42.5, 35.1, 46.2, 48.4, 45.9, 46.4, 36.6, 41.8, 37.2, 48.4, 42.3, 45.5, 51.5, 48.8], [42.2, 53.4, 45.5, 49. , 46.5, 45.5, 35.7, 50.2, 37. , 47.7, 37.9, 43.3, 36.6, 38.1, 47.2, 47.2, 45.5, 36. , 48.7, 52.5, 42.2, 34.5, 48.7, 36. , 42.2, 39.6, 44.1, 34.6, 50. , 50.8, 39.3, 38.2, 43.5, 37.6, 37. , 45.2, 42.5, 41.1, 36.3, 46.8, 50.5, 39.6, 45.2, 50. , 46.2, 35.7, 39.2]])

In [37]:

```
averages_fast = np.random.choice(penguins['bill_length_mm'], size=(num_reps, 47)).mean(axis=1)
averages_fast
```

Out[37]:

array([44.41276596, 44.35319149, 44.73404255, ..., 43.35957447, 43.25957447, 44.87021277])

In [38]:

```
pd.Series(averages_fast).plot(kind='hist',
                              density=True,
                              bins=30,
                              ec='w',
                              title='Average Bill Lengths in Samples of Size 47 (without using a loop)');
observed_average = penguins.groupby('island')['bill_length_mm'].mean().loc['Torgersen']
plt.axvline(x=observed_average, color='red', linewidth=2);
```

We get the same result, but much quicker!

Faced with a question about the data raised by an observation...

- Carefully pose the question as a testable "yes or no" hypothesis.
- Decide on a **test statistic** that helps differentiate between instances that would affirm or reject the hypothesis.
- Create a probability model for the data generating process that reflects the "known behavior" of the process.
- Simulate the data generating process using this probability model (the "**null hypothesis**").
- Assess whether the observation is consistent with the simulations by computing a **p-value**.
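This recipe can be captured in a small helper. Below is a sketch of my own (assuming, as this lecture recommends, a test statistic whose large values favor the alternative), applied to the coin example:

```python
import numpy as np

def simulation_test(observed_stat, simulate_stat, num_reps=10_000):
    '''Approximate p-value: the fraction of statistics simulated under the
    null that are at least as large as the observed statistic.'''
    simulated = np.array([simulate_stat() for _ in range(num_reps)])
    return (simulated >= observed_stat).mean()

# The coin example from this lecture: 68 heads in 114 flips of a fair coin.
rng = np.random.default_rng(42)
p = simulation_test(68, lambda: rng.binomial(114, 0.5))  # roughly 0.025
```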


We looked at three key hypothesis testing examples:

- Comparing single-category counts/proportions.
  - Coin flipping.
- Comparing categorical distributions.
  - California ethnicities vs. UCSD ethnicities.
- Comparing a subgroup statistic to a population parameter.
  - Torgersen Island average bill lengths vs. all average bill lengths.
**Next time:** Data granularity and the `groupby` DataFrame method.