from dsc80_utils import *
Lecture 6 – Hypothesis Testing¶
DSC 80, Spring 2024¶
In case you need a review from DSC 10, I've made a Pre-Lecture Review for this lecture.
Announcements 📣¶
- Project 1 is due tomorrow, April 19th.
- Lab 3 is due on Wed, Apr 24th.
- DSC undergraduate town hall is Monday, Apr 22 1-3pm in the HDSI 1st floor MPR.
- Q&A with faculty about undergrad program + mixer with faculty afterwards.
- Cookies and snacks provided!
Agenda 📆¶
- Data scope.
- Overview of hypothesis testing.
- Example: Total variation distance.
- Permutation testing.
- Example: Birth weight and smoking 🚬.
- Example (that you'll read on your own): Permutation testing meets TVD.
Why are we learning hypothesis testing again?¶
You may say,
Didn't we already learn this in DSC 10?
Yes, but:
It's an important concept, but one that's often confusing the first time you learn it.
In addition, in order to properly handle missing values (next lecture), we need to learn how to identify different missingness mechanisms. Doing so requires performing a hypothesis test.
Data scope¶
Where are we in the data science lifecycle?¶
Hypothesis testing is a tool for helping us understand the world (some population), given our understanding of the data (some sample).
Data scope¶
Statistical inference: The practice of drawing conclusions about a population, given a sample.
Target population: All elements of the population you ultimately want to draw conclusions about.
Access frame: All elements that are accessible to you for measurement and observation.
Sample: The subset of the access frame that you actually measured / observed.
Example: Wikipedia awards¶
A 2012 paper asked:
If we give awards to Wikipedia contributors, will they contribute more?
To test this question, they took the top 1% of all Wikipedia contributors, excluded those who had already received an award, and then took a random sample of 200 contributors.
Example: Who will win the election?¶
In the 2016 US Presidential Election, most pollsters predicted Clinton to win over Trump, even though Trump ultimately won.
To poll, they randomly selected potential voters and asked them a question over the phone.
🔑 Key Idea: Random samples look like the access frame they were sampled from!¶
This enables statistical inference!
But keep in mind, random samples look like their access frame, which can be different from the population itself.
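As a quick illustration of this idea, here's a minimal simulation. The 70/30 access frame below is made up purely for illustration: a large random sample drawn from it ends up with roughly the same group proportions as the frame itself.
import numpy as np  # already available via dsc80_utils

# Hypothetical access frame: 70% of elements in group 'A', 30% in group 'B'.
frame = np.array(['A'] * 70 + ['B'] * 30)

# Draw a large random sample from the frame.
sample = np.random.choice(frame, 10_000)

# The sample's proportion of 'A's is close to 0.7: it mirrors the frame,
# regardless of what the broader population looks like.
(sample == 'A').mean()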
Sampling in practice¶
In DSC 10, you used a few key functions/methods to draw samples from populations.
- To draw samples from a known sequence (e.g. an array or Series), you used `np.random.choice`.
names = np.load(Path('data') / 'names.npy', allow_pickle=True)
# By default, the sampling is done WITH replacement.
np.random.choice(names, 10)
array(['Lin', 'Yeogyeong', 'Subika', 'Jesse', 'Aakash', 'Chenlong', 'David', 'Ethan', 'Seanna', 'Ailinna'], dtype=object)
# To sample WITHOUT replacement, set replace=False.
# This is known as "simple random sampling."
np.random.choice(names, 10, replace=False)
array(['Kening', 'Ethan', 'Diya', 'Seanna', 'Colin', 'Zening', 'Stephanie', 'Jiaye', 'Dylan', 'Jessica'], dtype=object)
- The DataFrame `.sample` method also allowed you to draw samples from a known sequence.
# Samples WITHOUT replacement by default (the opposite of np.random.choice).
pd.DataFrame(names, columns=['name']).sample(10)
|     | name   |
|-----|--------|
| 105 | Yiran  |
| 48  | Kailey |
| 76  | Niha   |
| ... | ...    |
| 21  | David  |
| 24  | Diego  |
| 26  | Dylan  |

10 rows × 1 columns
- To sample from a categorical distribution, you used `np.random.multinomial`. Note that in the cell below, we don't see `array([50, 50])` every time, and that's due to randomness!
# Draws 100 elements from a population in which 50% are group 0 and 50% are group 1.
# This sampling is done WITH replacement.
# In other words, each sampled element has a 50% chance of being group 0 and a 50% chance of being group 1.
np.random.multinomial(100, [0.5, 0.5])
array([50, 50])
Overview of hypothesis testing¶
What problem does hypothesis testing solve?¶
Suppose we've performed an experiment, or identified something interesting in our data.
Say we've created a new vaccine.
To assess its efficacy, we give one group the vaccine and another a placebo.
We notice that the flu rate among those who received the vaccine is lower than among those who received the placebo (i.e. didn't receive the vaccine).
One possibility: the vaccine doesn't actually do anything, and by chance, those with the vaccine happened to have a lower flu rate.
Another possibility: receiving the vaccine made a difference – the flu rate among those who received the vaccine is lower than we'd expect due to random chance.
Hypothesis testing allows us to determine whether an observation is "significant."
Why hypothesis testing is difficult to learn¶
It's like "[proof by contradiction](https://brilliant.org/wiki/contradiction/#:~:text=Proof%20by%20contradiction%20(also%20known,the%20opposite%20must%20be%20true.)."
If I want to show that my vaccine works, I consider a world where it doesn't (null hypothesis).
Then, I show that under the null hypothesis my data would be very unlikely.
Why go through these mental hurdles? Showing something is not true is usually easier than showing something is true!
The hypothesis testing "recipe"¶
Faced with a question about the data raised by an observation...
1. Decide on null and alternative hypotheses.
    - The null hypothesis should be a well-defined probability model that reflects the baseline you want to compare against.
    - The alternative hypothesis should be the "alternate reality" that you suspect may be true.
2. Decide on a test statistic, such that a large observed statistic would point to one hypothesis and a small observed statistic would point to the other.
3. Compute an empirical distribution of the test statistic under the null by drawing samples from the null hypothesis' probability model.
4. Assess whether the observed test statistic is consistent with the empirical distribution of the test statistic by computing a p-value. (A minimal end-to-end sketch of these steps follows.)
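To make these steps concrete, here's a minimal sketch of the whole recipe on a toy question (the numbers are made up): a coin landed heads 60 times in 100 flips. Is the coin fair?
# Hypothetical observation: 60 heads in 100 flips.
observed = 60

# Steps 1 and 2: the null hypothesis is "the coin is fair"; the test
# statistic is the number of heads, where a large value points to the
# alternative hypothesis.

# Step 3: simulate the test statistic under the null many times.
simulated = np.random.binomial(n=100, p=0.5, size=100_000)

# Step 4: the p-value is the proportion of simulated statistics at least
# as extreme as the observed one.
p_value = (simulated >= observed).mean()
p_value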
Question 🤔 (Answer at q.dsc80.com)
Complete Problem 10 from the Spring 2023 DSC 10 Final Exam with a neighbor. Submit your answers to q.dsc80.com, then reveal the answers.
Example: Total variation distance¶
eth = pd.DataFrame(
[['Asian', 0.15, 0.51],
['Black', 0.05, 0.02],
['Latino', 0.39, 0.16],
['White', 0.35, 0.2],
['Other', 0.06, 0.11]],
columns=['Ethnicity', 'California', 'UCSD']
).set_index('Ethnicity')
eth
| Ethnicity | California | UCSD |
|-----------|------------|------|
| Asian     | 0.15       | 0.51 |
| Black     | 0.05       | 0.02 |
| Latino    | 0.39       | 0.16 |
| White     | 0.35       | 0.20 |
| Other     | 0.06       | 0.11 |
The two distributions above are clearly different.
One possibility: UCSD students do look like a random sample of California residents, and the distributions above look different purely due to random chance.
Another possibility: UCSD students don't look like a random sample of California residents, because the distributions above look too different.
Is the difference between the two distributions significant?¶
Let's establish our hypotheses.
- Null Hypothesis: UCSD students were selected at random from the population of California residents.
- Alternative Hypothesis: UCSD students were not selected at random from the population of California residents.
- Observation: Ethnic distribution of UCSD students.
- Test Statistic: We need a way of quantifying how different two categorical distributions are.
eth.plot(kind='barh', title='Ethnic Distribution of California and UCSD', barmode='group')
How can we summarize the difference, or distance, between these two distributions using just a single number?
Total variation distance¶
The total variation distance (TVD) is a test statistic that describes the distance between two categorical distributions.
If $A = [a_1, a_2, ..., a_k]$ and $B = [b_1, b_2, ..., b_k]$ are both categorical distributions, then the TVD between $A$ and $B$ is
$$\text{TVD}(A, B) = \frac{1}{2} \sum_{i = 1}^k \big|a_i - b_i\big|$$

Let's compute the TVD between UCSD's ethnic distribution and California's ethnic distribution. We could define a function to do this (and you can use this in assignments):
def tvd(dist1, dist2):
return np.abs(dist1 - dist2).sum() / 2
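As a check, plugging the two columns of eth into the formula by hand gives

$$\frac{1}{2} \big( |0.15 - 0.51| + |0.05 - 0.02| + |0.39 - 0.16| + |0.35 - 0.20| + |0.06 - 0.11| \big) = \frac{0.82}{2} = 0.41$$

and the function returns the same value:
# Matches the by-hand computation above.
tvd(eth['California'], eth['UCSD'])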
But let's try to work on the `eth` DataFrame directly, using the `diff` method.
# The diff method finds the differences of consecutive elements in a Series.
pd.Series([4, 5, -2]).diff()
0    NaN
1    1.0
2   -7.0
dtype: float64
observed_tvd = eth.diff(axis=1).abs().sum().iloc[1] / 2
observed_tvd
0.41000000000000003
The issue is we don't know whether this is a large value or a small value – we don't know where it lies in the distribution of TVDs under the null.
The plan¶
To conduct our hypothesis test, we will:
1. Repeatedly generate samples of size 30,000 (the number of UCSD students) from the ethnic distribution of all of California.
2. Each time, compute the TVD between the simulated distribution and California's distribution.
    - This will generate an empirical distribution of TVDs, under the null.
3. Finally, determine whether the observed TVD (0.41) is consistent with the empirical distribution of TVDs.
Generating one random sample¶
Again, to sample from a categorical distribution, we use `np.random.multinomial`.
Important: We must sample from the "population" distribution here, which is the ethnic distribution of everyone in California.
# Number of students at UCSD in this example.
N_STUDENTS = 30_000
eth['California']
Ethnicity
Asian     0.15
Black     0.05
Latino    0.39
White     0.35
Other     0.06
Name: California, dtype: float64
np.random.multinomial(N_STUDENTS, eth['California'])
array([ 4446, 1517, 11713, 10613, 1711])
np.random.multinomial(N_STUDENTS, eth['California']) / N_STUDENTS
array([0.15, 0.05, 0.39, 0.35, 0.06])
Generating many random samples and computing TVDs, without a for-loop¶
We could write a `for`-loop to repeat the process on the previous slide (and you can in labs and projects). However, the Pre-Lecture Review told us about the `size` argument in `np.random.multinomial`, so let's use that here.
eth_draws = np.random.multinomial(N_STUDENTS, eth['California'], size=100_000) / N_STUDENTS
eth_draws
array([[0.15, 0.05, 0.39, 0.35, 0.06], [0.15, 0.05, 0.39, 0.35, 0.06], [0.15, 0.05, 0.39, 0.35, 0.06], ..., [0.15, 0.05, 0.39, 0.35, 0.06], [0.15, 0.05, 0.39, 0.35, 0.06], [0.15, 0.05, 0.39, 0.35, 0.06]])
eth_draws.shape
(100000, 5)
Notice that each row of `eth_draws` sums to 1, because each row is a simulated categorical distribution.
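A quick way to sanity-check that claim:
# Each row of eth_draws is a simulated distribution, so each should sum to 1.
np.isclose(eth_draws.sum(axis=1), 1).all()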
# The values here appear rounded.
tvds = np.abs(eth_draws - eth['California'].to_numpy()).sum(axis=1) / 2
tvds
array([0., 0., 0., ..., 0., 0., 0.])
Visualizing the empirical distribution of the test statistic¶
observed_tvd
0.41000000000000003
fig = px.histogram(pd.DataFrame(tvds), x=0, nbins=20, histnorm='probability',
title='Empirical Distribution of the TVD')
fig
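To finish the last step of the recipe, we'd compute a p-value: the proportion of simulated TVDs that are at least as large as the observed TVD of 0.41. A minimal sketch:
# p-value: the fraction of TVDs simulated under the null that are at least
# as extreme as the observed TVD.
p_value = (tvds >= observed_tvd).mean()
p_value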