from dsc80_utils import *
📣 Announcements 📣¶
- Good job on Project 1!
- Lab 3 due on Mon.
📆 Agenda¶
- Data scope
- What is hypothesis testing? (Review from DSC 10)
- Why is hypothesis testing confusing to learn?
- Hypothesis testing examples
- Coin flipping
- Total variation distance
- Permutation testing
- Student-submitted hypothesis testing questions
Data Scope¶
- Target population: All elements of the population you ultimately want to draw conclusions about.
- Access frame: All elements that are accessible to you for measurement and observation.
- Sample: Subset of the access frame that you observed / measured.
Example: Wikipedia Awards¶
- A 2012 paper asked: if we give awards to Wikipedia contributors, will they contribute more?
- Took the top 1% of Wikipedia contributors, excluded people who had already received an award, then took a random sample of 200 contributors.
Example: Who will win the election?¶
- 2016 US Presidential Election: most pollsters predicted Clinton to win over Trump.
- Randomly selected people and asked them a question over the phone.
🔑 Key Idea: Random samples look like the access frame they were sampled from¶
- This enables statistical inference!
- But keep in mind, random samples look like their access frame, which can be different from the population itself.
What is Hypothesis Testing? (Review from DSC 10)¶
- One common way to use your sample to draw conclusions about your population (Understanding the World).
- Other common method is to use confidence intervals.
The Basics¶
- You ran an experiment, or found something interesting in your data.
- For example, you created a new vaccine and ran an experiment comparing it to a placebo.
- You want to "prove" that your treatment works, or that the pattern in your data is real.
- So you want to compare your data against a reasonable baseline.
- Hypothesis testing: a way to quantitatively describe how "different" your data are from a baseline.
- In DSC 10, $p < 0.05$ meant that the data look different enough from the baseline to act on.
Hypothesis Testing Setup¶
Why is Hypothesis Testing Hard to Learn?¶
- You'll probably be confused this time, too!
Philosophical reasons:
- It's like "proof by contradiction":
- If I want to show that my vaccine works, I consider a world where it doesn't (null hypothesis).
- Then, I "attack the baseline" by showing that under the null hypothesis my data would be very unlikely.
- Showing something is not true is a lot easier than showing something is true!
Technical reasons:
- Several choices to make:
- What a good null hypothesis is.
- What test statistic to use.
- How to simulate more samples from the null population.
- Whether to look at the left tail or the right tail of the sampling distribution.
- Good news: almost all hypothesis tests follow this format!
Example: Coin flipping¶
Coin flipping¶
Suppose that we flipped a coin 100 times, and we saw 59 heads. We want to show that the coin is biased in favor of heads:
Observation: We flipped a coin 100 times, and saw 59 heads and 41 tails.
Null Hypothesis: The coin is fair.
Alternative Hypothesis: The coin is biased in favor of heads.
Test Statistic: Number of heads, $N_H$.
Generating the null distribution¶
Now that we've chosen a test statistic, we need to generate the distribution of the test statistic under the assumption the null hypothesis is true, i.e. the null distribution.
This distribution will give us, for instance:
- The probability of seeing 4 heads in 100 flips of a fair coin.
- The probability of seeing at most 46 heads in 100 flips of a fair coin.
- The probability of seeing at least 59 heads in 100 flips of a fair coin.
Generating the null distribution, using math¶
The number of heads in 100 flips of a fair coin follows the $\text{Binomial(100, 0.5)}$ distribution, in which
$$P(\text{# heads} = k) = {100 \choose k} (0.5)^k (1-0.5)^{100-k} = {100 \choose k} 0.5^{100}$$
from scipy.special import comb

# Probability of seeing exactly k heads in 100 flips of a fair coin.
def p_k_heads(k):
    return comb(100, k) * (0.5) ** 100
The probability that we see at least 59 heads is then:
sum([p_k_heads(k) for k in range(59, 101)])
0.04431304005703377
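As a sanity check (not part of the original computation), scipy.stats implements the binomial distribution directly, and its survival function gives the same tail probability:
from scipy import stats

# P(# heads >= 59) = P(X > 58) for X ~ Binomial(100, 0.5).
stats.binom.sf(58, 100, 0.5)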
Let's look at this distribution visually.
# Also, this line tells pandas to generate plotly plots by default!
pd.options.plotting.backend = 'plotly'
plot_df = pd.DataFrame().assign(k = range(101))
plot_df['p_k'] = p_k_heads(plot_df['k'])
plot_df['color'] = plot_df['k'].apply(lambda k: 'orange' if k >= 59 else 'blue')
fig = plot_df.plot(kind='bar', x='k', y='p_k', color='color', width=1000)
fig
# fig.add_annotation(text='This orange area is the p-value!', x=77, y=0.008, showarrow=False)
Making a decision¶
We saw that, in 100 flips of a fair coin, $P(\text{# heads} \geq 59)$ is only ~4.4%.
This is quite low – it suggests that our observed result is quite unlikely under the null.
As such, we will reject the null hypothesis – our observation is not consistent with the hypothesis that the coin is fair.
The null still may be true – it's possible that the coin we flipped was fair, and we just happened to see a rare result. For the same reason, we also cannot "accept" the alternative.
This probability – the probability of seeing a result at least as extreme as the observed, under the null hypothesis – is called the p-value.
- If the p-value is below a pre-defined cutoff (often 5%), we reject the null.
- Otherwise, we fail to reject the null.
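Concretely, the decision rule is a single comparison. A tiny illustration, reusing the exact p-value computed earlier:
# Compare the p-value to the 5% cutoff; True means we reject the null.
p_value = sum([p_k_heads(k) for k in range(59, 101)])
p_value < 0.05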
⚠️ We can't "accept" the null!¶
Note that we are very careful in saying that we either reject the null or fail to reject the null.
Just because we fail to reject the null, it doesn't mean the null is true – we cannot "accept" it.
Example:
- Suppose there is a coin that is truly biased towards heads, with probability 0.55.
- We flip it 10 times and see 5 heads and 5 tails.
- If we conduct a hypothesis test where the null is that the coin is fair, we will fail to reject the null.
- But the null isn't true.
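To see how often this happens, here's a quick simulation of our own (the 0.55 heads probability and 10 flips come from the example above; the 100,000 repetitions and the one-sided binomial p-value are our choices):
from scipy import stats

# Flip a coin with P(heads) = 0.55 ten times, 100,000 times over.
biased_heads = np.random.binomial(10, 0.55, size=100_000)

# One-sided p-value for each experiment: P(# heads >= observed) under Binomial(10, 0.5).
p_values = stats.binom.sf(biased_heads - 1, 10, 0.5)

# Fraction of experiments in which we fail to reject the fair-coin null at the 5% level,
# even though the null is false.
(p_values >= 0.05).mean()
With only 10 flips, we fail to reject the vast majority of the time, even though the coin really is biased.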
In our framework:¶
Generating the null distribution, using simulation¶
In the most recent example, we computed the true probability distribution of the test statistic under the null hypothesis.
We could only do this because we know that the number of heads in $N$ flips of a fair coin follows the $\text{Binomial}(N, 0.5)$ distribution.
Often, we'll pick test statistics for which we don't know the true probability distribution. In such cases, we'll have to simulate, as we did in DSC 10.
Simulations provide us with empirical distributions of test statistics; if we simulate with a large (>= 10,000) number of repetitions, the empirical distribution of the test statistic should look similar to the true probability distribution of the test statistic.
Generating the null distribution, using simulation¶
First, let's figure out how to perform one instance of the experiment – that is, how to flip a coin 100 times, once. Recall that to sample from a categorical distribution, we use np.random.multinomial.
# Flipping a fair coin 100 times.
# Interpret the result as [Heads, Tails].
np.random.multinomial(100, [0.5, 0.5])
array([53, 47])
Then, we can repeat it a large number of times.
# 100,000 times, we want to flip a coin 100 times.
results = []
for _ in range(100_000):
    num_heads = np.random.multinomial(100, [0.5, 0.5])[0]
    results.append(num_heads)
Each entry in results is the number of heads in 100 simulated coin flips.
results[:10]
[50, 44, 48, 47, 47, 54, 38, 47, 48, 48]
Visualizing the empirical distribution of the test statistic¶
fig = px.histogram(pd.DataFrame(results), x=0, nbins=50, histnorm='probability',
title='Empirical Distribution of # Heads in 100 Flips of a Fair Coin')
fig.add_vline(x=59, line_color='red')
fig.update_layout(xaxis_range=[0, 100])
Again, we can compute the p-value, which is the probability of seeing a result as or more extreme than the observed, under the null.
(np.array(results) >= 59).mean()
0.04391
Note that this number is close, but not identical, to the true p-value we found before. That's because we computed this p-value using a simulation, and hence an approximation.
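How close should we expect the two to be? A rough gauge (an aside, not in the original) is the standard error $\sqrt{\hat{p}(1 - \hat{p}) / n}$ of a proportion estimated from $n = 100{,}000$ simulations:
# Rough standard error of the simulated p-value.
p_hat = (np.array(results) >= 59).mean()
np.sqrt(p_hat * (1 - p_hat) / 100_000)
This is on the order of 0.0006, so a gap of about 0.0004 between the simulated and exact p-values is unsurprising.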
In our framework:¶
Reflection¶
Can we make things faster? 🏃¶
A mantra so far in this course has been to avoid for-loops whenever possible. That applies here, too.
np.random.multinomial (and np.random.choice) accepts a size argument. By providing size=100_000, we can tell numpy to simulate 100 flips of a fair coin, 100,000 times, without needing a for-loop!
# An array with 100000 rows and 2 columns.
np.random.multinomial(100, [0.5, 0.5], size=100_000)
array([[55, 45],
       [45, 55],
       [48, 52],
       ...,
       [53, 47],
       [50, 50],
       [50, 50]])
# Just the first column of the above array. Note the iloc-like syntax.
np.random.multinomial(100, [0.5, 0.5], size=100_000)[:, 0]
array([45, 55, 38, ..., 53, 47, 50])
%%time
faster_results = np.random.multinomial(100, [0.5, 0.5], size=100_000)[:, 0]
CPU times: user 13 ms, sys: 1.67 ms, total: 14.7 ms Wall time: 12.6 ms
The above approach is orders of magnitude faster than the for-loop approach! With that said, you are still allowed to use for-loops for hypothesis (and permutation) tests on assignments.
%%time
# 100,000 times, we want to flip a coin 100 times.
results = []
for _ in range(100_000):
    num_heads = np.random.multinomial(100, [0.5, 0.5])[0]
    results.append(num_heads)
CPU times: user 1.61 s, sys: 19.5 ms, total: 1.63 s Wall time: 1.62 s
Choosing alternative hypotheses and test statistics¶
The alternative hypothesis we chose was the coin is biased in favor of heads, and the test statistic we chose was the number of heads, $N_H$.
We could've also chosen one of the following options (a quick check is sketched after this list); each of them has the quality that large values point to one hypothesis, and small values point to the other:
- $\frac{N_H}{100}$, the proportion of heads.
- $N_H - 50$, the difference from the expected number of heads.
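Since each of these is just $N_H$ shifted or scaled, thresholding at the corresponding observed value gives exactly the same p-value. A quick check, reusing the simulated results from earlier (the array name is ours):
# All three statistics are monotonic transformations of N_H, so the p-values agree.
results_arr = np.array(results)

p_count = (results_arr >= 59).mean()           # N_H
p_prop = (results_arr / 100 >= 0.59).mean()    # proportion of heads
p_diff = (results_arr - 50 >= 9).mean()        # difference from expected

p_count, p_prop, p_diff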
What if our alternative hypothesis was the coin is biased (either towards heads or tails)?
Absolute test statistics¶
For the alternative hypothesis "the coin is biased", one test statistic we could use is $|N_H - \frac{N}{2}|$, the absolute difference from the expected number of heads.
If this test statistic is large, it means that there were many more heads than expected, or many fewer heads than expected. If this test statistic is small, it means that the number of heads was close to expected.
For instance, suppose we flip 100 coins, and I tell you the absolute difference from the expected number of heads is 20.
Then, either we flipped 70 heads or 30 heads.
If our alternative hypothesis is that the coin was biased, then it doesn't matter in which direction it was biased, and this test statistic works.
But if our alternative hypothesis is that the coin was biased towards heads, then this is not helpful, because we don't know whether there were 70 heads (evidence for the alternative) or 30 heads (not evidence for the alternative).
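Here's what the two-sided test would look like with the absolute statistic, reusing the simulated flips from earlier (a sketch; the observed 59 heads comes from the running example):
# Absolute difference from the expected number of heads, for each simulated experiment.
abs_stats = np.abs(np.array(results) - 50)
observed_abs = abs(59 - 50)

# p-value for the alternative "the coin is biased (in either direction)".
(abs_stats >= observed_abs).mean()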
Important¶
We'd like to choose a test statistic such that large values of the test statistic correspond to one hypothesis, and small values correspond to the other.
In other words, we'll try to avoid "two-tailed tests". Rough rule of thumb:
If the alternative hypothesis is "A > B", then the test statistic should measure differences and should not contain an absolute value.
If the alternative hypothesis is "A and B are different", then the test statistic should measure distances and should contain an absolute value.
Example: Total variation distance¶
eth = pd.DataFrame(
    [['Asian', 0.15, 0.51],
     ['Black', 0.05, 0.02],
     ['Latino', 0.39, 0.16],
     ['White', 0.35, 0.2],
     ['Other', 0.06, 0.11]],
    columns=['Ethnicity', 'California', 'UCSD']
).set_index('Ethnicity')
eth
| Ethnicity | California | UCSD |
|---|---|---|
| Asian | 0.15 | 0.51 |
| Black | 0.05 | 0.02 |
| Latino | 0.39 | 0.16 |
| White | 0.35 | 0.20 |
| Other | 0.06 | 0.11 |
- We want to decide whether UCSD students were drawn at random from the state of California.
- The two categorical distributions above are clearly different. But how different are they?
Is the difference between the two distributions significant?¶
Let's establish our hypotheses.
- Null Hypothesis: UCSD students were selected at random from the population of California residents.
- Alternative Hypothesis: UCSD students were not selected at random from the population of California residents.
- Observation: Ethnic distribution of UCSD students.
- Test Statistic: We need a way of quantifying how different two categorical distributions are.
eth.plot(kind='barh', title='Ethnic Distribution of California and UCSD', barmode='group')
Total variation distance¶
The total variation distance (TVD) is a test statistic that describes the distance between two categorical distributions.
If $A = [a_1, a_2, ..., a_k]$ and $B = [b_1, b_2, ..., b_k]$ are both categorical distributions, then the TVD between $A$ and $B$ is
$$\text{TVD}(A, B) = \frac{1}{2} \sum_{i = 1}^k |a_i - b_i|$$
def total_variation_distance(dist1, dist2):
    '''Given two categorical distributions,
    both sorted with same categories, calculates the TVD'''
    return np.sum(np.abs(dist1 - dist2)) / 2
Let's compute the TVD between UCSD's ethnic distribution and California's ethnic distribution.
observed_tvd = total_variation_distance(eth['UCSD'], eth['California'])
observed_tvd
0.41000000000000003
The issue is we don't know whether this is a large value or a small value – we don't know where it lies in the distribution of TVDs under the null.
The plan¶
To conduct our hypothesis test, we will:
- Repeatedly generate samples of size 30,000 (number of UCSD students) from the ethnic distribution of all of California.
- Each time, compute the TVD between the simulated distribution and California's distribution.
- This will generate an empirical distribution of TVDs, under the null.
- Finally, determine whether the observed TVD is consistent with the empirical distribution of TVDs.
Generating one random sample¶
Again, to sample from a categorical distribution, we use np.random.multinomial.
Important: We must sample from the "population" distribution here, which is the ethnic distribution of everyone in California.
# Number of students at UCSD in this example.
N_STUDENTS = 30_000
eth['California']
Ethnicity
Asian     0.15
Black     0.05
Latino    0.39
White     0.35
Other     0.06
Name: California, dtype: float64
np.random.multinomial(N_STUDENTS, eth['California'])
array([ 4478, 1538, 11699, 10543, 1742])
np.random.multinomial(N_STUDENTS, eth['California']) / N_STUDENTS
array([0.15, 0.05, 0.39, 0.35, 0.06])
Generating many random samples¶
We could write a for-loop to repeat the process on the previous slide (and you can in labs and projects). However, we now know about the size argument in np.random.multinomial, so let's use that here.
num_reps = 100_000
eth_draws = np.random.multinomial(N_STUDENTS, eth['California'], size=num_reps) / N_STUDENTS
eth_draws
array([[0.15, 0.05, 0.39, 0.35, 0.06],
       [0.15, 0.05, 0.39, 0.35, 0.06],
       [0.15, 0.05, 0.39, 0.35, 0.06],
       ...,
       [0.15, 0.05, 0.39, 0.35, 0.06],
       [0.15, 0.05, 0.39, 0.35, 0.06],
       [0.15, 0.05, 0.39, 0.35, 0.06]])
eth_draws.shape
(100000, 5)
Notice that each row of eth_draws sums to 1, because each row is a simulated categorical distribution.
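We can verify this directly (a quick check of our own):
# Each row of eth_draws should sum to 1, up to floating point error.
np.allclose(eth_draws.sum(axis=1), 1)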
Computing many TVDs, without a for-loop¶
One issue is that the total_variation_distance function we've defined won't work with eth_draws (unless we use a for-loop), so we'll have to compute the TVDs a different way.
tvds = np.sum(np.abs(eth_draws - eth['California'].to_numpy()), axis=1) / 2
tvds
array([0. , 0. , 0. , ..., 0.01, 0. , 0. ])
Just to make sure we did things correctly, we can compute the TVD between the first row of eth_draws and eth['California'] using our previous function.
# Note that this is the same as the first element in tvds.
total_variation_distance(eth_draws[0], eth['California'])
0.003033333333333346
Visualizing the empirical distribution of the test statistic¶
observed_tvd
0.41000000000000003
fig = px.histogram(pd.DataFrame(tvds), x=0, nbins=20, histnorm='probability',
title='Empirical Distribution of the TVD')
# fig.add_vline(x=observed_tvd, line_color='red')
fig
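To finish the test, we'd compare the observed TVD to this distribution, just as in the coin example. A sketch of that final step:
# p-value: proportion of simulated TVDs at least as large as the observed TVD of 0.41.
(tvds >= observed_tvd).mean()
Since none of the simulated TVDs come anywhere near 0.41, this comes out to 0, and we would reject the null hypothesis that UCSD students were drawn at random from the population of California.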