import pandas as pd
import numpy as np
import os
import seaborn as sns
import plotly.express as px
pd.options.plotting.backend = 'plotly'
from ipywidgets import interact
We'll look at many examples, and cover the necessary theory along the way.
"Standard" hypothesis testing helps us answer questions of the form:
I have a population distribution, and I have one sample. Does this sample look like it was drawn from the population?
Let's recap the example we saw last time.
The number of heads in 100 flips of a fair coin follows the $\text{Binomial(100, 0.5)}$ distribution, in which
$$P(\text{# heads} = k) = {100 \choose k} (0.5)^k (1-0.5)^{100-k} = {100 \choose k} 0.5^{100}$$

from scipy.special import comb
def p_k_heads(k):
    return comb(100, k) * (0.5) ** 100
The probability that we see at least 59 heads is then:
sum([p_k_heads(k) for k in range(59, 101)])
0.04431304005703377
Let's look at this distribution visually.
plot_df = pd.DataFrame().assign(k = range(101))
plot_df['p_k'] = p_k_heads(plot_df['k'])
plot_df['color'] = plot_df['k'].apply(lambda k: 'orange' if k >= 59 else 'blue')
fig = plot_df.plot(kind='bar', x='k', y='p_k', color='color', width=1000)
fig.add_annotation(text='This red area is called the p-value!', x=77, y=0.008, showarrow=False)
We saw that, in 100 flips of a fair coin, $P(\text{# heads} \geq 59)$ is only ~4.4%.
First, let's figure out how to perform one instance of the experiment – that is, how to flip 100 coins once. Recall, to sample from a categorical distribution, we use `np.random.multinomial`.
# Flipping a fair coin 100 times.
# Interpret the result as [Heads, Tails].
np.random.multinomial(100, [0.5, 0.5])
array([51, 49])
Then, we can repeat it a large number of times.
# 100,000 times, we want to flip a coin 100 times.
results = []
for _ in range(100_000):
    num_heads = np.random.multinomial(100, [0.5, 0.5])[0]
    results.append(num_heads)
Each entry in `results` is the number of heads in 100 simulated coin flips.
results[:10]
[49, 44, 52, 50, 45, 50, 51, 46, 52, 41]
fig = px.histogram(pd.DataFrame(results), x=0, nbins=50, histnorm='probability',
title='Empirical Distribution of the Number of Heads in 100 Flips of a Fair Coin')
fig.add_vline(x=59, line_color='red')
fig.update_layout(xaxis_range=[0, 100])
Again, we can compute the p-value, which is the probability of seeing a result as or more extreme than the observed, under the null.
(np.array(results) >= 59).mean()
0.04605
Note that this number is close, but not identical, to the true p-value we found before. That's because we computed this p-value using a simulation, and hence an approximation.
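As a quick side-by-side comparison (a small sketch that just reuses `p_k_heads` and `results` from above):

exact_p_value = sum([p_k_heads(k) for k in range(59, 101)])  # From the binomial formula.
simulated_p_value = (np.array(results) >= 59).mean()         # From the simulation.
exact_p_value, simulated_p_value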
A mantra so far in this course has been avoid `for`-loops whenever possible. That applies here, too.
`np.random.multinomial` (and `np.random.choice`) accepts a `size` argument. By providing `size=100_000`, we can tell `numpy` to run the 100-flip experiment 100,000 times, without needing a `for`-loop!
# An array with 100000 rows and 2 columns.
np.random.multinomial(100, [0.5, 0.5], size=100_000)
array([[54, 46],
       [56, 44],
       [57, 43],
       ...,
       [48, 52],
       [49, 51],
       [51, 49]])
# Just the first column of the above array. Note the iloc-like syntax.
np.random.multinomial(100, [0.5, 0.5], size=100_000)[:, 0]
array([47, 58, 56, ..., 47, 47, 44])
%%time
faster_results = np.random.multinomial(100, [0.5, 0.5], size=100_000)[:, 0]
CPU times: user 8.53 ms, sys: 702 µs, total: 9.23 ms Wall time: 8.34 ms
The above approach is orders of magnitude faster than the `for`-loop approach! With that said, you are still allowed to use `for`-loops for hypothesis (and permutation) tests on assignments.
%%time
# 100,000 times, we want to flip a coin 100 times.
results = []
for _ in range(100_000):
    num_heads = np.random.multinomial(100, [0.5, 0.5])[0]
    results.append(num_heads)
CPU times: user 1.1 s, sys: 6.61 ms, total: 1.1 s Wall time: 1.1 s
For the alternative hypothesis "the coin is biased", one test statistic we could use is $|N_H - \frac{N}{2}|$, the absolute difference from the expected number of heads.
We'd like to choose a test statistic such that large values of the test statistic correspond to one hypothesis, and small values correspond to the other. In other words, we'll try to avoid "two-tailed tests". Rough rule of thumb: if the alternative hypothesis is directional (e.g. "the coin is biased towards heads"), use a signed statistic like $N_H$; if it isn't (e.g. "the coin is biased"), use an absolute difference like $|N_H - \frac{N}{2}|$.
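As a minimal sketch (reusing the `faster_results` array simulated above, with the observed 59 heads from earlier), here's how we could compute this statistic and its p-value:

# |N_H - N/2| for each simulated experiment of 100 flips.
abs_diffs = np.abs(faster_results - 50)

# The same statistic for the observed sample of 59 heads.
observed_abs_diff = abs(59 - 50)

# p-value: proportion of simulated statistics as or more extreme than the observed one.
(abs_diffs >= observed_abs_diff).mean()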
eth = pd.DataFrame([['Asian', 0.15, 0.51],
['Black', 0.05, 0.02],
['Latino', 0.39, 0.16],
['White', 0.35, 0.2],
['Other', 0.06, 0.11]],
columns=['Ethnicity', 'California', 'UCSD']).set_index('Ethnicity')
eth
| Ethnicity | California | UCSD |
|---|---|---|
| Asian | 0.15 | 0.51 |
| Black | 0.05 | 0.02 |
| Latino | 0.39 | 0.16 |
| White | 0.35 | 0.20 |
| Other | 0.06 | 0.11 |
Let's establish our hypotheses.

- Null hypothesis: UCSD students were drawn at random from the ethnic distribution of California.
- Alternative hypothesis: UCSD students were not drawn at random from the ethnic distribution of California.
eth.plot(kind='barh', title='Ethnic Distribution of California and UCSD', barmode='group')
The total variation distance (TVD) is a test statistic that describes the distance between two categorical distributions.
If $A = [a_1, a_2, ..., a_k]$ and $B = [b_1, b_2, ..., b_k]$ are both categorical distributions, then the TVD between $A$ and $B$ is
$$\text{TVD}(A, B) = \frac{1}{2} \sum_{i = 1}^k |a_i - b_i|$$

def total_variation_distance(dist1, dist2):
    '''Given two categorical distributions,
    both sorted with the same categories, calculates the TVD.'''
    return np.sum(np.abs(dist1 - dist2)) / 2
Let's compute the TVD between UCSD's ethnic distribution and California's ethnic distribution.
observed_tvd = total_variation_distance(eth['UCSD'], eth['California'])
observed_tvd
0.41000000000000003
The issue is we don't know whether this is a large value or a small value – we don't know where it lies in the distribution of TVDs under the null.
To conduct our hypothesis test, we will:

- Repeatedly simulate drawing 30,000 students at random from the ethnic distribution of California.
- Compute the TVD between each simulated distribution and California's distribution.
- See where the observed TVD of 0.41 lies in this distribution of simulated TVDs.
Again, to sample from a categorical distribution, we use `np.random.multinomial`.
Important: We must sample from the "population" distribution here, which is the ethnic distribution of everyone in California.
# Number of students at UCSD in this example.
N_STUDENTS = 30_000
eth['California']
Ethnicity
Asian     0.15
Black     0.05
Latino    0.39
White     0.35
Other     0.06
Name: California, dtype: float64
np.random.multinomial(N_STUDENTS, eth['California'])
array([ 4612, 1491, 11614, 10492, 1791])
np.random.multinomial(N_STUDENTS, eth['California']) / N_STUDENTS
array([0.15396667, 0.04876667, 0.38433333, 0.3511 , 0.06183333])
We could write a `for`-loop to repeat the process on the previous slide many times (and you can in labs and projects). However, we now know about the `size` argument in `np.random.multinomial`, so let's use that here.
num_reps = 100_000
eth_draws = np.random.multinomial(N_STUDENTS, eth['California'], size=num_reps) / N_STUDENTS
eth_draws
array([[0.14826667, 0.0503    , 0.39193333, 0.3489    , 0.0606    ],
       [0.14983333, 0.05123333, 0.389     , 0.3492    , 0.06073333],
       [0.14873333, 0.04996667, 0.3886    , 0.3501    , 0.0626    ],
       ...,
       [0.1492    , 0.05046667, 0.39096667, 0.34876667, 0.0606    ],
       [0.15053333, 0.05133333, 0.39083333, 0.3466    , 0.0607    ],
       [0.14893333, 0.04913333, 0.38916667, 0.35086667, 0.0619    ]])
eth_draws.shape
(100000, 5)
Notice that each row of `eth_draws` sums to 1, because each row is a simulated categorical distribution.
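As a quick sanity check (not part of the original cells, but using only the arrays defined above):

# Each row of eth_draws is a vector of proportions, so every row should sum to 1.
np.allclose(eth_draws.sum(axis=1), 1)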
One issue is that the `total_variation_distance` function we've defined won't work directly on `eth_draws` (unless we use a `for`-loop), so we'll compute the TVDs in a vectorized way instead.
tvds = np.sum(np.abs(eth_draws - eth['California'].to_numpy()), axis=1) / 2
tvds
array([0.00283333, 0.00196667, 0.0027 , ..., 0.00203333, 0.0034 , 0.00276667])
Just to make sure we did things correctly, we can compute the TVD between the first row of `eth_draws` and `eth['California']` using our previous function.
# Note that this is the same as the first element in tvds.
total_variation_distance(eth_draws[0], eth['California'])
0.0028333333333333335
fig = px.histogram(pd.DataFrame(tvds), x=0, nbins=20, histnorm='probability',
title='Empirical Distribution of the TVD')
fig.add_vline(x=observed_tvd, line_color='red')
fig
(np.array(tvds) >= observed_tvd).mean()
0.0
No, there's not a mistake in our code! The observed TVD of 0.41 is far larger than every one of the 100,000 simulated TVDs, so our estimate of the p-value is 0.
To assess whether an "observed sample" was drawn randomly from a known categorical distribution:

- We used the TVD as our test statistic.
- We repeatedly sampled from the known ("population") distribution and computed the TVD of each simulated sample.
- We checked where the observed TVD fell in the distribution of simulated TVDs.
What would happen if, instead of `N_STUDENTS = 30_000`, we used `N_STUDENTS = 300`, `N_STUDENTS = 30`, or `N_STUDENTS = 3`? At what value of `N_STUDENTS` would we fail to reject the null (at a 0.05 p-value cutoff)?
def p_value_given_n_students(N_STUDENTS):
    eth_draws = np.random.multinomial(N_STUDENTS, eth['California'], size=num_reps) / N_STUDENTS
    tvds = np.sum(np.abs(eth_draws - eth['California'].to_numpy()), axis=1) / 2
    p_value = (tvds >= observed_tvd).mean()
    return p_value
interact(p_value_given_n_students, N_STUDENTS=(1, 300));
To fail to reject the null, our sample size (that is, the number of students at UCSD) would have to be in the single digits.
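As a rough numerical check (a sketch that simply reuses the `p_value_given_n_students` function defined above):

# Simulated p-values for a handful of hypothetical sample sizes.
for n in [3, 5, 10, 30, 300]:
    print(n, p_value_given_n_students(n))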
Consider the `penguins` dataset from a few lectures ago.
penguins = sns.load_dataset('penguins').dropna()
penguins.head()
|  | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |
penguins.groupby('island')['bill_length_mm'].agg(['mean', 'count'])
| island | mean | count |
|---|---|---|
| Biscoe | 45.248466 | 163 |
| Dream | 44.221951 | 123 |
| Torgersen | 39.038298 | 47 |
It appears that penguins on Torgersen Island have shorter bills on average than penguins on other islands.
Again, while you could do this with a `for`-loop (and you can use a `for`-loop for hypothesis tests in labs and projects), we'll use the faster `size` approach here.
Instead of using `np.random.multinomial`, which samples from a categorical distribution, we'll use `np.random.choice`, which samples from a known sequence of values.
# Draws two samples of size 47 from penguins['bill_length_mm'].
# Question: Why must we sample with replacement here (or, more specifically, in the next cell)?
np.random.choice(penguins['bill_length_mm'], size=(2, 47))
array([[36.3, 36.4, 33.1, 39.1, 38.1, 38.8, 47.8, 50.8, 47.5, 49. , 38.8, 38.6, 37.6, 36. , 42.8, 39. , 37.6, 50.8, 39. , 39.3, 50.9, 42.6, 44.9, 39.5, 37.2, 49.2, 43.2, 49.3, 50.1, 39.6, 45.1, 38.8, 55.1, 59.6, 48.4, 50.2, 46.5, 39.6, 35.1, 44. , 41.8, 43.3, 49.2, 41.1, 42.2, 36. , 45.5], [46.8, 49.5, 34.5, 48.5, 49.6, 49. , 39. , 37.8, 48.5, 49.2, 50.7, 35.3, 40.7, 40.7, 37.7, 48.2, 41.4, 46.5, 46.6, 48.5, 39.6, 43.6, 50.5, 39.5, 45.7, 39.5, 46.7, 42.2, 42.5, 48.1, 38.1, 49.7, 35.6, 46.2, 39. , 41. , 35. , 41.1, 45.2, 43.8, 45.5, 42.7, 42.6, 37.7, 40.2, 58. , 39.6]])
# Draws 100000 samples of size 47 from penguins['bill_length_mm'].
num_reps = 100_000
averages = np.random.choice(penguins['bill_length_mm'], size=(num_reps, 47)).mean(axis=1)
averages
array([43.98510638, 43.93404255, 44.61489362, ..., 44.15957447, 44.51914894, 43.60212766])
fig = px.histogram(pd.DataFrame(averages), x=0, nbins=50, histnorm='probability',
title='Empirical Distribution of the Average Bill Length in Samples of Size 47')
fig.add_vline(x=penguins.loc[penguins['island'] == 'Torgersen', 'bill_length_mm'].mean(), line_color='red')
It doesn't look like the average bill length of penguins on Torgersen Island came from the distribution of bill lengths of all penguins in our dataset.
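To quantify that impression, we can compute the simulated p-value for Torgersen's observed mean (a sketch; this cell isn't in the original notebook, but it only reuses `averages` and `penguins` from above):

torgersen_mean = penguins.loc[penguins['island'] == 'Torgersen', 'bill_length_mm'].mean()

# Proportion of simulated sample means at least as small as Torgersen's observed mean.
(averages <= torgersen_mean).mean()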
There is a statistical tool you've learned about that would allow us to find the true probability distribution of the test statistic in this case. What is it?
The Central Limit Theorem! It tells us that, under the null, the distribution of the average bill length in a random sample of 47 penguins is roughly normal, with mean `penguins['bill_length_mm'].mean()` and standard deviation `penguins['bill_length_mm'].std(ddof=0) / np.sqrt(47)`.
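Here's a minimal sketch of that normal approximation (not from the original lecture; it assumes `scipy.stats.norm` and the quantities above):

from scipy import stats

# Under the null, the mean bill length of a random sample of 47 penguins is
# approximately normal with this mean and standard deviation (by the CLT).
pop_mean = penguins['bill_length_mm'].mean()
se = penguins['bill_length_mm'].std(ddof=0) / np.sqrt(47)

torgersen_mean = penguins.loc[penguins['island'] == 'Torgersen', 'bill_length_mm'].mean()

# Approximate p-value: probability of a sample mean at least as small as Torgersen's.
stats.norm.cdf(torgersen_mean, loc=pop_mean, scale=se)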
Faced with a question about the data raised by an observation...
"Standard" hypothesis testing helps us answer questions of the form:
I have a population distribution, and I have one sample. Does this sample look like it was drawn from the population?
It does not help us answer questions of the form:
I have two samples, but no information about any population distributions. Do these samples look like they were drawn from the same population?
That's where permutation testing comes in.
Here are a few more slides with examples that we won't cover in lecture.
Recall, a null hypothesis is an initial or default belief as to how data were generated.
Often, but not always, the null hypothesis states there is no association or difference between variables or subpopulations, and that any observed differences were due to random chance.