Lecture 9 – Hypothesis Testing

DSC 80, Spring 2023

Agenda

We'll look at many examples, and cover the necessary theory along the way.

"Standard" hypothesis testing

"Standard" hypothesis testing helps us answer questions of the form:

I have a population distribution, and I have one sample. Does this sample look like it was drawn from the population?

Example: Coin flipping

Recap: Coin flipping

Let's recap the example we saw last time.

Generating the null distribution

Generating the null distribution, using math

The number of heads in 100 flips of a fair coin follows the $\text{Binomial(100, 0.5)}$ distribution, in which

$$P(\text{# heads} = k) = {100 \choose k} (0.5)^k{(1-0.5)^{100-k}} = {100 \choose k} 0.5^{100}$$

The probability that we see at least 59 heads is then:
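A sketch of this exact calculation in Python, using math.comb for the binomial coefficients:

```python
from math import comb

# P(# heads >= 59) in 100 flips of a fair coin,
# computed exactly from the Binomial(100, 0.5) distribution
p_at_least_59 = sum(comb(100, k) for k in range(59, 101)) * 0.5 ** 100
print(p_at_least_59)  # ~0.044
```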

Let's look at this distribution visually.

Making a decision

We saw that, in 100 flips of a fair coin, $P(\text{# heads} \geq 59)$ is only ~4.4%.

⚠️ We can't "accept" the null!

Generating the null distribution, using simulation

First, let's figure out how to perform one instance of the experiment – that is, how to flip 100 coins once. Recall, to sample from a categorical distribution, we use np.random.multinomial.
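For instance, one run of the experiment might look like this (the [0.5, 0.5] probabilities describe a fair coin):

```python
import numpy as np

# one experiment: 100 flips of a fair coin
# np.random.multinomial returns an array of counts, here [# heads, # tails]
flips = np.random.multinomial(100, [0.5, 0.5])
num_heads = flips[0]
```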

Then, we can repeat it a large number of times.

Each entry in results is the number of heads in 100 simulated coin flips.

Visualizing the empirical distribution of the test statistic

Again, we can compute the p-value: the probability, under the null, of seeing a result at least as extreme as the observed one.

Note that this number is close, but not identical, to the true p-value we found before. That's because we computed this p-value using a simulation, and hence an approximation.
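As a rough sketch, the whole for-loop simulation and the empirical p-value might look like this (using 10,000 repetitions here):

```python
import numpy as np

results = []
for _ in range(10_000):
    # number of heads in one simulated set of 100 fair coin flips
    num_heads = np.random.multinomial(100, [0.5, 0.5])[0]
    results.append(num_heads)
results = np.array(results)

# empirical p-value: proportion of simulated statistics
# at least as extreme as the observed 59 heads
p_value = (results >= 59).mean()
```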

Reflection

Can we make things faster? 🏃

A mantra so far in this course has been avoid for-loops whenever possible. That applies here, too.

np.random.multinomial (and np.random.choice) accepts a size argument. By providing size=100_000, we can tell numpy to simulate 100 flips of a fair coin, 100,000 times, without needing a for-loop!
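A sketch of this vectorized approach:

```python
import numpy as np

# simulate 100 fair coin flips, 100_000 times, all at once:
# flips has shape (100_000, 2), where column 0 is # heads and column 1 is # tails
flips = np.random.multinomial(100, [0.5, 0.5], size=100_000)
results = flips[:, 0]

# empirical p-value, as before
p_value = (results >= 59).mean()
```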

The above approach is orders of magnitude faster than the for-loop approach! With that said, you are still allowed to use for-loops for hypothesis (and permutation) tests on assignments.

Choosing alternative hypotheses and test statistics

Absolute test statistics

For the alternative hypothesis "the coin is biased", one test statistic we could use is $|N_H - \frac{N}{2}|$, the absolute difference from the expected number of heads.
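As a small illustration (with $N = 100$ flips), this statistic treats 59 heads and 41 heads identically, which is exactly what "the coin is biased" calls for:

```python
# absolute difference from the expected number of heads, N / 2
def abs_diff_from_expected(num_heads, n=100):
    return abs(num_heads - n / 2)

abs_diff_from_expected(59)  # 9.0
abs_diff_from_expected(41)  # 9.0
```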

Important

We'd like to choose a test statistic such that large values of the test statistic correspond to one hypothesis, and small values correspond to the other.

In other words, we'll try to avoid "two-tailed tests". Rough rule of thumb: if the alternative hypothesis is of the form "A > B", the test statistic should measure a signed difference; if it is of the form "A and B are different", the test statistic should measure a distance, e.g. an absolute difference.

Fun fact

Example: Total variation distance

Ethnic distribution of California vs. UCSD

The DataFrame below contains the ethnic breakdown of the state as a whole (source) and UCSD as of 2016 (source).

Is the difference between the two distributions significant?

Let's establish our hypotheses.

Total variation distance

The total variation distance (TVD) is a test statistic that describes the distance between two categorical distributions.

If $A = [a_1, a_2, ..., a_k]$ and $B = [b_1, b_2, ..., b_k]$ are both categorical distributions, then the TVD between $A$ and $B$ is

$$\text{TVD}(A, B) = \frac{1}{2} \sum_{i = 1}^k |a_i - b_i|$$

Let's compute the TVD between UCSD's ethnic distribution and California's ethnic distribution.
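A sketch of a TVD function; the two example distributions below are made up for illustration (the real ones come from the eth DataFrame):

```python
import numpy as np

def total_variation_distance(dist1, dist2):
    # half the sum of absolute differences between two categorical distributions
    return np.abs(np.array(dist1) - np.array(dist2)).sum() / 2

# hypothetical distributions over 4 categories
tvd = total_variation_distance([0.4, 0.3, 0.2, 0.1], [0.3, 0.3, 0.2, 0.2])  # ≈ 0.1
```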

The issue is we don't know whether this is a large value or a small value – we don't know where it lies in the distribution of TVDs under the null.

The plan

To conduct our hypothesis test, we will:

Generating one random sample

Again, to sample from a categorical distribution, we use np.random.multinomial.

Important: We must sample from the "population" distribution here, which is the ethnic distribution of everyone in California.
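A sketch with a hypothetical population distribution and sample size (the real values come from eth['California'] and UCSD's enrollment):

```python
import numpy as np

pop_dist = np.array([0.4, 0.3, 0.2, 0.1])  # hypothetical population distribution
N_STUDENTS = 30_000                         # hypothetical number of UCSD students

# draw one random sample of N_STUDENTS people from the population distribution,
# then convert counts to proportions
counts = np.random.multinomial(N_STUDENTS, pop_dist)
sample_dist = counts / N_STUDENTS
```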

Generating many random samples

We could write a for-loop to repeat the process on the previous slide many times (and you can in labs and projects). However, we now know about the size argument in np.random.multinomial, so let's use that here.

Notice that each row of eth_draws sums to 1, because each row is a simulated categorical distribution.
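With size=, the repeated sampling might look like this (using the same hypothetical pop_dist and N_STUDENTS as before):

```python
import numpy as np

pop_dist = np.array([0.4, 0.3, 0.2, 0.1])  # hypothetical population distribution
N_STUDENTS = 30_000                         # hypothetical sample size

# 10_000 simulated samples at once; dividing by N_STUDENTS turns
# each row of counts into a distribution that sums to 1
eth_draws = np.random.multinomial(N_STUDENTS, pop_dist, size=10_000) / N_STUDENTS
```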

Computing many TVDs, without a for-loop

One issue is that the total_variation_distance function we've defined won't work with eth_draws (unless we use a for-loop), so we'll have to compute the TVDs a different way, using array operations.

Just to make sure we did things correctly, we can compute the TVD between the first row of eth_draws and eth['California'] using our previous function.
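Broadcasting lets us compute all of the TVDs in one line, and we can check the first one against the function version (again with a hypothetical pop_dist standing in for eth['California']):

```python
import numpy as np

pop_dist = np.array([0.4, 0.3, 0.2, 0.1])  # hypothetical population distribution
eth_draws = np.random.multinomial(30_000, pop_dist, size=10_000) / 30_000

# TVD of every row of eth_draws against the population, without a for-loop
tvds = np.abs(eth_draws - pop_dist).sum(axis=1) / 2

# sanity check: the vectorized TVD for row 0 matches the function version
def total_variation_distance(dist1, dist2):
    return np.abs(np.array(dist1) - np.array(dist2)).sum() / 2

assert np.isclose(tvds[0], total_variation_distance(eth_draws[0], pop_dist))
```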

Visualizing the empirical distribution of the test statistic

No, there's not a mistake in our code!

Conclusion

Summary of the method

To assess whether an "observed sample" was drawn randomly from a known categorical distribution:

Aside

Discussion Question

At what value of N_STUDENTS would we fail to reject the null (at a 0.05 p-value cutoff)?

To fail to reject the null, our sample size (that is, the number of students at UCSD) would have to be in the single digits.

Example: Penguins (again!)

(source)

Consider the penguins dataset from a few lectures ago.

Average bill length by island

It appears that penguins on Torgersen Island have shorter bills on average than penguins on other islands.

Setup

The plan

Simulation

Again, while you could do this with a for-loop (and you can use a for-loop for hypothesis tests in labs and projects), we'll use the faster size approach here.

Instead of using np.random.multinomial, which samples from a categorical distribution, we'll use np.random.choice, which samples from a known sequence of values.
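A sketch using a made-up stand-in for penguins['bill_length_mm'] (the real series comes from the dataset); 47 is the number of Torgersen Island penguins:

```python
import numpy as np

# hypothetical stand-in for penguins['bill_length_mm']
bill_lengths = np.random.normal(44, 5, size=333)
n_torgersen = 47  # number of Torgersen Island penguins

# draw n_torgersen bill lengths from the full dataset, 10_000 times at once,
# then compute the mean of each simulated sample
samples = np.random.choice(bill_lengths, size=(10_000, n_torgersen))
sample_means = samples.mean(axis=1)
```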

Visualizing the empirical distribution of the test statistic

It doesn't look like the average bill length of penguins on Torgersen Island came from the distribution of bill lengths of all penguins in our dataset.

Discussion Question

There is a statistical tool you've learned about that would allow us to find the true probability distribution of the test statistic in this case. What is it?


Answer: The Central Limit Theorem (CLT). Recall, the CLT tells us that for any population distribution, the distribution of the sample mean is roughly normal, with the same mean as the population mean. Furthermore, it tells us that the standard deviation of the distribution of the sample mean is $\frac{\text{Population SD}}{\sqrt{\text{sample size}}}$. So, the distribution of sample means of samples of size 47 drawn from penguins['bill_length_mm'] is roughly normal with mean penguins['bill_length_mm'].mean() and standard deviation penguins['bill_length_mm'].std(ddof=0) / np.sqrt(47).
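Under this approximation, the parameters of the normal distribution can be computed directly (using the same hypothetical stand-in for the bill length series):

```python
import numpy as np

# hypothetical stand-in for penguins['bill_length_mm']
bill_lengths = np.random.normal(44, 5, size=333)

# CLT: the distribution of the mean of samples of size 47 is roughly normal
# with these parameters
clt_mean = bill_lengths.mean()
clt_sd = bill_lengths.std(ddof=0) / np.sqrt(47)
```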

Summary

The hypothesis testing "recipe"

Faced with a question about the data raised by an observation...

  1. Carefully pose the question as a testable "yes or no" hypothesis.
  2. Decide on a test statistic that helps differentiate between instances that would affirm or reject the hypothesis.
  3. Create a probability model for the data generating process that reflects the "known behavior" of the process.
  4. Simulate the data generating process using this probability model (the "null hypothesis").
  5. Assess if the observation is consistent with the simulations by computing a p-value.

Hypothesis testing vs. permutation testing

"Standard" hypothesis testing helps us answer questions of the form:

I have a population distribution, and I have one sample. Does this sample look like it was drawn from the population?

It does not help us answer questions of the form:

I have two samples, but no information about any population distributions. Do these samples look like they were drawn from the same population?

That's where permutation testing comes in.

Additional reading

Here are a few more slides with examples that we won't cover in lecture.

Null hypothesis

Alternative hypothesis

P-values and cutoffs