Lecture 23 – The Central Limit Theorem, Choosing Sample Sizes

DSC 10, Fall 2022

Announcements

Agenda

The Central Limit Theorem

$$\text{SD of Distribution of Possible Sample Means} = \frac{\text{Population SD}}{\sqrt{\text{sample size}}}$$
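The formula above can be checked with a quick simulation. The sketch below is not the lecture's own code: the population is made up (a skewed exponential distribution), and names such as `population` and `sample_means` are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(23)  # arbitrary seed so the run is repeatable

# Hypothetical, skewed population (an assumption, not the lecture's dataset).
population = rng.exponential(scale=10, size=100_000)
sample_size = 100

# Draw many samples (with replacement, so draws are independent)
# and record the mean of each one.
sample_means = np.array([
    rng.choice(population, size=sample_size).mean()
    for _ in range(2000)
])

# SD predicted by the formula vs. SD actually observed in the simulation.
predicted_sd = np.std(population) / np.sqrt(sample_size)
observed_sd = np.std(sample_means)
print(predicted_sd, observed_sd)
```

The two printed numbers should be close to one another, even though the population itself is far from normal.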

Confidence intervals

Constructing a 95% confidence interval through bootstrapping

Let's draw one sample then bootstrap to generate 2000 resample means.

One approach to computing a confidence interval for the population mean involves taking the middle 95% of this distribution.
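A minimal sketch of this bootstrap procedure, using NumPy. The values in `my_sample` are made up here as a stand-in for the lecture's sample, and `left_boot` and `right_boot` are illustrative names.

```python
import numpy as np

rng = np.random.default_rng(10)  # arbitrary seed so the run is repeatable

# Stand-in for the one sample we drew: 500 made-up values (an assumption).
my_sample = rng.exponential(scale=10, size=500)

# Resample with replacement, keeping the original sample size each time,
# and record the mean of each resample.
resample_means = np.array([
    rng.choice(my_sample, size=len(my_sample), replace=True).mean()
    for _ in range(2000)
])

# Middle 95% of the resample means: cut off 2.5% in each tail.
left_boot = np.percentile(resample_means, 2.5)
right_boot = np.percentile(resample_means, 97.5)
print([left_boot, right_boot])
```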

Middle 95% of a normal distribution

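But we didn't need to bootstrap to learn what the distribution of the sample mean looks like. We could instead use the CLT, which tells us that the distribution of the sample mean is roughly normal. Further, its mean is approximately the sample mean (our estimate of the population mean), and its standard deviation is approximately the sample SD divided by the square root of the sample size.

So, the distribution of the sample mean is approximately normal, centered at the sample mean, with that standard deviation.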

Question: What interval on the $x$-axis captures the middle 95% of this distribution?

Recall: Normal approximations

As we saw last class, if a variable is roughly normal, then approximately 95% of its values are within 2 standard deviations of its mean.

Let's use this fact here!
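As a quick check of the "within 2 SDs" rule, the sketch below uses Python's built-in `statistics.NormalDist` (a choice made for this example, not necessarily the tool used in lecture) to compute the area under the standard normal curve between -2 and 2.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Area between -2 and 2 standard units.
within_2_sds = standard_normal.cdf(2) - standard_normal.cdf(-2)
print(round(within_2_sds, 4))  # 0.9545
```

The area is slightly more than 95%, which is why "2 SDs" is a convenient approximation rather than an exact cutoff.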

Computing a 95% confidence interval using the CLT

$$\text{SD of Distribution of Possible Sample Means} \approx \frac{\text{Sample SD}}{\sqrt{\text{sample size}}}$$
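Putting the pieces together, a CLT-based interval needs only the sample mean, the sample SD, and the sample size. The sketch below again uses made-up values for `my_sample`; `left_clt` and `right_clt` are illustrative names.

```python
import numpy as np

rng = np.random.default_rng(10)  # arbitrary seed so the run is repeatable

# Stand-in for the one sample we drew: 500 made-up values (an assumption).
my_sample = rng.exponential(scale=10, size=500)

sample_mean = np.mean(my_sample)
sample_sd = np.std(my_sample)

# Approximate SD of the distribution of possible sample means.
sd_of_sample_mean = sample_sd / np.sqrt(len(my_sample))

# Middle 95% of a normal distribution: within 2 SDs of the center.
left_clt = sample_mean - 2 * sd_of_sample_mean
right_clt = sample_mean + 2 * sd_of_sample_mean
print([left_clt, right_clt])
```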

Visualizing the CLT-based confidence interval

Comparing confidence intervals

We've constructed two confidence intervals for the population mean:

One using bootstrapping,

and one using the CLT.

In both cases, we only used information in my_sample, not the population.

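The intervals created using each method are slightly different, because each method involves some approximation: bootstrapping relies on a finite number of random resamples, while the CLT approach relies on a normal approximation and on rounding to 2 SDs.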

Recap: Confidence intervals for the population mean

A 95% confidence interval for the population mean is given by

$$ \left[\text{sample mean} - 2\cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}}, \text{sample mean} + 2\cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}} \right] $$

This CI doesn't require bootstrapping, and it only requires three numbers – the sample mean, the sample SD, and the sample size!

Bootstrapping vs. the CLT

Bootstrapping still has its uses!

| | Bootstrap | CLT |
|---|---|---|
| Pro | Works for many sample statistics (mean, median, standard deviation). | Only requires 3 numbers – the sample mean, sample SD, and sample size. |
| Con | Very computationally expensive (requires drawing many, many samples from the original sample). | Only works for the sample mean (and sum). |

Activity

We just saw that when $z = 2$, the following is a 95% confidence interval for the population mean.

$$ \left[\text{sample mean} - z\cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}}, \text{sample mean} + z\cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}} \right] $$

Question: What value of $z$ should we use to create an 80% confidence interval? 90%?
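One way to check your answer, after attempting the question: the sketch below finds the number of standard units that captures a given middle area of the standard normal curve. The helper name `z_for_confidence` is hypothetical, and using `statistics.NormalDist` is an assumption about tooling rather than the lecture's approach.

```python
from statistics import NormalDist

def z_for_confidence(level):
    """Return z such that the middle `level` of a standard normal
    distribution lies between -z and z."""
    tail = (1 - level) / 2          # area left over in each tail
    return NormalDist().inv_cdf(1 - tail)

for level in [0.80, 0.90, 0.95]:
    print(level, round(z_for_confidence(level), 2))
```

For 95%, this gives a value just under 2, so the $z = 2$ used above is a rounded version of the exact cutoff.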

Concept Check ✅ – Answer at cc.dsc10.com

Which one of these histograms corresponds to the distribution of the sample mean for samples of size 100 drawn from a population with mean 50 and SD 20?

Hypothesis testing, revisited

Hypothesis testing for the mean

Using a confidence interval for hypothesis testing

Example: Body temperature 🌡

Setting up a hypothesis test

The mean body temperature of all people is a population mean!

CI for mean body temperature

$$ \left[ \text{sample mean} - 2\cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}}, \ \text{sample mean} + 2\cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}} \right] $$

Careful! This doesn't mean that 95% of temperatures in our sample (or the population) fall in this range!
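A hedged sketch of how a CLT-based interval can drive a hypothesis test for a mean: reject the null hypothesis when the hypothesized population mean falls outside the 95% interval. Everything specific here is an assumption made for illustration: the `temperatures` values are randomly generated rather than real measurements, the helper names are hypothetical, and the hypothesized mean of 98.6 is a commonly quoted figure rather than something stated above.

```python
import numpy as np

def clt_confidence_interval(sample):
    """95% CLT-based confidence interval for the population mean."""
    sample = np.asarray(sample, dtype=float)
    half_width = 2 * np.std(sample) / np.sqrt(len(sample))
    return sample.mean() - half_width, sample.mean() + half_width

def reject_null(sample, hypothesized_mean):
    """True if the hypothesized mean lies outside the 95% interval."""
    left, right = clt_confidence_interval(sample)
    return not (left <= hypothesized_mean <= right)

rng = np.random.default_rng(0)
# Made-up body temperatures in degrees Fahrenheit (not real data).
temperatures = rng.normal(loc=98.2, scale=0.7, size=130)

print(clt_confidence_interval(temperatures))
print(reject_null(temperatures, hypothesized_mean=98.6))
```

Note that the interval describes plausible values of the population mean, not the range of individual temperatures.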

Conclusion

Choosing sample sizes

Example: Polling

Question: How big of a sample do you need? 🤔

Aside: Proportions are just means

$$\frac{0 + 1 + 1 + 0 + 1}{5} = \frac{3}{5}$$

Key takeaway: The CLT applies in this case as well! The distribution of the proportion of 1s in our sample is roughly normal.
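To see that a proportion is just a mean, the sketch below computes both for the five responses in the equation above. The array name `responses` is illustrative.

```python
import numpy as np

# The five 0/1 responses from the example above.
responses = np.array([0, 1, 1, 0, 1])

proportion_of_ones = np.count_nonzero(responses) / len(responses)
mean_of_responses = responses.mean()
print(proportion_of_ones, mean_of_responses)  # 0.6 0.6
```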

Our strategy

We will:

  1. Collect a random sample.
  2. Compute the sample mean (i.e., the proportion of people who say "yes").
  3. Compute the sample standard deviation.
  4. Construct a 95% confidence interval for the population mean:
$$ \left[ \text{sample mean} - 2\cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}}, \ \text{sample mean} + 2\cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}} \right] $$

Note that the width of our CI is the right endpoint minus the left endpoint:

$$ \text{width} = 4 \cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}} $$
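The width formula shows how the interval narrows as the sample grows. In the sketch below, `ci_width` is a hypothetical helper and the inputs are example values, not results from a real poll.

```python
import math

def ci_width(sample_sd, sample_size):
    """Width of the 95% CLT-based CI: right endpoint minus left endpoint."""
    return 4 * sample_sd / math.sqrt(sample_size)

# Quadrupling the sample size halves the width.
print(ci_width(0.5, 400))
print(ci_width(0.5, 1600))
```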

To keep the interval at most 0.06 wide, we need the width to satisfy

$$4 \cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}} \leq 0.06$$

Solving for the sample size:

$$\sqrt{\text{sample size}} \geq 4 \cdot \frac{\text{sample SD}}{0.06} \implies \boxed{\text{sample size} \geq \left( 4 \cdot \frac{\text{sample SD}}{0.06} \right)^2}$$

Upper bound for the standard deviation of a sample

$$\text{SD of Collection of 0s and 1s} = \sqrt{(\text{Prop. of 0s}) \times (\text{Prop. of 1s})}$$
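The sketch below checks this formula against `np.std` on a small example, then scans proportions from 0 to 1 to find where the SD is largest. Variable names are illustrative.

```python
import numpy as np

# Check the formula on a small collection of 0s and 1s.
values = np.array([0, 1, 1, 0, 1])
prop_ones = values.mean()
formula_sd = np.sqrt((1 - prop_ones) * prop_ones)
print(np.std(values), formula_sd)

# Scan every proportion of 1s from 0.00 to 1.00 in steps of 0.01.
proportions = np.arange(0, 101) / 100
sds = np.sqrt(proportions * (1 - proportions))

largest_sd = sds.max()
proportion_at_largest = proportions[sds.argmax()]
print(largest_sd, proportion_at_largest)
```

The SD peaks when the 0s and 1s are split evenly, so 0.5 serves as an upper bound for the SD of any collection of 0s and 1s.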

Choosing a sample size

$$\text{sample size} \geq \left( 4 \cdot \frac{\text{sample SD}}{0.06} \right)^2$$

By substituting 0.5, the largest possible SD of a collection of 0s and 1s, for the sample SD, we get

$$\text{sample size} \geq \left( 4 \cdot \frac{\text{0.5}}{0.06} \right)^2$$

While any sample size that satisfies the above inequality will give us a confidence interval that satisfies the necessary properties, it's time-consuming to gather larger samples than necessary. So, we'll pick the smallest sample size that satisfies the above inequality.

Conclusion: Since $\left( 4 \cdot \frac{0.5}{0.06} \right)^2 = 1111.11\ldots$, we must sample at least 1112 people to construct a 95% CI for the population mean that is at most 0.06 wide.
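The calculation above can be sketched as a small function. The name `smallest_sample_size` and its parameters are hypothetical; the default of 0.5 is the upper bound on the SD of a collection of 0s and 1s.

```python
import math

def smallest_sample_size(max_width, sd_upper_bound=0.5):
    """Smallest sample size whose 95% CLT-based CI is at most
    `max_width` wide, assuming the sample SD is at most `sd_upper_bound`."""
    return math.ceil((4 * sd_upper_bound / max_width) ** 2)

print(smallest_sample_size(0.06))  # 1112
```

Rounding up with `math.ceil` matters here: rounding down would give a sample that is slightly too small to guarantee the desired width.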

Activity

Suppose we instead want a 95% CI for the population mean that is at most 0.03 wide. What is the smallest sample size we could collect?

Hint: Use the fact that we must sample 1112 people for a 95% CI for the population mean that is at most 0.06 wide.


Answer (attempt the question yourself first): $\text{sample size} \geq \left( 4 \cdot \frac{0.5}{0.03} \right)^2 = 4444.44\ldots$, so the smallest sample size we could collect is 4445.

Summary, next time

Summary

$$ \left[\text{sample mean} - 2\cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}}, \text{sample mean} + 2\cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}} \right] $$

What we've learned about inference

At a high level, the second half of this class has been about statistical inference – using a sample to draw conclusions about the population.

Next time