Lecture 22 – The Normal Distribution, The Central Limit Theorem

DSC 10, Fall 2022

Announcements

Agenda

Recap: Standard units

SAT scores range from 0 to 1600. The distribution of SAT scores has a mean of 950 and a standard deviation of 300. Your friend tells you that their SAT score, in standard units, is 2.5. What do you conclude?
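One way to check the friend's claim is to convert 2.5 standard units back into a raw score. The sketch below uses only the numbers stated in the question; the variable names are illustrative.

```python
# Values taken from the question above.
mean_score = 950
sd_score = 300
score_in_standard_units = 2.5
max_possible_score = 1600

# Invert the standard-units formula: original = mean + (standard units) * SD.
original_score = mean_score + score_in_standard_units * sd_score

print(original_score)                       # 1700.0
print(original_score > max_possible_score)  # True
```

A raw score of 1700 is above the maximum of 1600, so the reported value in standard units cannot be correct.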

The normal distribution

Recap: The standard normal distribution

Areas under the standard normal curve

What does scipy.stats.norm.cdf(0) evaluate to? Why?

Areas under the standard normal curve

Suppose we want to find the area to the right of 2 under the standard normal curve.

The following expression gives us the area to the left of 2.

However, since the total area under the standard normal curve is 1:

$$\text{area right of $2$} = 1 - (\text{area left of $2$})$$
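As a minimal sketch, the two areas can be computed with `scipy.stats.norm.cdf`, the function named earlier in the lecture; the variable names are illustrative.

```python
from scipy import stats

# Area to the left of 2 under the standard normal curve.
area_left_of_2 = stats.norm.cdf(2)

# The total area is 1, so the area to the right is the complement.
area_right_of_2 = 1 - area_left_of_2

print(area_left_of_2)   # roughly 0.977
print(area_right_of_2)  # roughly 0.023
```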

Areas under the standard normal curve

How might we use stats.norm.cdf to compute the area between -1 and 0?

Strategy:

$$\text{area from $-1$ to $0$} = (\text{area left of $0$}) - (\text{area left of $-1$})$$

General strategy for finding area

The area under the standard normal curve in the interval $[a, b]$ is

stats.norm.cdf(b) - stats.norm.cdf(a)
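The strategy above can be wrapped in a small helper. This is a sketch; `normal_area` is a hypothetical name, not a function from `scipy`.

```python
from scipy import stats

def normal_area(a, b):
    """Area under the standard normal curve between a and b, where a <= b."""
    return stats.norm.cdf(b) - stats.norm.cdf(a)

# The area between -1 and 0 from the earlier question.
area_minus1_to_0 = normal_area(-1, 0)
print(area_minus1_to_0)  # roughly 0.34
```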

What can we do with this? We're about to see!

Using the normal distribution

Let's return to our data set of heights and weights.

As we saw before, both variables are roughly normal. What benefit is there to knowing that the two distributions are roughly normal?

Standard units and the normal distribution

Example: Proportion of weights between 200 and 225 pounds

Let's suppose, as is often the case, that we don't have access to the entire distribution of weights, just the mean and SD.

Using just this information, we can estimate the proportion of weights between 200 and 225 pounds:

  1. Convert 200 to standard units.
  2. Convert 225 to standard units.
  3. Use stats.norm.cdf to find the area between (1) and (2).
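The three steps above might look like the following sketch. The mean and SD here are placeholder values chosen for illustration, not the actual statistics of the weights data set.

```python
from scipy import stats

# Placeholder values (assumptions), standing in for the real mean and SD of weights.
weight_mean = 187.0
weight_sd = 20.0

def to_standard_units(value, mean, sd):
    """Convert a value to standard units: (value - mean) / SD."""
    return (value - mean) / sd

# Steps 1 and 2: convert both endpoints to standard units.
left = to_standard_units(200, weight_mean, weight_sd)
right = to_standard_units(225, weight_mean, weight_sd)

# Step 3: area under the standard normal curve between the two endpoints.
estimated_proportion = stats.norm.cdf(right) - stats.norm.cdf(left)
print(estimated_proportion)
```

The estimate is only as good as the assumption that the weights are roughly normal.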

Checking the approximation

Since we have access to the entire set of weights, we can compute the true proportion of weights between 200 and 225 pounds.

Pretty good for an approximation! 🤩

Warning: Standardization doesn't make a distribution normal!

Consider the distribution of delays from earlier in the lecture.

The distribution above does not look normal, and it won't look normal even if we standardize it. By standardizing a distribution, all we do is shift it horizontally and rescale it (the horizontal axis is compressed or stretched, and the heights adjust so the total area stays 1) – the shape itself doesn't change.
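A quick numerical check of this claim, as a sketch: the data below is a hypothetical right-skewed array standing in for the delays, and `skewness` is an illustrative helper that measures asymmetry (about 0 for a symmetric distribution).

```python
import numpy as np

# Hypothetical right-skewed data (an assumption), not the real delays.
rng = np.random.default_rng(0)
skewed = rng.exponential(scale=10, size=10_000)

def standard_units(values):
    """Convert an array to standard units: (value - mean) / SD."""
    return (values - values.mean()) / values.std()

def skewness(values):
    """Average cubed value in standard units; about 0 for a symmetric distribution."""
    return (standard_units(values) ** 3).mean()

standardized = standard_units(skewed)

print(standardized.mean(), standardized.std())   # about 0 and 1
print(skewness(skewed), skewness(standardized))  # the same value, far from 0
```

Standardizing changes the mean to 0 and the SD to 1, but the asymmetry of the data is exactly what it was before.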

Center and spread, revisited

Special cases

| Percent in Range | Normal Distribution |
|---|---|
| $\text{mean} \pm 1 \: \text{SD}$ | $\approx 68\%$ |
| $\text{mean} \pm 2 \: \text{SDs}$ | $\approx 95\%$ |
| $\text{mean} \pm 3 \: \text{SDs}$ | $\approx 99.73\%$ |

68% of values are within 1 SD of the mean

This means that if a variable follows a normal distribution, approximately 68% of values will be within 1 SD of the mean.

95% of values are within 2 SDs of the mean
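These percentages can be recovered from `stats.norm.cdf`. The sketch below computes the area within 1, 2, and 3 SDs of the mean of the standard normal curve; `percent_within` is an illustrative name.

```python
from scipy import stats

# Percent of the area within z SDs of the mean, for z = 1, 2, 3.
percent_within = {}
for z in [1, 2, 3]:
    area = stats.norm.cdf(z) - stats.norm.cdf(-z)
    percent_within[z] = round(float(area) * 100, 2)

print(percent_within)  # {1: 68.27, 2: 95.45, 3: 99.73}
```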

Chebyshev's inequality and the normal distribution

| Range | All Distributions (via Chebyshev's inequality) | Normal Distribution |
|---|---|---|
| mean $\pm \ 1$ SD | $\geq 0\%$ | $\approx 68\%$ |
| mean $\pm \ 2$ SDs | $\geq 75\%$ | $\approx 95\%$ |
| mean $\pm \ 3$ SDs | $\geq 88.8\%$ | $\approx 99.73\%$ |
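The Chebyshev column follows from the bound $1 - \frac{1}{z^2}$ for the proportion of values within $z$ SDs of the mean. A minimal sketch, with `chebyshev_lower_bound` as an illustrative name:

```python
def chebyshev_lower_bound(z):
    """Minimum proportion of values within z SDs of the mean, for any distribution."""
    return max(0.0, 1 - 1 / z ** 2)

for z in [1, 2, 3]:
    print(z, chebyshev_lower_bound(z))
```

The bound holds for every distribution, which is why it is much weaker than the normal-distribution percentages.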

Inflection points

Example: Inflection points

Remember: The distribution of heights is roughly normal, but it is not a standard normal distribution.

The Central Limit Theorem

Back to flight delays ✈️

The distribution of flight delays that we've been looking at is not roughly normal.

Empirical distribution of a sample statistic

Empirical distribution of the sample mean

Since we have access to the population of flight delays, let's remind ourselves what the distribution of the sample mean looks like by drawing samples repeatedly from the population.

Notice that this distribution is roughly normal, even though the population distribution was not! This distribution is centered at the population mean.

The Central Limit Theorem

The Central Limit Theorem (CLT) says that the probability distribution of the sum or mean of a large random sample drawn with replacement will be roughly normal, regardless of the distribution of the population from which the sample is drawn.

While the formulas we're about to introduce only work for sample means, it's important to remember that the statement above also holds true for sample sums.

Characteristics of the distribution of the sample mean

Changing the sample size

The function sample_mean_delays takes in an integer sample_size, and:

  1. Takes a sample of size sample_size directly from the population.
  2. Computes the mean of the sample.
  3. Repeats steps 1 and 2 above 2000 times, and returns an array of the resulting means.

Let's call sample_mean_delays on several values of sample_size.
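A rough sketch of what such a function could look like. The population here is a hypothetical skewed array rather than the real flight delays, sampling with replacement is an assumption, and the `repetitions` parameter is added for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in for the population of flight delays (skewed, not normal).
population_delays = rng.exponential(scale=15, size=50_000)

def sample_mean_delays(sample_size, repetitions=2000):
    """Return an array of sample means, one per repetition."""
    means = np.empty(repetitions)
    for i in range(repetitions):
        # Step 1: draw a random sample (with replacement, an assumption here).
        sample = rng.choice(population_delays, size=sample_size, replace=True)
        # Step 2: record the mean of that sample.
        means[i] = sample.mean()
    return means

# Call the function on several sample sizes and compare center and spread.
sample_means = {n: sample_mean_delays(n) for n in [5, 50, 500]}
for n, means in sample_means.items():
    print(n, means.mean(), means.std())
```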

Let's look at the resulting distributions.

What do you notice? 🤔

Standard deviation of the distribution of the sample mean

It appears that as the sample size increases, the standard deviation of the distribution of the sample mean decreases quickly.

Standard deviation of the distribution of the sample mean

$$\text{SD of Distribution of Possible Sample Means} = \frac{\text{Population SD}}{\sqrt{\text{sample size}}}$$
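The formula can be checked by simulation. In this sketch the population is a hypothetical skewed array, used only for illustration, and samples are drawn with replacement.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical skewed population (an assumption), used only to illustrate the formula.
population = rng.exponential(scale=15, size=100_000)
sample_size = 100

# 2000 samples of size 100, drawn with replacement; one row per sample.
samples = rng.choice(population, size=(2000, sample_size), replace=True)
simulated_sd = samples.mean(axis=1).std()

# What the formula predicts: population SD divided by the square root of the sample size.
predicted_sd = population.std() / np.sqrt(sample_size)

print(simulated_sd, predicted_sd)  # the two values should be close
```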

Recap: Distribution of the sample mean

If we were to take many, many samples of the same size from a population, and take the mean of each sample, the distribution of the sample mean will have the following characteristics:

  1. Its shape will be roughly normal.
  2. It will be centered at the population mean.
  3. Its standard deviation will be

$$\text{SD of Distribution of Possible Sample Means} = \frac{\text{Population SD}}{\sqrt{\text{sample size}}}$$

🚨 Practical Issue: The mean and standard deviation of the distribution of the sample mean both depend on the original population, but we typically don't have access to the population!

Bootstrapping vs. the CLT

Estimating the distribution of the sample mean by bootstrapping

Let's take a single sample of size 500 from delays.

Before today, to estimate the distribution of the sample mean using just this sample, we'd bootstrap:
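A minimal sketch of that bootstrap, assuming a hypothetical array `my_sample` of 500 values in place of the real sample of delays:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for the single sample of 500 delays.
my_sample = rng.exponential(scale=15, size=500)

resample_means = np.empty(2000)
for i in range(2000):
    # Resample with replacement, using the same size as the original sample.
    resample = rng.choice(my_sample, size=len(my_sample), replace=True)
    resample_means[i] = resample.mean()

print(resample_means.mean(), my_sample.mean())  # the two centers should be close
```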

The CLT tells us what this distribution will look like, without having to bootstrap!

Using the CLT with just a single sample

Suppose all we have access to in practice is a single "original sample." If we were to take many, many samples of the same size from this original sample, and take the mean of each resample, the distribution of the (re)sample mean will have the following characteristics:

  1. Its shape will be roughly normal.
  2. It will be centered at the mean of the original sample.
  3. Its standard deviation will be

$$\begin{align} \text{SD of Distribution of Possible Sample Means} &= \frac{\text{Population SD}}{\sqrt{\text{sample size}}} \\ &\approx \boxed{\frac{\textbf{Sample SD}}{\sqrt{\text{sample size}}}} \end{align}$$

Let's test this out!

Using the CLT with just a single sample

Using just the original sample, my_sample, we estimate that the distribution of the sample mean has the following mean:

and the following standard deviation:
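Both estimates need only the sample itself. In the sketch below, `my_sample` is a hypothetical array standing in for the real sample of delays, and `estimated_mean` and `estimated_sd` are illustrative names.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for the single sample of 500 delays.
my_sample = rng.exponential(scale=15, size=500)

# Estimated center of the distribution of the sample mean: the sample mean.
estimated_mean = my_sample.mean()

# Estimated SD of the distribution of the sample mean: sample SD / sqrt(sample size).
estimated_sd = np.std(my_sample) / np.sqrt(len(my_sample))

print(estimated_mean, estimated_sd)
```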

Let's draw a normal distribution with the above mean and standard deviation, and overlay the bootstrapped distribution from earlier.

Key takeaway: Given just a single sample, we can use the CLT to estimate the distribution of the sample mean, without bootstrapping.

Confidence intervals

Constructing a 95% confidence interval via the bootstrap

Earlier, we bootstrapped my_sample to generate 2000 resample means. One approach to computing a confidence interval for the population mean involves taking the middle 95% of this distribution.
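A sketch of the percentile approach, using a hypothetical `my_sample` in place of the real sample of delays:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for the single sample of 500 delays.
my_sample = rng.exponential(scale=15, size=500)

# 2000 resamples with replacement, one per row, and the mean of each.
resamples = rng.choice(my_sample, size=(2000, len(my_sample)), replace=True)
resample_means = resamples.mean(axis=1)

# The middle 95% runs from the 2.5th percentile to the 97.5th percentile.
left = np.percentile(resample_means, 2.5)
right = np.percentile(resample_means, 97.5)

print([left, right])
```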

Middle 95% of a normal distribution

Using the CLT and my_sample only, we estimate that the sample mean's distribution is the following normal distribution:

Question: What interval on the $x$-axis captures the middle 95% of the above distribution?

Recap: Normal distributions

As we saw earlier, if a variable is roughly normal, then approximately 95% of its values are within 2 standard deviations of its mean.

Let's use this fact here!

Computing a 95% confidence interval via the CLT

$$\text{SD of Distribution of Possible Sample Means} \approx \frac{\text{Sample SD}}{\sqrt{\text{sample size}}}$$

Visualizing the CLT-based confidence interval

Comparing confidence intervals

We've constructed two confidence intervals for the population mean:

One using bootstrapping,

and one using the CLT.

In both cases, we only used information in my_sample, not the population.

Recap: Confidence intervals for the population mean

An approximate 95% confidence interval for the population mean is given by

$$ \left[\text{sample mean} - 2\cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}}, \text{sample mean} + 2\cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}} \right] $$

This CI doesn't require bootstrapping, and it only requires three numbers – the sample mean, the sample SD, and the sample size!
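As a sketch, the interval can be computed from those three numbers alone. The function name `clt_confidence_interval` and the inputs in the example call are made up for illustration.

```python
import math

def clt_confidence_interval(sample_mean, sample_sd, sample_size):
    """Approximate 95% CI for the population mean: sample mean +/- 2 SDs of the sample mean."""
    sd_of_sample_mean = sample_sd / math.sqrt(sample_size)
    return (sample_mean - 2 * sd_of_sample_mean,
            sample_mean + 2 * sd_of_sample_mean)

# Made-up numbers: sample mean 10, sample SD 20, sample size 400.
print(clt_confidence_interval(10, 20, 400))  # (8.0, 12.0)
```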

Bootstrap vs. the CLT

The bootstrap still has its uses!

| | Bootstrap | CLT |
|---|---|---|
| Pro | Works for many sample statistics (mean, median, standard deviation). | Only requires 3 numbers – the sample mean, sample SD, and sample size. |
| Con | Very computationally expensive (requires drawing many, many samples from the original sample). | Only works for the sample mean (and sum). |

Summary, next time

Summary

An approximate 95% confidence interval for the population mean, based on the CLT, is

$$ \left[\text{sample mean} - 2\cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}}, \text{sample mean} + 2\cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}} \right]. $$

Next time