Lecture 21 – Spread, The Normal Distribution

DSC 10, Spring 2023

Announcements

Agenda

Central tendency

Example: Flight delays ✈️

Question: Which is larger – the mean or the median?

Comparing the mean and median
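A small sketch can make the comparison concrete. The delay values below are made up for illustration (they are not the flight data used in lecture). The point is only that a few very large values pull the mean above the median when a distribution is skewed to the right.

```python
import numpy as np

# Hypothetical delays in minutes (illustrative only, not the lecture's data set).
# Most flights are close to on time; two are very late.
delays = np.array([-5, -2, 0, 1, 3, 4, 6, 10, 90, 240])

mean_delay = np.mean(delays)      # pulled upward by the two large delays
median_delay = np.median(delays)  # depends only on the middle of the sorted data

print("mean:  ", mean_delay)
print("median:", median_delay)
```

With these numbers the mean is far larger than the median, even though most delays are only a few minutes.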

Standard deviation

Question: How "wide" is a distribution?

Deviations from the mean

Each entry in deviations describes how far the corresponding element in data is from the mean, 4.25.

What is the average deviation?
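One way to explore this question is to compute the deviations directly. The array below is an assumed example, chosen only because its mean is 4.25; the names `data` and `deviations` follow the text above.

```python
import numpy as np

# Assumed example values, chosen so that the mean is 4.25.
data = np.array([2, 3, 3, 9])

mean = np.mean(data)       # 4.25
deviations = data - mean   # how far each value is from the mean

print(deviations)
print(np.mean(deviations))
```

The positive and negative deviations cancel, so their average is 0 for any data set. That is why the next step squares the deviations before averaging them.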

Average squared deviation

This quantity, the average squared deviation from the mean, is called the variance.

Standard deviation

Variance and standard deviation

To summarize:

$$\begin{align*}\text{variance} &= \text{average squared deviation from the mean}\\ &= \frac{(\text{value}_1 - \text{mean})^2 + \cdots + (\text{value}_n - \text{mean})^2}{n}\\ \text{standard deviation} &= \sqrt{\text{variance}} \end{align*}$$

where $n$ is the number of observations.
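The summary above translates almost line by line into NumPy. This is a minimal sketch on an assumed four-element array (not data from the lecture), compared against `np.var` and `np.std`, which divide by $n$ by default.

```python
import numpy as np

data = np.array([2, 3, 3, 9])  # assumed example values with mean 4.25

squared_deviations = (data - np.mean(data)) ** 2
variance = np.mean(squared_deviations)  # divide by n, as in the formula above
sd = np.sqrt(variance)

# np.var and np.std use the same divide-by-n definition by default (ddof=0).
print(variance, np.var(data))
print(sd, np.std(data))
```

Note that some other tools divide by $n - 1$ instead of $n$ by default, so their results can differ slightly from this definition.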

What can we do with the standard deviation?

It turns out, in any numerical distribution, the bulk of the data are in the range “mean ± a few SDs”.

Let's make this more precise.

Chebyshev’s inequality

Fact: In any numerical distribution, the proportion of values in the range “mean ± $z$ SDs” is at least

$$1 - \frac{1}{z^2}$$

| Range | Proportion |
| --- | --- |
| mean ± 2 SDs | at least $1 - \frac{1}{4}$ (75%) |
| mean ± 3 SDs | at least $1 - \frac{1}{9}$ (88.88...%) |
| mean ± 4 SDs | at least $1 - \frac{1}{16}$ (93.75%) |
| mean ± 5 SDs | at least $1 - \frac{1}{25}$ (96%) |
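The bounds listed above can be reproduced with a one-line function. The name `chebyshev_lower_bound` is a hypothetical helper introduced here for illustration.

```python
def chebyshev_lower_bound(z):
    """Chebyshev's guaranteed minimum proportion of values within z SDs of the mean.

    The bound is only informative for z > 1 (for z <= 1 it is 0 or negative).
    """
    return 1 - 1 / z ** 2

for z in [2, 3, 4, 5]:
    print(f"mean ± {z} SDs: at least {chebyshev_lower_bound(z):.2%}")
```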

Flight delays, revisited

Mean and standard deviation

Chebyshev's inequality tells us that

Let's visualize these intervals!

Chebyshev's inequality provides lower bounds!

Remember, Chebyshev's inequality states that at least $1 - \frac{1}{z^2}$ of values are within $z$ SDs from the mean, for any numerical distribution.

For instance, it tells us that at least 75% of delays are in the following interval:

However, in this case, a much larger fraction of delays are in that interval.
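The gap between the guarantee and the actual proportion can be seen on simulated data. The sketch below does not use the real flight delays; it assumes a right-skewed stand-in drawn from an exponential distribution, and `proportion_within` is a hypothetical helper.

```python
import numpy as np

# Simulated right-skewed "delays" (an assumption for illustration, not the real data).
rng = np.random.default_rng(0)
delays = rng.exponential(scale=20, size=10_000) - 5

def proportion_within(values, z):
    """Proportion of values in the interval [mean - z*SD, mean + z*SD]."""
    m, s = np.mean(values), np.std(values)
    return np.mean((values >= m - z * s) & (values <= m + z * s))

for z in [2, 3]:
    bound = 1 - 1 / z ** 2
    actual = proportion_within(delays, z)
    print(f"z = {z}: guaranteed at least {bound:.4f}, actual proportion {actual:.4f}")
```

The actual proportion can never fall below the bound, but it is often well above it, because Chebyshev's inequality has to hold for every possible distribution.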

If we know more about the shape of the distribution, we can provide better guarantees for the proportion of values within $z$ SDs of the mean.

Activity

For a particular set of data points, Chebyshev's inequality states that at least $\frac{8}{9}$ of the data points are between $-20$ and $40$. What is the standard deviation of the data?

✅ Answer (try it yourself first):

- Chebyshev's inequality states that at least $1 - \frac{1}{z^2}$ of values are within $z$ standard deviations of the mean.
- When $z = 3$, $1 - \frac{1}{z^2} = \frac{8}{9}$.
- So, $-20$ is $3$ standard deviations below the mean, and $40$ is $3$ standard deviations above the mean.
- $10$ is in the middle of $-20$ and $40$, so the mean is $10$.
- $3$ standard deviations are between $10$ and $40$, so $1$ standard deviation is $\frac{30}{3} = 10$.
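The same reasoning can be written as a short calculation. This is a sketch of the arithmetic in the activity, solving $1 - \frac{1}{z^2} = \frac{8}{9}$ for $z$ and then reading off the mean and SD from the interval.

```python
import math

lower, upper = -20, 40      # interval from the activity
proportion = 8 / 9          # guaranteed proportion from the activity

# Solve 1 - 1/z^2 = proportion for z.
z = math.sqrt(1 / (1 - proportion))

mean = (lower + upper) / 2  # the interval is centered at the mean
sd = (upper - mean) / z     # z SDs fit between the mean and the upper endpoint

print(round(z, 6), mean, round(sd, 6))
```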

Standardization

Heights and weights 📏

We'll work with a data set containing the heights and weights of 5000 adult males.

Distributions of height and weight

Let's look at the distributions of both numerical variables.

Observation: The two distributions look like shifted and stretched versions of the same basic shape, called a bell curve 🔔.

Standard units

Suppose $x$ is a numerical variable, and $x_i$ is one value of that variable. The function

$$z(x_i) = \frac{x_i - \text{mean of $x$}}{\text{SD of $x$}}$$

converts $x_i$ to standard units: the number of standard deviations that $x_i$ lies above the mean. A negative value in standard units means $x_i$ is below the mean.

Example: Suppose someone weighs 225 pounds. What is their weight in standard units?
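The conversion is a single subtraction and division. The mean and SD below are placeholder values assumed for illustration, not the statistics of the lecture's data set, so the resulting number is only an example; `standard_units` is a hypothetical helper name.

```python
def standard_units(value, mean, sd):
    """Number of SDs that `value` lies above the mean (negative if below)."""
    return (value - mean) / sd

# Placeholder summary statistics, NOT the values from the lecture's data set.
assumed_mean_weight = 187.0
assumed_sd_weight = 20.0

z_225 = standard_units(225, assumed_mean_weight, assumed_sd_weight)
print(z_225)  # 1.9 with the placeholder numbers above
```

Under these assumed values, a weight of 225 pounds is 1.9 SDs above the mean.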

Standardization

The process of converting all values of a variable (i.e. a column) to standard units is known as standardization, and the resulting values are considered to be standardized.

The effect of standardization

Standardized variables have:

- A mean of 0.
- An SD of 1.

We often standardize variables to bring them to the same scale.
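As a quick check, standardizing any column gives values with mean 0 and SD 1, regardless of the original units. The arrays below are hypothetical heights and weights, and `standardize` is an illustrative helper rather than a library function.

```python
import numpy as np

def standardize(values):
    """Convert an array of values to standard units."""
    values = np.asarray(values, dtype=float)
    return (values - np.mean(values)) / np.std(values)

# Hypothetical heights (inches) and weights (pounds), for illustration only.
heights = np.array([64.0, 66.5, 68.0, 69.5, 71.0, 74.0])
weights = np.array([150.0, 165.0, 180.0, 190.0, 205.0, 230.0])

for name, col in [("height", heights), ("weight", weights)]:
    z = standardize(col)
    # Up to floating-point rounding, the mean is 0 and the SD is 1.
    print(name, "mean:", round(np.mean(z), 10), "SD:", round(np.std(z), 10))
```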

Aside: To quickly see summary statistics for a numerical Series, use the .describe() Series method.

Let's look at how the process of standardization works visually.

Standardized histograms

Now that we've standardized the distributions of height and weight, let's see how they look on the same set of axes.

The two standardized distributions look very similar!

The standard normal distribution

$$ \phi(z) = \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2}z^2} $$
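The formula can be evaluated directly and compared with `scipy.stats.norm.pdf`, which computes the same curve. The function name `phi` mirrors the symbol in the formula.

```python
import numpy as np
from scipy import stats

def phi(z):
    """Height of the standard normal curve at z, from the formula above."""
    return 1 / np.sqrt(2 * np.pi) * np.exp(-0.5 * z ** 2)

z_values = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(phi(z_values))

# The hand-written formula agrees with SciPy's implementation.
print(np.allclose(phi(z_values), stats.norm.pdf(z_values)))
```

The curve is symmetric around 0 and is tallest there, with height $\frac{1}{\sqrt{2\pi}} \approx 0.399$.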

The standard normal curve

Heights/weights are roughly normal

If a distribution follows this shape, we say it is roughly normal.

The standard normal distribution

Cumulative distribution functions

Areas under the standard normal curve

What does scipy.stats.norm.cdf(0) evaluate to? Why?
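One way to check an answer is to compute the area to the left of 0 numerically and compare it with the CDF. The sketch below uses `scipy.integrate.quad` for the integration; that choice is made here for illustration.

```python
import numpy as np
from scipy import stats, integrate

# Area under the standard normal curve to the left of 0, by numerical integration.
area_left_of_0, _ = integrate.quad(stats.norm.pdf, -np.inf, 0)

print(area_left_of_0)
print(stats.norm.cdf(0))
```

Both values are 0.5 (up to numerical error): the curve is symmetric around 0 and the total area is 1, so half of the area lies to the left of 0.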

Areas under the standard normal curve

Suppose we want to find the area to the right of 2 under the standard normal curve.

The following expression gives us the area to the left of 2.

However, since the total area under the standard normal curve is 1:

$$\text{area right of $2$} = 1 - (\text{area left of $2$})$$
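In code, the complement rule above looks like this (a minimal sketch using `scipy.stats.norm.cdf`):

```python
from scipy import stats

area_left_of_2 = stats.norm.cdf(2)
area_right_of_2 = 1 - area_left_of_2

print(area_left_of_2)
print(area_right_of_2)
```

Only a little over 2% of the area under the standard normal curve lies to the right of 2.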

Areas under the standard normal curve

How might we use stats.norm.cdf to compute the area between -1 and 0?

Strategy:

$$\text{area from $-1$ to $0$} = (\text{area left of $0$}) - (\text{area left of $-1$})$$

General strategy for finding area

The area under the standard normal curve in the interval $[a, b]$ is

stats.norm.cdf(b) - stats.norm.cdf(a)
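Wrapped in a function, the general strategy might look like the sketch below. `normal_area` is a hypothetical helper name, not part of SciPy.

```python
from scipy import stats

def normal_area(a, b):
    """Area under the standard normal curve between a and b (assumes a <= b)."""
    return stats.norm.cdf(b) - stats.norm.cdf(a)

print(normal_area(-1, 0))   # area from -1 to 0
print(normal_area(-1, 1))   # within 1 SD of the mean, roughly 0.68
print(normal_area(-2, 2))   # within 2 SDs of the mean, roughly 0.95
```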

What can we do with this? We're about to see!

Using the normal distribution

Let's return to our data set of heights and weights.

As we saw before, both variables are roughly normal. What benefit is there to knowing that the two distributions are roughly normal?

Standard units and the normal distribution

Example: Proportion of weights between 200 and 225 pounds

Let's suppose, as is often the case, that we don't have access to the entire distribution of weights, but just the mean and SD.

Using just this information, we can estimate the proportion of weights between 200 and 225 pounds:

  1. Convert 200 to standard units.
  2. Convert 225 to standard units.
  3. Use stats.norm.cdf to find the area between (1) and (2).
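The three steps above can be sketched as follows. The mean and SD are placeholder values assumed for illustration, not the actual statistics of the weights, so the printed proportion is only an example.

```python
from scipy import stats

# Placeholder summary statistics; substitute the mean and SD of the actual weights.
assumed_mean = 187.0
assumed_sd = 20.0

left = (200 - assumed_mean) / assumed_sd    # step 1: 200 in standard units
right = (225 - assumed_mean) / assumed_sd   # step 2: 225 in standard units

# Step 3: area under the standard normal curve between the two.
approx_proportion = stats.norm.cdf(right) - stats.norm.cdf(left)
print(approx_proportion)
```

This estimate relies on the weights being roughly normal; without that assumption, the mean and SD alone would only support Chebyshev-style bounds.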

Checking the approximation

Since we have access to the entire set of weights, we can compute the true proportion of weights between 200 and 225 pounds.
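The check compares a direct count against the normal approximation. Since the real data set is not reproduced here, the sketch below simulates 5000 roughly normal weights as a stand-in; the mean of 187 and SD of 20 used to generate them are assumptions.

```python
import numpy as np
from scipy import stats

# Simulated stand-in for the weights column (assumption: roughly normal data).
rng = np.random.default_rng(42)
weights = rng.normal(loc=187, scale=20, size=5000)

# "True" proportion, computed directly from the values.
true_proportion = np.mean((weights >= 200) & (weights <= 225))

# Normal approximation, using only the mean and SD.
mean, sd = np.mean(weights), np.std(weights)
approx_proportion = stats.norm.cdf((225 - mean) / sd) - stats.norm.cdf((200 - mean) / sd)

print(true_proportion, approx_proportion)
```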

Pretty good for an approximation! 🤩

Warning: Standardization doesn't make a distribution normal!

Consider the distribution of delays from earlier in the lecture.

The distribution above does not look normal, and it won't look normal even if we standardize it. Standardizing a distribution only shifts it horizontally and rescales it; the shape itself doesn't change.
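A numerical check of this warning: skewness, a measure of asymmetry, is unchanged by standardization. The sketch below uses simulated right-skewed data as a stand-in for the delays (an assumption; it is not the real data set) and `scipy.stats.skew`.

```python
import numpy as np
from scipy import stats

# Simulated right-skewed data standing in for the delays (illustrative assumption).
rng = np.random.default_rng(1)
delays = rng.exponential(scale=20, size=10_000)

standardized = (delays - np.mean(delays)) / np.std(delays)

# Skewness measures asymmetry; a normal distribution has skewness 0.
print(stats.skew(delays), stats.skew(standardized))

# For a normal distribution, about 68% of values lie within 1 SD of the mean.
print(np.mean(np.abs(standardized) <= 1))
```

The skewness is the same before and after standardizing, and the proportion within 1 SD is well above 68%, so normal-curve areas would be poor estimates for data shaped like this.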

Summary, next time

Summary: Spread and Chebyshev's inequality

Summary: Standard units and the normal distribution

Next time