Lecture 20 – Confidence Intervals, Center and Spread

DSC 10, Fall 2022

Announcements

Agenda

Interpreting confidence intervals

Recap: City of San Diego employee salaries

Let's rerun our code from last time to compute a 95% confidence interval for the median salary of all San Diego city employees, based on a sample of 500 people.

Step 1: Collect a single sample of size 500 from the population.

Step 2: Bootstrap! That is, resample from the sample a large number of times, and each time, compute the median of the resample. This will generate an empirical distribution of the sample median.

Step 3: Take the middle 95% of the empirical distribution of sample medians (i.e. boot_medians). This creates our 95% confidence interval.
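The three steps above can be sketched in NumPy as follows. This is a minimal illustration, not the lecture's own code: the `population` array is synthetic stand-in data (the real salary dataset is not reproduced here), and the names `my_sample`, `n_resamples`, `left`, and `right` are assumptions made for the example. Only `boot_medians` comes from the text above.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in for the population of salaries (synthetic, right-skewed).
population = rng.lognormal(mean=11.2, sigma=0.4, size=12_000)

# Step 1: collect a single sample of size 500, without replacement.
my_sample = rng.choice(population, size=500, replace=False)

# Step 2: bootstrap. Resample from the sample WITH replacement, using the
# same size as the sample, and record the median of each resample.
n_resamples = 5_000
boot_medians = np.array([
    np.median(rng.choice(my_sample, size=len(my_sample), replace=True))
    for _ in range(n_resamples)
])

# Step 3: take the middle 95% of the bootstrapped medians.
left = np.percentile(boot_medians, 2.5)
right = np.percentile(boot_medians, 97.5)
print(f"95% CI for the population median: [{left:,.0f}, {right:,.0f}]")
```

Note that the resamples are drawn from `my_sample`, never from `population`; in practice the population is not available, which is the reason for bootstrapping in the first place.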

Confidence intervals describe a guess for the value of an unknown parameter

Now, instead of saying

We think the population median is close to our sample median, \$72,016.

We can say:

A 95% confidence interval for the population median is \$66,987 to \$76,527.

Today, we'll address: What does 95% confidence mean? What are we confident about? Is this technique always "good"?

Interpreting confidence intervals

Capturing the true value

Many confidence intervals

In the visualization below, which confidence intervals don't contain the true parameter?
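One way to explore this question numerically is to repeat the whole process (new sample, new bootstrap, new interval) many times and count how often the interval captures the true parameter. The sketch below assumes a synthetic `population`, so the true median is known; the helper `bootstrap_ci` and the counts `n_intervals` and `n_resamples` are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic population (an assumption), so the true parameter is known.
population = rng.lognormal(mean=11.2, sigma=0.4, size=12_000)
true_median = np.median(population)

def bootstrap_ci(sample, rng, n_resamples=500, level=95):
    """Percentile bootstrap confidence interval for the median of `sample`."""
    # Each row is one resample of the same size as the sample.
    resamples = rng.choice(sample, size=(n_resamples, len(sample)), replace=True)
    medians = np.median(resamples, axis=1)
    tail = (100 - level) / 2
    return np.percentile(medians, tail), np.percentile(medians, 100 - tail)

# Build many intervals, each from a fresh sample of size 500.
n_intervals = 200
captured = 0
for _ in range(n_intervals):
    sample = rng.choice(population, size=500, replace=False)
    left, right = bootstrap_ci(sample, rng)
    if left <= true_median <= right:
        captured += 1

coverage = captured / n_intervals
print(f"{captured} of {n_intervals} intervals contain the true median")
```

The proportion should land near 0.95 but will rarely equal it exactly: the 95% describes how often the interval-building process succeeds across samples, not any single interval.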

Confidence tradeoffs

Misinterpreting confidence intervals

Confidence intervals can be hard to interpret.

Does this interval contain 95% of all salaries? No!

However, this interval does contain 95% of all bootstrapped median salaries.

Is there a 95% chance that this interval contains the population parameter? No!

Why not?

Bootstrap rules of thumb

Example: Estimating the max of a population

Visualize

Since we have access to the population, we can find the population maximum directly, without bootstrapping.

Does the population maximum lie within the bulk of the bootstrapped distribution?

No, the bootstrapped distribution doesn't capture the population maximum (blue dot) of \$359,138. Why not? 🤔
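A short experiment hints at the reason. Every resample is drawn from the original sample, so a bootstrapped maximum can never be larger than the sample maximum, and the sample maximum is usually below the population maximum. The sketch below uses synthetic data as a stand-in for the salaries; `boot_maxes` and the other names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for the population (an assumption, not the real salaries).
population = rng.lognormal(mean=11.2, sigma=0.4, size=12_000)
population_max = population.max()

# One sample of size 500, then bootstrap the maximum.
my_sample = rng.choice(population, size=500, replace=False)
boot_maxes = np.array([
    rng.choice(my_sample, size=len(my_sample), replace=True).max()
    for _ in range(2_000)
])

left, right = np.percentile(boot_maxes, [2.5, 97.5])

# No resample can contain a value larger than the largest value in the sample.
print("Largest bootstrapped max:", round(boot_maxes.max()))
print("Sample max:              ", round(my_sample.max()))
print("Population max:          ", round(population_max))
print(f"95% bootstrap interval:   [{left:,.0f}, {right:,.0f}]")
```

Unless the sample happens to include the single largest value in the population, the interval's right endpoint sits below the population maximum, so the interval misses it.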

Confidence intervals for hypothesis testing

Using a confidence interval for hypothesis testing

It turns out that we can use bootstrapped confidence intervals for hypothesis testing!

Example: Fire-Rescue Department 🚒

Setting up a hypothesis test

Testing the hypotheses

Finding the interval

Is \$74,441 in this interval? No. ❌
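The check above can be sketched in code: build a bootstrapped 95% confidence interval for the parameter, then ask whether the hypothesized value lies inside it. Only the value \$74,441 comes from the slide; the `sample` below is synthetic, and the names `hypothesized_median`, `boot_medians`, and `reject_null` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(2)

# Value proposed by the null hypothesis (taken from the slide above).
hypothesized_median = 74_441

# Hypothetical sample of 300 salaries (synthetic numbers, not the real data).
sample = rng.normal(loc=90_000, scale=20_000, size=300)

# Bootstrap the sample median: each row is one resample of the sample.
resamples = rng.choice(sample, size=(5_000, len(sample)), replace=True)
boot_medians = np.median(resamples, axis=1)

# 95% confidence interval: the middle 95% of the bootstrapped medians.
left, right = np.percentile(boot_medians, [2.5, 97.5])

# Reject the null when the hypothesized value falls outside the interval.
reject_null = not (left <= hypothesized_median <= right)
print(f"95% CI: [{left:,.0f}, {right:,.0f}]")
print("Reject the null hypothesis?", reject_null)
```

Using a 95% interval in this way corresponds to testing at a 5% significance level; a 99% interval would correspond to a 1% level.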

Conclusion of the hypothesis test

Summary of methods

Mean and median

The mean (i.e. average)

The mean is a one-number summary of a set of numbers. For example, the mean of $2, 3, 3,$ and $9$ is $\frac{2 + 3 + 3 + 9}{4} = 4.25$.
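The same calculation in NumPy, both by hand and with the built-in function (the array name `data` and the variable `mean_by_hand` are chosen here for illustration):

```python
import numpy as np

data = np.array([2, 3, 3, 9])

# Sum of the values divided by the number of values.
mean_by_hand = data.sum() / len(data)
print(mean_by_hand)   # 4.25

# NumPy's built-in mean gives the same answer.
print(np.mean(data))  # 4.25
```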

Observe that the mean:

The median

Activity

Create a set of data points that has this histogram. (You can do it with a short list of whole numbers.)



What are its mean and median?

Concept Check ✅ – Answer at cc.dsc10.com

Are the means of these two distributions the same or different? What about the medians?

Example: Flight delays ✈️

Question: Which is larger – the mean or the median?

Comparing the mean and median

Standard deviation

Question: How "wide" is a distribution?

Deviations from the mean

Each entry in deviations describes how far the corresponding element in data is from 4.25.

What is the average deviation?

Average squared deviation

This quantity, the average squared deviation from the mean, is called the variance.
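As a worked sketch, here are the deviations, their average, and the average squared deviation for the numbers 2, 3, 3, and 9, whose mean is 4.25. The names `data` and `deviations` follow the text above; `variance` is a name chosen for this example.

```python
import numpy as np

data = np.array([2, 3, 3, 9])

# How far each element is from the mean (4.25).
deviations = data - data.mean()
print(deviations)            # [-2.25 -1.25 -1.25  4.75]

# Positive and negative deviations cancel, so their average is 0.
print(deviations.mean())     # 0.0

# Squaring makes every deviation non-negative before averaging.
variance = np.mean(deviations ** 2)
print(variance)              # 7.6875
```

The average deviation is 0 for any dataset, which is why it cannot measure spread and why the deviations are squared first.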

Standard deviation

Variance and standard deviation

To summarize:

$$\begin{align*}\text{variance} &= \text{average squared deviation from the mean}\\ &= \frac{(\text{value}_1 - \text{mean})^2 + ... + (\text{value}_n - \text{mean})^2}{n}\\ \text{standard deviation} &= \sqrt{\text{variance}} \end{align*}$$

where $n$ is the number of observations.
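The formulas above translate directly into code. In this sketch, `variance` and `standard_deviation` are hypothetical helper names; the comparison at the end relies on the fact that `np.std` divides by $n$ by default, which matches the definition given here.

```python
import numpy as np

def variance(values):
    """Average squared deviation from the mean (divides by n)."""
    values = np.asarray(values, dtype=float)
    deviations = values - values.mean()
    return np.mean(deviations ** 2)

def standard_deviation(values):
    """Square root of the variance."""
    return np.sqrt(variance(values))

data = [2, 3, 3, 9]
print(variance(data))            # 7.6875
print(standard_deviation(data))  # roughly 2.77

# np.std uses the same divide-by-n definition by default (ddof=0).
print(np.isclose(standard_deviation(data), np.std(data)))  # True
```

Because of the square root, the standard deviation has the same units as the data, while the variance is in squared units.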

What can we do with the standard deviation?

It turns out, no matter what the shape of the distribution is, the bulk of the data are in the range “average ± a few SDs”.
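To get a feel for this claim, the sketch below measures what fraction of a deliberately skewed dataset falls within 2 and 3 SDs of its average. The data are synthetic, and `fraction_within` is a hypothetical helper written for this example.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical right-skewed data (synthetic), far from bell-shaped.
values = rng.exponential(scale=10, size=10_000)

def fraction_within(values, k):
    """Fraction of values in the range mean - k*SD to mean + k*SD."""
    mean = values.mean()
    sd = values.std()  # divides by n, matching the definition of SD above
    return np.mean(np.abs(values - mean) <= k * sd)

for k in [2, 3]:
    print(f"Within {k} SDs of the average: {fraction_within(values, k):.1%}")
```

Swapping in a different distribution for `values` changes the exact percentages, but the large majority of the data stays within a few SDs of the average.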

More on this next class!

Summary, next time

Summary: Confidence intervals and hypothesis testing

Summary: Center and spread

Next time