Lecture 19 – Bootstrapping, Percentiles, and Confidence Intervals

DSC 10, Fall 2022

Announcements

Agenda

Resources

Bootstrapping 🥾

City of San Diego employee salary data

All City of San Diego employee salary data is public. We are using the latest available data.

We only need the 'TotalWages' column, so let's get just that column.

The median salary

Let's be realistic...

In the language of statistics

The sample median

Let's survey 500 employees at random. To do so, we can use the .sample method.
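As a rough illustration of what drawing such a sample looks like, here is a minimal sketch using NumPy instead of the table's .sample method. The `population` array is synthetic stand-in data (an assumption for illustration, not the real salary data), and the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed, for reproducibility only

# Hypothetical stand-in for the 'TotalWages' column: NOT the real salary data.
population = rng.lognormal(mean=11, sigma=0.6, size=12_000)

# Survey 500 employees at random, without replacement,
# so no employee can appear in the sample twice.
my_sample = rng.choice(population, size=500, replace=False)

# The sample median is our estimate of the population median.
population_median = np.median(population)
sample_median = np.median(my_sample)
print(population_median, sample_median)
```

In practice we would not have access to `population_median`; it is computed here only to show that the sample median lands near it.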

We won't reassign my_sample at any point in this notebook, so it will always refer to this particular sample.

How confident are we that this is a good estimate?

The sample median is random

An impractical approach

The problem

Note that, unlike the previous histogram, this one depicts the distribution of the population and of one particular sample (my_sample), not the distribution of sample medians across 1000 samples.

The bootstrap

To replace or not replace?

Running the bootstrap

We can simulate the act of collecting new samples by sampling with replacement from our original sample, my_sample.
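A minimal sketch of this resampling loop, assuming NumPy. The stand-in `my_sample` below is synthetic (not the real sample), and the names `n_resamples` and `boot_medians` and the choice of 5000 repetitions are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed

# Hypothetical stand-in for the original sample of 500 wages.
my_sample = rng.lognormal(mean=11, sigma=0.6, size=500)

n_resamples = 5000  # assumed number of bootstrap repetitions
boot_medians = np.empty(n_resamples)

for i in range(n_resamples):
    # A resample has the same size as the original sample and is drawn
    # WITH replacement, so some values repeat and others are left out.
    resample = rng.choice(my_sample, size=len(my_sample), replace=True)
    boot_medians[i] = np.median(resample)

print(boot_medians[:5])
```

Sampling without replacement at the full sample size would return the same 500 values every time, so every resample would have the same median. Replacement is what makes the resamples differ.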

Bootstrap distribution of the sample median

What's the point of bootstrapping?

We have a sample median wage:

With it, we can say that the population median wage is approximately \$72,016, and not much else.

But by bootstrapping, we can generate an empirical distribution of the sample median:

which allows us to say things like

We think the population median wage is between \$67,000 and \$77,000.

Question: We could also say that we think the population median wage is between \$70,000 and \$75,000, or between \$60,000 and \$80,000. What range should we pick?

Percentiles

Mathematical definition

Let $p$ be a number between 0 and 100. The $p$th percentile of a collection is the smallest value in the collection that is at least as large as $p$% of all the values.

By this definition, any percentile between 0 and 100 can be computed for any collection of values and is always an element of the collection.

How to calculate percentiles using the mathematical definition

Suppose there are $n$ elements in the collection. To find the $p$th percentile:

  1. Sort the collection in increasing order.
  2. Define $h$ to be $p\%$ of $n$:
     $$h = \frac{p}{100} \cdot n$$
  3. If $h$ is an integer, define $k = h$. Otherwise, let $k$ be the smallest integer greater than $h$.
  4. Take the $k$th element of the sorted collection (start counting from 1, not 0).
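The steps above can be sketched as a short Python function. The name `percentile` and its error handling are assumptions for illustration. One more assumption: for $p = 0$ the steps give $k = 0$, so the sketch uses $k = 1$ instead, which matches the definition (the smallest value is at least as large as 0% of the values).

```python
import math

def percentile(values, p):
    """Return the p-th percentile of `values`, following the four steps."""
    if len(values) == 0:
        raise ValueError("values must not be empty")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")

    ordered = sorted(values)   # Step 1: sort in increasing order.
    n = len(ordered)
    h = p * n / 100            # Step 2: h is p% of n.
    # Step 3: k = h if h is an integer, else the smallest integer above h.
    # math.ceil does exactly that; max(..., 1) covers p = 0 (assumption).
    k = max(math.ceil(h), 1)
    return ordered[k - 1]      # Step 4: k counts from 1, Python from 0.

# Sorted: [1, 3, 5, 7, 9]; h = 2.5, so k = 3.
print(percentile([1, 7, 3, 9, 5], 50))  # → 5
```

Because `h` is computed with floating-point arithmetic, an `h` that should be a whole number could in rare cases come out slightly above it and push `k` up by one. The sketch does not guard against that.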

Example

What is the 25th percentile of the array np.array([4, 10, 15, 21, 100])?


Solution:
  1. First, we need to sort the collection in increasing order. Conveniently, it's already sorted!
  2. Define $h = \frac{p}{100} \cdot n$. Here, $p = 25$ and $n = 5$, so $h = \frac{25}{100} \cdot 5 = \frac{5}{4} = 1.25$.
  3. Since 1.25 is not an integer, $k$ must be the smallest integer greater than 1.25, which is 2.
  4. If we start counting at 1, the element at position 2 is 10, so the 25th percentile is 10.

Reflection

Consider the array from the previous slide, np.array([4, 10, 15, 21, 100]). Here's how our percentile formula works:

| value | 4 | 10 | 15 | 21 | 100 |
| --- | --- | --- | --- | --- | --- |
| percentile | [0, 20] | (20, 40] | (40, 60] | (60, 80] | (80, 100] |

For instance, the 8th percentile is 4, the 50th percentile (median) is 15, and the 79th percentile is 21.

Notice that in the table above, each of the 5 values owns an equal percentage (20\%) of the range 0-100. 4 is the 20th percentile, but 10 is the 20.001st percentile.

Concept Check ✅ – Answer at cc.dsc10.com

What is the 70th percentile of the array np.array([70, 18, 56, 89, 55, 35, 10, 45])?

A. 35   B. 55   C. 56   D. 70   E. None of these


Solution (try it yourself first):

  1. First, we need to sort the collection in increasing order. This gives us np.array([10, 18, 35, 45, 55, 56, 70, 89]).
  2. Define $h = \frac{p}{100} \cdot n$. Here, $p = 70$ and $n = 8$, so $h = \frac{70}{100} \cdot 8 = 5.6$.
  3. Since 5.6 is not an integer, $k$ must be the smallest integer greater than 5.6, which is 6.
  4. If we start counting at 1, the element at position 6 is 56, so the 70th percentile is 56.

Calculating the percentile using our mathematical definition

Another definition of percentile

Confidence intervals

Using the bootstrapped distribution of sample medians

Earlier in the lecture, we generated a bootstrapped distribution of sample medians.

What can we do with this distribution, now that we know about percentiles?

Using the bootstrapped distribution of sample medians

Confidence intervals

Let's be a bit more precise.

Finding endpoints

Computing a confidence interval
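One way to compute the endpoints, sketched under stated assumptions: take the middle 95% of the bootstrapped medians by cutting 2.5% from each tail. The stand-in sample is synthetic, the names `boot_medians`, `left`, and `right` are illustrative, and this is not necessarily the course's exact code. Note that `np.percentile` interpolates between values by default, so it can differ slightly from the mathematical definition given earlier.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed

# Hypothetical stand-in for the original sample of 500 wages.
my_sample = rng.lognormal(mean=11, sigma=0.6, size=500)

# Bootstrap: resample with replacement and record each resample's median.
boot_medians = np.array([
    np.median(rng.choice(my_sample, size=len(my_sample), replace=True))
    for _ in range(5000)
])

# A 95% interval leaves 5% outside, split evenly between the two tails,
# so the endpoints are the 2.5th and 97.5th percentiles.
left = np.percentile(boot_medians, 2.5)
right = np.percentile(boot_medians, 97.5)
print(left, right)
```

For a different confidence level, only the two percentiles change. For example, a 90% interval would use the 5th and 95th percentiles.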

You will use the code above very frequently moving forward!

Visualizing our 95% confidence interval

Concept Check ✅ – Answer at cc.dsc10.com

We computed the following 95% confidence interval:

If we instead computed an 80% confidence interval, would it be wider or narrower?

A. Wider   B. Narrower   C. Impossible to tell

Reflection

Now, instead of saying

We think the population median is close to our sample median, \$72,016.

We can say:

A 95% confidence interval for the population median is \$67,081 to \$76,383.

These endpoints may be slightly different from the endpoints we found, due to randomness.

Some lingering questions: What does 95% confidence mean? What are we confident about? Is this technique always "good"?

Summary, next time

Summary

Next time

We will: