Bootstrapping and Confidence Intervals
Concept
We use a bootstrapping to estimate the distribution of the sample statistic to see how different our it could have been. We use confidence intervals to define a range that captures most of the bootstrapped distribution of the sample statistic.
Bootstrapping: Bootstrapping is a type of hypothesis test that involves resampling from a single sample to estimate the distribution of the sample statistic. It answers the question of how different the sample statistic could have been if given a different sample. To conduct bootstrapping:
- Resample from the original sample with replacement.
- Calculate the sample statistic on the bootstrapped resample.
- Save the results into an array.
- Repeat steps 1 through 3 to generate an empirical distribution of the test statistic.
- Calculate the confidence interval and see if the observed statistic lies in it. If the observed statistic is not in our confidence interval, we have evidence to reject the null.
Confidence Intervals: A confidence interval is a range that captures most of the distribution of the bootstrapped sample statistic in the hopes of also containing the true population parameter within it. If we were to construct a 95% confidence interval, we aren't saying that there is a 95% chance that the true population parameter lies in the interval as the interval either contains it or it doesn't. Instead, we are saying that approximately 95% of the time, the intervals you create will contain the true population parameter. For example, if we generated 100 confidence intervals, about 95 of them will have the true population parameter.
When resampling, the size of the resample should be the same as the original sample with replacement.
The diagram below provides an overview of conducting bootstrapping, although it references a different dataset.
The diagram below provides an overview of creating confidence intervals, although it references a different situation. For additional helpful visual guides, please visit the Diagrams site.
Code Example
1. Take a random sample of size 12 from the full_pets
DataFrame.
Let's say we didn't have access to all of the information in the full_pets
DataFrame and were only able to collect a sample of 12 pets.
# Magic to ensure that we get the same results every time this code is run.
np.random.seed(42)
# sample
pets_sample = full_pets.sample(12, replace=False)
pets_sample
Index | Unnamed: 0 | ID | Species | Color | Weight | Age | Is_Cat | Owner_Comment |
---|---|---|---|---|---|---|---|---|
18 | 18 | cat_006 | cat | black | 3 | 0.5 | True | No, thank you! |
14 | 14 | dog_007 | dog | white | 50 | 6.1 | False | No, thank you! |
4 | 4 | dog_003 | dog | black | 25 | 0.5 | False | Be the person your dog thinks you are. |
13 | 13 | ham_003 | hamster | black | 0.5 | 0.1 | False | No, thank you! |
10 | 10 | dog_006 | dog | golden | 35 | 4 | False | No, thank you! |
7 | 7 | cat_003 | cat | black | 10 | 0 | True | No, thank you! |
6 | 6 | ham_002 | hamster | golden | 0.25 | 0.2 | False | No, thank you! |
3 | 3 | dog_002 | dog | white | 80 | 2 | False | Love is a wet nose and a wagging tail. |
2 | 2 | cat_002 | cat | black | 15 | 9 | True | ****All you need is love and a cat.**** |
15 | 15 | ham_004 | hamster | golden | 0.25 | 0.2 | False | No, thank you! |
17 | 17 | dog_009 | dog | white | 30 | 4.8 | False | No, thank you! |
8 | 8 | dog_004 | dog | black | 45 | 6.7 | False | No, thank you! |
2. Find the observed parameter
In this case, we are interested in finding the median weight of the entire population.
pets_sample = full_pets.sample(12, replace=False)
print('Median of pets_sample weight:', pets_sample.get('Weight').median())
Median of pets_sample weight: 20.0
3. Bootstrap the sample 10,000 times with replacement
Since we were only able to collect one random sample from the full population, we can't be sure if this singular guess predicts the true population parameter well. We can't go out and collect another random sample, so we will resample from the original sample with replacement to simulate what could've been.
boot_medians = np.array([])
for i in np.arange(10000):
# 1. resample the data
resample = pets_sample.sample(pets_sample.shape[0], replace=True)
# 2. calculate the median of the resample
boot_median = resample.get('Weight').median()
# 3. append the median to the array
boot_medians = np.append(boot_medians, boot_median)
This code will create 10,000 bootstrapped samples and calculate the median for each of them, but a different reasonable number can be used instead. Since these samples are all random, the information in each sample and median will be different from one another.
4. Create a 95% confidence interval
Instead of using a single number to estimate the true population parameter, we create a range of where we think it is.
# Get the 95% confidence interval
left = np.percentile(boot_medians, 2.5) # 2.5th percentile
right = np.percentile(boot_medians, 97.5) # 97.5th percentile
Remember that the 95% confidence interval does not mean we have a 95% chance of containing the true population parameter. It means that about 95% of all intervals we create will contain the true population parameter.
5. Conclusion
left, right
(1.75, 40.0)
- From this interval, we are 95% confident that the true population median lies somewhere between 1.75 and 40.
- We have no way of knowing where exactly in this interval does the true population median falls or even if it is contained at all.
- What we do know is that if we were to repeat the process and generate multiple confidence intervals, roughly 95% of them will contain the true population median.
6. Extra
Let's look at the distribution of the bootstrapped medians!
# Create the histogram.
# Plot the histogram of boot_medians
plt.hist(boot_medians, bins=20, density=True, ec = 'w')
plt.show()
A 95% confidence level means that approximately 95% of the time, the intervals you create through this process will contain the true population parameter.