Lecture 27 – Review, Conclusion

DSC 10, Winter 2023

Announcements

Agenda

Bakeries 🧁

Consider this population of bakeries in San Francisco.

For reference, the mean and standard deviation of the population distribution are calculated below.

In this case we happen to have the inspection scores for all members of the population, but in reality we won't. So let's instead take a random sample of 200 bakeries from the population.

Note that since we took a large, random sample of the population, we expect that our sample looks similar to the population and has a similar mean and SD.

Indeed, the sample mean is quite close to the population mean, and the sample standard deviation is quite close to the population standard deviation.
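The comparison above can be sketched as follows. The real inspection scores aren't reproduced here, so `population` below is a synthetic, hypothetical stand-in (left-skewed scores capped at 100); only the sampling and comparison steps are meant to carry over.

```python
import numpy as np

np.random.seed(27)  # fixed seed so the sketch is reproducible

# ASSUMPTION: synthetic stand-in for the population of inspection scores,
# NOT the real SF bakery data.
population = np.clip(np.round(100 - np.random.exponential(scale=10, size=1500)), 50, 100)

# One random sample of 200 bakeries, drawn without replacement.
sample = np.random.choice(population, size=200, replace=False)

print('Population mean:', population.mean(), ' Population SD:', np.std(population))
print('Sample mean:    ', sample.mean(), ' Sample SD:    ', np.std(sample))
```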

Let's suppose we want to estimate the population mean (that is, the mean inspection score of all bakeries in SF).

One estimate of the population mean is the mean of our sample.

However, our sample was random and could have been different, meaning our sample mean could also have been different.

Question: What's a reasonable range of possible values for the sample mean? What is the distribution of the sample mean?

The Central Limit Theorem

The Central Limit Theorem (CLT) says that the probability distribution of the sum or mean of a large random sample drawn with replacement will be roughly normal, regardless of the distribution of the population from which the sample is drawn.

To see an empirical distribution of the sample mean, let's take a large number of samples directly from the population and compute the mean of each one.

Remember, in real life we wouldn't be able to do this, since we wouldn't have access to the population.

Unsurprisingly, the distribution of the sample mean is bell-shaped. The CLT told us that!
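Here's a minimal sketch of that simulation. Again, `population` is a synthetic, hypothetical stand-in rather than the real data, and we draw each sample with replacement to match the statement of the CLT above. Instead of a histogram, we check one property of a bell-shaped distribution: about 95% of values should fall within 2 SDs of the center.

```python
import numpy as np

np.random.seed(27)

# ASSUMPTION: synthetic stand-in for the population of inspection scores.
population = np.clip(np.round(100 - np.random.exponential(scale=10, size=1500)), 50, 100)

sample_means = np.array([])
for i in range(2000):
    # One sample of 200 scores, drawn with replacement.
    one_sample = np.random.choice(population, size=200, replace=True)
    sample_means = np.append(sample_means, one_sample.mean())

# For a roughly normal distribution, about 95% of values lie within 2 SDs of the mean.
center = sample_means.mean()
spread = np.std(sample_means)
within_2_sds = np.count_nonzero(np.abs(sample_means - center) <= 2 * spread) / len(sample_means)
print('Proportion of sample means within 2 SDs of the center:', within_2_sds)
```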

The CLT also tells us that

$$\text{SD of Distribution of Possible Sample Means} = \frac{\text{Population SD}}{\sqrt{\text{sample size}}}$$

Let's try this out.

Pretty close! Remember that sample_means is an array of simulated sample means; the more samples we simulate, the closer np.std(sample_means) will get to the SD described by the CLT.
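A sketch of this check, under the same assumptions as before: a synthetic, hypothetical `population`, and samples of size 200 drawn with replacement. The names `simulated_sd` and `clt_sd` are just for illustration.

```python
import numpy as np

np.random.seed(27)

# ASSUMPTION: synthetic stand-in for the population of inspection scores.
population = np.clip(np.round(100 - np.random.exponential(scale=10, size=1500)), 50, 100)
sample_size = 200

# 2000 simulated sample means.
sample_means = np.array([
    np.random.choice(population, size=sample_size, replace=True).mean()
    for _ in range(2000)
])

simulated_sd = np.std(sample_means)                      # SD of the simulated sample means
clt_sd = np.std(population) / np.sqrt(sample_size)       # SD predicted by the CLT
print('Simulated SD:', simulated_sd)
print('CLT SD:      ', clt_sd)
```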

Note that in practice, we won't have the SD of the population, since we'll usually just have a single sample. In such cases, we can use the SD of the sample as an estimate of the SD of the population:

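Using the CLT, we have that the distribution of the sample mean is roughly normal, is centered at the population mean, and has an SD of approximately $\frac{\text{sample SD}}{\sqrt{\text{sample size}}}$.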

Using this information, we can build a confidence interval for where we think the population mean might be. A 95% confidence interval for the population mean is given by

$$ \left[ \text{sample mean} - 2\cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}}, \ \text{sample mean} + 2\cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}} \right] $$
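As a rough sketch, the interval above translates directly into code. The function name `clt_confidence_interval` is made up for illustration, and the `sample` used here is synthetic, not the real sample of bakeries.

```python
import numpy as np

def clt_confidence_interval(sample):
    """Approximate 95% CI for the population mean, using the formula above."""
    sample = np.asarray(sample)
    sample_mean = sample.mean()
    # The sample SD stands in for the unknown population SD.
    sd_of_sample_mean = np.std(sample) / np.sqrt(len(sample))
    return [sample_mean - 2 * sd_of_sample_mean, sample_mean + 2 * sd_of_sample_mean]

np.random.seed(27)

# ASSUMPTION: synthetic stand-in for a single sample of 200 inspection scores.
sample = np.clip(np.round(100 - np.random.exponential(scale=10, size=200)), 50, 100)

left, right = clt_confidence_interval(sample)
print('Approximate 95% CI for the population mean:', [left, right])
```

Note that this only needs a single sample; no resampling is involved.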

Concept Check ✅ – Answer at cc.dsc10.com

Using a single sample of 200 bakeries, how can we estimate the median inspection score of all bakeries in San Francisco with an inspection score? What technique should we use?

A. Standard hypothesis testing

B. Permutation testing

C. Bootstrapping

D. The Central Limit Theorem

Answer: C. Bootstrapping. The CLT only applies to sample means (and sums), not to any other statistics.

There is no CLT for sample medians, so instead we'll have to resort to bootstrapping to estimate the distribution of the sample median.

Recall, bootstrapping is the act of sampling from the original sample, with replacement. This is also called resampling.

Let's resample repeatedly.

Note that this distribution is not at all normal.

To compute a 95% confidence interval, we take the middle 95% of the bootstrapped medians.
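The bootstrap procedure described above might look like the sketch below. The `sample` is synthetic (a hypothetical stand-in for our 200 bakeries), and 5,000 resamples is an arbitrary choice for illustration.

```python
import numpy as np

np.random.seed(27)

# ASSUMPTION: synthetic stand-in for a single sample of 200 inspection scores.
sample = np.clip(np.round(100 - np.random.exponential(scale=10, size=200)), 50, 100)

boot_medians = np.array([])
for i in range(5000):
    # Resample from the original sample, with replacement, at the same size.
    resample = np.random.choice(sample, size=len(sample), replace=True)
    boot_medians = np.append(boot_medians, np.median(resample))

# The middle 95% of the bootstrapped medians.
left = np.percentile(boot_medians, 2.5)
right = np.percentile(boot_medians, 97.5)
print('Bootstrapped 95% CI for the population median:', [left, right])
```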

Discussion Question

Which of the following interpretations of this confidence interval are valid?

  1. 95% of SF bakeries have an inspection score between 85 and 88.
  2. 95% of the resamples have a median inspection score between 85 and 88.
  3. There is a 95% chance that our sample has a median inspection score between 85 and 88.
  4. There is a 95% chance that the median inspection score of all SF bakeries is between 85 and 88.
  5. If we had taken 100 samples from the same population, about 95 of these samples would have a median inspection score between 85 and 88.
  6. If we had taken 100 samples from the same population, about 95 of the confidence intervals created would contain the median inspection score of all SF bakeries.

Answer: The correct answers are Option 2 and Option 6.

Physicians 🩺

The setup

You work as a family physician. You collect data and find that of 6354 patients, 3115 were children and 3239 were adults.

You want to test the following hypotheses:

Concept Check ✅ – Answer at cc.dsc10.com

Which test statistic(s) could be used for this hypothesis test? Which values of the test statistic point towards the alternative?

A. Proportion of children seen
B. Number of children seen
C. Number of children minus number of adults seen
D. Absolute value of number of children minus number of adults seen

There may be multiple correct answers; choose one.

Answer: A, B, and C. All of these but the last one would work for this alternative. Small values of these statistics would favor the alternative. If the alternative were instead "Family physicians see a different number of children and adults", the last option would work while the first three wouldn't.

Let's use option B, the number of children seen, as a test statistic. Small values of this statistic favor the alternative hypothesis.

How do we generate a single value of the test statistic?
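One way to do this, sketched with np.random.multinomial. This assumes the null model is that each of the 6354 patients is equally likely to be a child or an adult (proportions [0.5, 0.5]); the function name `one_test_statistic` is just for illustration.

```python
import numpy as np

def one_test_statistic():
    # ASSUMPTION: under the null, children and adults are equally likely.
    # np.random.multinomial returns the simulated counts in each category.
    counts = np.random.multinomial(6354, [0.5, 0.5])
    return counts[0]  # simulated number of children

simulated = one_test_statistic()
print('One simulated number of children:', simulated)
```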

As usual, let's simulate the test statistic many, many times (10,000).
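A sketch of the repeated simulation, under the same assumed null model (children and adults equally likely). Each iteration generates one simulated number of children and appends it to `test_stats`.

```python
import numpy as np

np.random.seed(27)

test_stats = np.array([])
for i in range(10000):
    # ASSUMPTION: null proportions of [0.5, 0.5] for [children, adults].
    num_children = np.random.multinomial(6354, [0.5, 0.5])[0]
    test_stats = np.append(test_stats, num_children)

print('Number of simulated statistics:', len(test_stats))
print('Mean of simulated statistics:  ', test_stats.mean())
```

The simulated statistics should be centered near half of 6354, since that's what the null model predicts.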

Recall that you collected data and found that of 6354 patients, 3115 were children and 3239 were adults.

Concept Check ✅ – Answer at cc.dsc10.com

What goes in blank (a)?

p_value = np.count_nonzero(test_stats __(a)__ 3115) / 10000

A. >=

B. >

C. <=

D. <

Answer: C. <=. Small values of the test statistic favor the alternative, so the p-value is the proportion of simulated statistics that are less than or equal to the observed value, 3115.

Concept Check ✅ – Answer at cc.dsc10.com

What do we do, assuming that we're using a 5% p-value cutoff?

A. Reject the null

B. Fail to reject the null

C. It depends

Answer: B. Fail to reject the null, since the p-value is above 0.05.
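Putting the pieces together, here's a compact sketch of the whole test. It assumes the same null model as before (children and adults equally likely) and, for brevity, uses the `size` argument of np.random.multinomial to run all 10,000 simulations at once rather than looping.

```python
import numpy as np

np.random.seed(27)

# With size=10000, the result has shape (10000, 2);
# column 0 holds the simulated numbers of children.
# ASSUMPTION: null proportions of [0.5, 0.5] for [children, adults].
test_stats = np.random.multinomial(6354, [0.5, 0.5], size=10000)[:, 0]

# Small values favor the alternative, so we count simulated
# statistics at or below the observed 3115.
p_value = np.count_nonzero(test_stats <= 3115) / 10000
print('p-value:', p_value)
print('Reject the null at the 5% cutoff?', p_value <= 0.05)
```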

Note that while we used np.random.multinomial to simulate the test statistic, we could have used np.random.choice, too:
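A minimal sketch of that alternative, again assuming a null model where children and adults are equally likely. The labels 'child' and 'adult' and the function name are hypothetical choices for illustration.

```python
import numpy as np

def one_test_statistic_choice():
    # np.random.choice samples with replacement and with equal
    # probabilities by default, matching the assumed null model.
    patients = np.random.choice(['child', 'adult'], size=6354)
    return np.count_nonzero(patients == 'child')  # simulated number of children

simulated = one_test_statistic_choice()
print('One simulated number of children:', simulated)
```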

Concept Check ✅ – Answer at cc.dsc10.com

Is this an example of bootstrapping?

A. Yes, because we are sampling with replacement.

B. No, this is not bootstrapping.

Answer: B. No, this is not bootstrapping. Bootstrapping is when we resample from a single sample; here we're simulating data under the assumptions of a model.

Personal projects

Using Jupyter Notebooks after DSC 10

Finding data

These sites allow you to search for datasets (in CSV format) from a variety of different domains. Some may require you to sign up for an account; these are generally reputable sources.

Note that all of these links are also available at rampure.org/find-datasets.

Domain-specific sources of data

Tip: if a site only allows you to download a file as an Excel file, not a CSV file, you can download it, open it in a spreadsheet viewer (Excel, Numbers, Google Sheets), and export it to a CSV.

Join a DS3 Project Group 🤝

The Data Science Student Society organizes project groups, which are a great way to get experience and build your resume. Keep your eye out for applications!

Demo: Gapminder 🌎

plotly

Gapminder dataset

Gapminder Foundation is a non-profit venture registered in Stockholm, Sweden, that promotes sustainable global development and achievement of the United Nations Millennium Development Goals by increased use and understanding of statistics and other information about social, economic and environmental development at local, national and global levels. - Gapminder Wikipedia

The dataset contains information for each country for several different years.

Let's start by just looking at 2007 data (the most recent year in the dataset).

Scatter plot

We can plot life expectancy vs. GDP per capita. If you hover over a point, you will see the name of the country.

In future courses, you'll learn about transformations. Here, we'll apply a log transformation to the x-axis to make the plot look a little more linear.

Animated scatter plot

We can take things one step further.

Watch this video if you want to see an even-more-animated version of this plot.

Animated histogram

Choropleth

Parting thoughts

From Lecture 1: What is "data science"?

Data science is about drawing useful conclusions from data using computation. Throughout the quarter, we touched on several aspects of data science:

Thank you!

This course would not have been possible without...

Good luck on your finals! 🎉

And see you tomorrow at 3PM in Galbraith Hall 242. ⏰