Lecture 14 – Distributions and Sampling

DSC 10, Fall 2022

Announcements

Agenda

⚠️ The second half of the course is more conceptual than the first. Reading the textbook will become more critical.

Probability distributions vs. empirical distributions

Probability distributions

Example: Probability distribution of a die roll 🎲

The distribution is uniform, meaning that each outcome has the same probability of occurring.
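As a minimal sketch of what "uniform" means here, the snippet below writes out the probability distribution of a fair six-sided die. The variable names `faces` and `probabilities` are illustrative choices, not names from the lecture.

```python
import numpy as np

# Faces of a fair six-sided die.
faces = np.arange(1, 7)

# Under the uniform distribution, every face has the same probability, 1/6.
probabilities = np.full(len(faces), 1 / len(faces))

for face, p in zip(faces, probabilities):
    print(f"P(roll = {face}) = {p:.4f}")

# The probabilities of all possible outcomes sum to 1 (up to floating-point rounding).
print("Total probability:", probabilities.sum())
```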

Empirical distributions

Example: Empirical distribution of a die roll 🎲

Many die rolls 🎲

Why does this happen? ⚖️

The law of large numbers states that if a chance experiment is repeated many times, independently, and under the same conditions, then the proportion of times that an event occurs gets closer and closer to the theoretical probability of that event.

For example: As you roll a die repeatedly, the proportion of times you roll a 5 gets closer to $\frac{1}{6}$.
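The die example can be simulated. The sketch below is one possible way to do it, assuming NumPy's `default_rng` generator. The helper `proportion_of_fives` and the seed are made up for illustration.

```python
import numpy as np

# Seed chosen arbitrarily so that the run is reproducible.
rng = np.random.default_rng(seed=10)

def proportion_of_fives(num_rolls):
    """Simulate num_rolls fair die rolls and return the proportion that are 5."""
    # integers(1, 7) draws from 1 through 6; the upper bound is exclusive.
    rolls = rng.integers(1, 7, size=num_rolls)
    return np.count_nonzero(rolls == 5) / num_rolls

print("Theoretical probability:", 1 / 6)
for num_rolls in [10, 100, 10_000, 1_000_000]:
    print(num_rolls, "rolls:", proportion_of_fives(num_rolls))
```

With only a handful of rolls the proportion can land far from $\frac{1}{6}$. The law of large numbers describes the tendency as the number of rolls grows, not a guarantee for any single small run.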

Sampling

Populations and samples

Question: How do we collect a good sample, so that the sample distribution closely approximates the population distribution?

Bad idea ❌: Survey whoever you can get ahold of (e.g. internet survey, people in line at Panda Express at PC).

Probability sample (aka random sample)

Example: Movies 🎥

A probability sample

Simple random sample

Sampling rows from a DataFrame

If we want to sample rows from a DataFrame, we can use the .sample method on a DataFrame. That is,

df.sample(n)

returns a random subset of n rows of df, drawn without replacement (i.e. the default is replace=False, unlike np.random.choice).
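Here is a short sketch of both sampling modes. It assumes plain `pandas` rather than the course's own library, and the DataFrame contents are invented for illustration.

```python
import pandas as pd

# Hypothetical data, made up for this example.
df = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "year": [1999, 2004, 2010, 2015, 2021],
})

# Default behavior (replace=False): no row can be drawn twice,
# so n cannot exceed the number of rows.
without_replacement = df.sample(3)

# With replace=True the same row may be drawn more than once,
# so n may be larger than the number of rows.
with_replacement = df.sample(8, replace=True)

print(without_replacement)
print(with_replacement)
```

Drawing 8 rows from a 5-row DataFrame is only possible with replacement, which is why the second call must pass `replace=True`.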

The effect of sample size

Example: Distribution of flight delays ✈️

united_full contains information about all United flights leaving SFO between 6/1/15 and 8/31/15.

We only need the delays, so let's select just the 'Delay' column.

Population distribution of flight delays ✈️

Note that this distribution is fixed – nothing about it is random.

Sample distribution of flight delays ✈️

Note that as we increase sample_size, the sample distribution of delays looks more and more like the true population distribution of delays.
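To put a rough number on "looks more and more like", the sketch below compares binned proportions of a sample against those of a population. Everything here is an assumption for illustration: the population is synthetic (it is not the `united_full` data), and `binned_proportions` and `histogram_gap` are hypothetical helpers.

```python
import numpy as np

rng = np.random.default_rng(seed=14)

# Synthetic "population" of delays in minutes, NOT the real flight data.
population = rng.exponential(scale=20, size=50_000)

# Histogram bins: 0-10, 10-20, ..., 190-200 minutes.
bins = np.arange(0, 201, 10)

def binned_proportions(values):
    """Proportion of values in each bin (values above 200 are ignored)."""
    counts, _ = np.histogram(values, bins=bins)
    return counts / len(values)

population_props = binned_proportions(population)

def histogram_gap(sample_size):
    """Total absolute difference between one sample's histogram and the population's."""
    sample = rng.choice(population, size=sample_size, replace=False)
    return np.abs(binned_proportions(sample) - population_props).sum()

for sample_size in [50, 500, 5_000]:
    print(sample_size, histogram_gap(sample_size))
```

The printed gap tends to shrink as `sample_size` grows, though any single run is random, so small reversals between nearby sample sizes are possible.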

Parameters and statistics

Terminology

To remember: parameter and population both start with p, statistic and sample both start with s.

Mean flight delay ✈️

Question: What is the average delay of United flights out of SFO? 🤔

Population mean

The population mean is a parameter.

This number (like the population distribution) is fixed, and is not random. In reality, we would not be able to see this number – we can only see it right now because this is a pedagogical demonstration!

Sample mean

The sample mean is a statistic. Since it depends on our sample, which was drawn at random, the sample mean is also random.
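The contrast between a fixed parameter and a random statistic can be seen in a small simulation. This is a sketch under stated assumptions: the population is synthetic rather than the real flight data, and `sample_mean` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(seed=2022)

# Synthetic delays in minutes, standing in for a real population.
population = rng.exponential(scale=20, size=50_000)

# Parameter: computed from the entire population, so it never changes.
population_mean = population.mean()

def sample_mean(sample_size):
    """Statistic: the mean of one simple random sample of the given size."""
    sample = rng.choice(population, size=sample_size, replace=False)
    return sample.mean()

print("Population mean:", population_mean)

# Each call draws a fresh random sample, so the statistic varies from call to call.
for _ in range(3):
    print("Sample mean (n=100):", sample_mean(100))
```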

The effect of sample size

What if we choose a larger sample size?

Smaller samples:

Larger samples:

Probability distribution of a statistic

Empirical distribution of a statistic

Distribution of sample means

What's the point?

Concept Check ✅ – Answer at cc.dsc10.com

We just sampled one thousand flights, two thousand times. If we now sample one hundred flights, two thousand times, how will the histogram change?

How we sample matters!

Summary, next time

Summary

Next time

Next, we'll start talking about statistical models, which will lead us towards hypothesis testing.