# Set up packages for lecture. Don't worry about understanding this code, but
# make sure to run it if you're following along.
import numpy as np
import babypandas as bpd
import pandas as pd
from matplotlib_inline.backend_inline import set_matplotlib_formats
import matplotlib.pyplot as plt
set_matplotlib_formats("svg")  # Render plots as crisp vector graphics.
plt.style.use('ggplot')  # Consistent color scheme for all plots.
np.set_printoptions(threshold=20, precision=2, suppress=True)  # Abbreviate long arrays.
pd.set_option("display.max_rows", 7)  # Show at most 7 rows per DataFrame.
pd.set_option("display.max_columns", 8)  # Show at most 8 columns per DataFrame.
pd.set_option("display.precision", 2)  # Display floats with 2 decimal places.
The distribution of die faces is uniform, meaning that each outcome has the same probability of occurring.
# The faces of a standard die: the integers 1 through 6.
die_faces = np.arange(1, 7)
# Store the faces in a one-column DataFrame so we can plot their distribution.
die = bpd.DataFrame().assign(face=die_faces)
die
face | |
---|---|
0 | 1 |
1 | 2 |
2 | 3 |
3 | 4 |
4 | 5 |
5 | 6 |
# Bin edges at 0.5, 1.5, ..., 6.5 so each integer face falls in its own bin.
bins = np.arange(0.5, 6.6, 1)
# Note that you can add titles to your visualizations, like this!
# density=True normalizes bar areas to sum to 1; ec='w' draws white bar edges.
die.plot(kind='hist', y='face', bins=bins, density=True, ec='w',
title='Probability Distribution of a Die Roll',
figsize=(5, 3))
# You can also set the y-axis label with plt.ylabel
plt.ylabel('Probability');
# np.random.choice draws uniformly at random (with replacement, by default)
# from the array we pass it.
np.random.choice

# Simulate 25 die rolls.
# (Fixed: the extracted line had a stray leading '.', a syntax error.)
num_rolls = 25
many_rolls = np.random.choice(die_faces, num_rolls)
many_rolls
array([5, 4, 3, ..., 1, 6, 1])
# Plot the empirical (observed) distribution of the simulated rolls.
(bpd.DataFrame()
.assign(face=many_rolls)
.plot(kind='hist', y='face', bins=bins, density=True, ec='w',
title=f'Empirical Distribution of {num_rolls} Dice Rolls',
figsize=(5, 3))
)
plt.ylabel('Probability');
# As the number of rolls grows, the empirical distribution looks more and more
# like the uniform probability distribution.
for num_rolls in [10, 50, 100, 500, 1000, 5000, 10000]:
    # Don't worry about how .sample works just yet – we'll cover it shortly
    (die.sample(n=num_rolls, replace=True)
     .plot(kind='hist', y='face', bins=bins, density=True, ec='w',
           title=f'Distribution of {num_rolls} Die Rolls',
           figsize=(8, 3))
    )
The law of large numbers states that if a chance experiment is repeated many times, independently and under the same conditions, then the proportion of times that an event occurs gets closer and closer to the theoretical probability of that event.
For example: As you roll a die repeatedly, the proportion of times you roll a 5 gets closer to $\frac{1}{6}$.
Question: How do we collect a good sample, so that the sample distribution closely approximates the population distribution?
Bad idea ❌: Survey whoever you can get ahold of (e.g. internet survey, people in line at Panda Express at PC).
# Load the dataset of 200 top-grossing movies.
top = bpd.read_csv('data/top_movies.csv')
top
Title | Studio | Gross | Gross (Adjusted) | Year | |
---|---|---|---|---|---|
0 | Star Wars: The Force Awakens | Buena Vista (Disney) | 906723418 | 906723400 | 2015 |
1 | Avatar | Fox | 760507625 | 846120800 | 2009 |
2 | Titanic | Paramount | 658672302 | 1178627900 | 1997 |
... | ... | ... | ... | ... | ... |
197 | Duel in the Sun | Selz. | 20408163 | 443877500 | 1946 |
198 | Sergeant York | Warner Bros. | 16361885 | 418671800 | 1941 |
199 | The Four Horsemen of the Apocalypse | MPC | 9183673 | 399489800 | 1921 |
200 rows × 5 columns
# Systematic sample: pick a random starting row among the first 10,
# then take every 10th row after it (20 of the 200 rows).
start = np.random.choice(np.arange(10))
top.take(np.arange(start, 200, 10))
Title | Studio | Gross | Gross (Adjusted) | Year | |
---|---|---|---|---|---|
7 | Star Wars | Fox | 460998007 | 1549640500 | 1977 |
17 | The Hunger Games | Lionsgate | 408010692 | 442510400 | 2012 |
27 | The Passion of the Christ | NM | 370782930 | 519432100 | 2004 |
... | ... | ... | ... | ... | ... |
177 | Cleopatra (1963) | Fox | 57777778 | 584496100 | 1963 |
187 | Swiss Family Robinson | Disney | 40356000 | 468129600 | 1960 |
197 | Duel in the Sun | Selz. | 20408163 | 443877500 | 1946 |
20 rows × 5 columns
If we want to sample from an array `options` without replacement, we use `np.random.choice(options, replace=False)`. If instead `replace=True`, then we're sampling uniformly at random with replacement – there's no simpler term for this.

If we want to sample rows from a DataFrame, we can use the `.sample` method on a DataFrame. That is, `df.sample(n)` returns a random subset of `n` rows of `df`, drawn without replacement (i.e. the default is `replace=False`, unlike `np.random.choice`).
# Without replacement
# Draws 5 distinct rows uniformly at random (replace=False is the default).
top.sample(5)
Title | Studio | Gross | Gross (Adjusted) | Year | |
---|---|---|---|---|---|
4 | Marvel's The Avengers | Buena Vista (Disney) | 623357910 | 668866600 | 2012 |
78 | Toy Story 2 | Buena Vista (Disney) | 245852179 | 416177700 | 1999 |
177 | Cleopatra (1963) | Fox | 57777778 | 584496100 | 1963 |
166 | Pinocchio | Disney | 84254167 | 586409000 | 1940 |
42 | Iron Man | Paramount | 318412101 | 385808100 | 2008 |
# With replacement
# The same row can appear more than once in the sample.
top.sample(5, replace=True)
Title | Studio | Gross | Gross (Adjusted) | Year | |
---|---|---|---|---|---|
163 | Peter Pan | Disney | 87404651 | 396924700 | 1953 |
177 | Cleopatra (1963) | Fox | 57777778 | 584496100 | 1963 |
178 | 2001: A Space Odyssey | MGM | 56954992 | 377027700 | 1968 |
167 | M.A.S.H. | Fox | 81600000 | 467052600 | 1970 |
78 | Toy Story 2 | Buena Vista (Disney) | 245852179 | 416177700 | 1999 |
united_full
contains information about all United flights leaving SFO between 6/1/15 and 8/31/15.
# Each row is one United flight out of SFO in summer 2015.
united_full = bpd.read_csv('data/united_summer2015.csv')
united_full
Date | Flight Number | Destination | Delay | |
---|---|---|---|---|
0 | 6/1/15 | 73 | HNL | 257 |
1 | 6/1/15 | 217 | EWR | 28 |
2 | 6/1/15 | 237 | STL | -3 |
... | ... | ... | ... | ... |
13822 | 8/31/15 | 1994 | ORD | 3 |
13823 | 8/31/15 | 2000 | PHX | -1 |
13824 | 8/31/15 | 2013 | EWR | -2 |
13825 rows × 4 columns
We only need the `'Delay'` column, so let's select just that column.
# Passing a list to .get keeps the result as a one-column DataFrame.
united = united_full.get(['Delay'])
united
Delay | |
---|---|
0 | 257 |
1 | 28 |
2 | -3 |
... | ... |
13822 | 3 |
13823 | -1 |
13824 | -2 |
13825 rows × 1 columns
# 10-minute-wide bins covering delays from -20 up to 290 minutes.
bins = np.arange(-20, 300, 10)
united.plot(kind='hist', y='Delay', bins=bins, density=True, ec='w',
title='Population Distribution of Flight Delays', figsize=(8, 3))
plt.ylabel('Proportion per minute');
Note that this distribution is fixed – nothing about it is random. The flight delays in `united` constitute our population. We'll draw samples from `united` without replacement.
# Sample distribution
sample_size = 100
# Draw 100 flights at random (without replacement) and plot their delay distribution.
(united
.sample(sample_size)
.plot(kind='hist', y='Delay', bins=bins, density=True, ec='w',
title='Sample Distribution of Flight Delays',
figsize=(8, 3))
);
Note that as we increase sample_size
, the sample distribution of delays looks more and more like the true population distribution of delays.
To remember: parameter and population both start with p, statistic and sample both start with s.
Question: What is the average delay of United flights out of SFO? 🤔
The population mean is a parameter.
# Calculate the mean of the population
# This is a parameter: a fixed number that describes the population.
united_mean = united.get('Delay').mean()
united_mean
16.658155515370705
This number (like the population distribution) is fixed, and is not random. In reality, we would not be able to see this number – we can only see it right now because this is a pedagogical demonstration!
The sample mean is a statistic. Since it depends on our sample, which was drawn at random, the sample mean is also random.
# Size 100
# The sample mean is a statistic: it varies because the sample is random.
united.sample(100).get('Delay').mean()
14.68
What if we choose a larger sample size?
# Size 1000
# Larger samples tend to produce sample means closer to the population mean.
united.sample(1000).get('Delay').mean()
16.276
Smaller samples: sample means vary more from sample to sample.
Larger samples: sample means cluster more tightly around the population mean.
# Sample one thousand flights, two thousand times
sample_size = 1000
repetitions = 2000
# Accumulate each repetition's sample mean in a list, then convert to an array.
means = []
for _ in range(repetitions):
    means.append(united.sample(sample_size).get('Delay').mean())
sample_means = np.array(means)
# Plot the empirical distribution of the 2000 sample means.
bpd.DataFrame().assign(sample_means=sample_means) \
.plot(kind='hist', bins=np.arange(10, 25, 0.5), density=True, ec='w',
title=f'Distribution of Sample Mean with Sample Size {sample_size}',
figsize=(10, 5));
# Mark the true population mean with a vertical line for comparison.
plt.axvline(x=united_mean, c='black');
We just sampled one thousand flights, two thousand times. If we now sample one hundred flights, two thousand times, how will the histogram change?
Next, we'll start talking about statistical models, which will lead us towards hypothesis testing.