from dsc80_utils import *

def show_paradox_slides():
    src = 'https://docs.google.com/presentation/d/e/2PACX-1vSbFSaxaYZ0NcgrgqZLvjhkjX-5MQzAITWAsEFZHnix3j1c0qN8Vd1rogTAQP7F7Nf5r-JWExnGey7h/embed?start=false&rm=minimal'
    width = 960
    height = 569
    display(IFrame(src, width, height))

# Pandas Tutor setup
%reload_ext pandas_tutor
%set_pandas_tutor_options {"maxDisplayCols": 8, "nohover": True, "projectorMode": True}
Lecture 3 – Aggregating, Simpson's Paradox¶
import seaborn as sns
penguins = sns.load_dataset('penguins').dropna()
penguins
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
... | ... | ... | ... | ... | ... | ... | ... |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
333 rows × 7 columns
Other DataFrameGroupBy methods¶
Split-apply-combine, revisited¶
When we introduced the split-apply-combine pattern, the "apply" step involved aggregation – our final DataFrame had one row for each group.

Instead of aggregating during the apply step, we could perform a:
- Transformation, in which we perform operations on every value within each group.
- Filtration, in which we keep only the groups that satisfy some condition.
Transformations¶
Suppose we want to convert the 'body_mass_g' column to z-scores (i.e. standard units):
def z_score(x):
    return (x - x.mean()) / x.std(ddof=0)
z_score(penguins['body_mass_g'])
0 -0.57 1 -0.51 2 -1.19 ... 341 1.92 342 1.23 343 1.48 Name: body_mass_g, Length: 333, dtype: float64
Transformations within groups¶
Now, what if we wanted the z-score within each group?
To do so, we can use the transform method on a DataFrameGroupBy object. The transform method takes in a function, which itself takes in a Series and returns a new Series.
A transformation produces a DataFrame or Series of the same size – it is not an aggregation!
z_mass = (penguins
    .groupby('species')
    ['body_mass_g']
    .transform(z_score))
z_mass
0 0.10 1 0.21 2 -1.00 ... 341 1.32 342 0.22 343 0.62 Name: body_mass_g, Length: 333, dtype: float64
penguins.assign(z_mass=z_mass)
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | z_mass | |
---|---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male | 0.10 |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female | 0.21 |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female | -1.00 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male | 1.32 |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female | 0.22 |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male | 0.62 |
333 rows × 8 columns
display_df(penguins.assign(z_mass=z_mass), rows=8)
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | z_mass | |
---|---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male | 0.10 |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female | 0.21 |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female | -1.00 |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female | -0.56 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female | -0.49 |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male | 1.32 |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female | 0.22 |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male | 0.62 |
333 rows × 8 columns
Note that above, penguin 340 has a larger 'body_mass_g' than penguin 0, but a lower 'z_mass'.
- Penguin 0 has an above-average 'body_mass_g' among 'Adelie' penguins.
- Penguin 340 has a below-average 'body_mass_g' among 'Gentoo' penguins. Remember from earlier that the average 'body_mass_g' of 'Gentoo' penguins is much higher than for other species.
penguins.groupby('species')['body_mass_g'].mean()
species Adelie 3706.16 Chinstrap 3733.09 Gentoo 5092.44 Name: body_mass_g, dtype: float64
Filtering groups¶
To keep only the groups that satisfy a particular condition, use the filter method on a DataFrameGroupBy object.
The filter method takes in a function, which itself takes in a DataFrame/Series and returns a single Boolean. The result is a new DataFrame/Series with only the groups for which the filter function returned True.
For example, suppose we want only the 'species'
whose average 'bill_length_mm'
is above 39.
(penguins
    .groupby('species')
    .filter(lambda df: df['bill_length_mm'].mean() > 39)
)
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
152 | Chinstrap | Dream | 46.5 | 17.9 | 192.0 | 3500.0 | Female |
153 | Chinstrap | Dream | 50.0 | 19.5 | 196.0 | 3900.0 | Male |
154 | Chinstrap | Dream | 51.3 | 19.2 | 193.0 | 3650.0 | Male |
... | ... | ... | ... | ... | ... | ... | ... |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
187 rows × 7 columns
No more 'Adelie's!
Or, as another example, suppose we only want 'species' with more than 100 penguins:
(penguins
    .groupby('species')
    .filter(lambda df: df.shape[0] > 100)
)
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
... | ... | ... | ... | ... | ... | ... | ... |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
265 rows × 7 columns
No more 'Chinstrap's!
Question 🤔
Answer the following questions about grouping:
- In .agg(fn), what is the input to fn? What is the output of fn?
- In .transform(fn), what is the input to fn? What is the output of fn?
- In .filter(fn), what is the input to fn? What is the output of fn?
Grouping with multiple columns¶
When we group with multiple columns, one group is created for every unique combination of elements in the specified columns.
penguins
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
... | ... | ... | ... | ... | ... | ... | ... |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
333 rows × 7 columns
species_and_island = (
    penguins
    .groupby(['species', 'island'])
    [['bill_length_mm', 'body_mass_g']]
    .mean()
)
species_and_island
bill_length_mm | body_mass_g | ||
---|---|---|---|
species | island | ||
Adelie | Biscoe | 38.98 | 3709.66 |
Dream | 38.52 | 3701.36 | |
Torgersen | 39.04 | 3708.51 | |
Chinstrap | Dream | 48.83 | 3733.09 |
Gentoo | Biscoe | 47.57 | 5092.44 |
Grouping and indexes¶
- The groupby method creates an index based on the specified columns.
- When grouping by multiple columns, the resulting DataFrame has a MultiIndex.
- Advice: When working with a MultiIndex, use reset_index or set as_index=False in groupby.
species_and_island
bill_length_mm | body_mass_g | ||
---|---|---|---|
species | island | ||
Adelie | Biscoe | 38.98 | 3709.66 |
Dream | 38.52 | 3701.36 | |
Torgersen | 39.04 | 3708.51 | |
Chinstrap | Dream | 48.83 | 3733.09 |
Gentoo | Biscoe | 47.57 | 5092.44 |
species_and_island['body_mass_g']
species island Adelie Biscoe 3709.66 Dream 3701.36 Torgersen 3708.51 Chinstrap Dream 3733.09 Gentoo Biscoe 5092.44 Name: body_mass_g, dtype: float64
species_and_island.loc['Adelie']
bill_length_mm | body_mass_g | |
---|---|---|
island | ||
Biscoe | 38.98 | 3709.66 |
Dream | 38.52 | 3701.36 |
Torgersen | 39.04 | 3708.51 |
species_and_island.loc[('Adelie', 'Torgersen')]
bill_length_mm 39.04 body_mass_g 3708.51 Name: (Adelie, Torgersen), dtype: float64
species_and_island.reset_index()
species | island | bill_length_mm | body_mass_g | |
---|---|---|---|---|
0 | Adelie | Biscoe | 38.98 | 3709.66 |
1 | Adelie | Dream | 38.52 | 3701.36 |
2 | Adelie | Torgersen | 39.04 | 3708.51 |
3 | Chinstrap | Dream | 48.83 | 3733.09 |
4 | Gentoo | Biscoe | 47.57 | 5092.44 |
(penguins
    .groupby(['species', 'island'], as_index=False)
    [['bill_length_mm', 'body_mass_g']]
    .mean()
)
species | island | bill_length_mm | body_mass_g | |
---|---|---|---|---|
0 | Adelie | Biscoe | 38.98 | 3709.66 |
1 | Adelie | Dream | 38.52 | 3701.36 |
2 | Adelie | Torgersen | 39.04 | 3708.51 |
3 | Chinstrap | Dream | 48.83 | 3733.09 |
4 | Gentoo | Biscoe | 47.57 | 5092.44 |
Question 🤔
Find the most popular Male and Female baby Name for each Year in baby. Exclude Years where there were fewer than 1 million births recorded.
baby_path = Path('data') / 'baby.csv'
baby = pd.read_csv(baby_path)
baby
Name | Sex | Count | Year | |
---|---|---|---|---|
0 | Liam | M | 20456 | 2022 |
1 | Noah | M | 18621 | 2022 |
2 | Olivia | F | 16573 | 2022 |
... | ... | ... | ... | ... |
2085155 | Wright | M | 5 | 1880 |
2085156 | York | M | 5 | 1880 |
2085157 | Zachariah | M | 5 | 1880 |
2085158 rows × 4 columns
# Your code goes here.
Pivot tables using the pivot_table method¶
Pivot tables: an extension of grouping¶
Pivot tables are a compact way to display tables for humans to read:
Sex | F | M |
---|---|---|
Year | ||
2018 | 1698373 | 1813377 |
2019 | 1675139 | 1790682 |
2020 | 1612393 | 1721588 |
2021 | 1635800 | 1743913 |
2022 | 1628730 | 1733166 |
- Notice that each value in the table is a sum over the counts, split by year and sex.
- You can think of pivot tables as grouping using two columns, then "pivoting" one of the group labels into columns.
pivot_table¶
The pivot_table
DataFrame method aggregates a DataFrame using two columns. To use it:
df.pivot_table(index=index_col,
               columns=columns_col,
               values=values_col,
               aggfunc=func)
The resulting DataFrame will have:
- One row for every unique value in index_col.
- One column for every unique value in columns_col.
- Values determined by applying func on values in values_col.
last_5_years = baby.query('Year >= 2018')
last_5_years
Name | Sex | Count | Year | |
---|---|---|---|---|
0 | Liam | M | 20456 | 2022 |
1 | Noah | M | 18621 | 2022 |
2 | Olivia | F | 16573 | 2022 |
... | ... | ... | ... | ... |
159444 | Zyrie | M | 5 | 2018 |
159445 | Zyron | M | 5 | 2018 |
159446 | Zzyzx | M | 5 | 2018 |
159447 rows × 4 columns
last_5_years.pivot_table(
    index='Year',
    columns='Sex',
    values='Count',
    aggfunc='sum',
)
Sex | F | M |
---|---|---|
Year | ||
2018 | 1698373 | 1813377 |
2019 | 1675139 | 1790682 |
2020 | 1612393 | 1721588 |
2021 | 1635800 | 1743913 |
2022 | 1628730 | 1733166 |
# Look at the similarity to the snippet above!
(last_5_years
    .groupby(['Year', 'Sex'])
    [['Count']]
    .sum()
)
Count | ||
---|---|---|
Year | Sex | |
2018 | F | 1698373 |
M | 1813377 | |
2019 | F | 1675139 |
... | ... | ... |
2021 | M | 1743913 |
2022 | F | 1628730 |
M | 1733166 |
10 rows × 1 columns
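In fact, we can recover the pivot table from this grouped result ourselves. The snippet below is a sketch (not part of the original lecture code) that makes the "pivoting" step explicit: it moves the 'Sex' level of the index into the columns (more on unstack in the Reshaping section later).
# A sketch: group, aggregate, then pivot the 'Sex' index level into columns.
# The result matches the pivot_table output above.
(last_5_years
    .groupby(['Year', 'Sex'])
    ['Count']
    .sum()
    .unstack()
)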
Example¶
Find the number of penguins per 'island' and 'species'.
penguins
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
... | ... | ... | ... | ... | ... | ... | ... |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
333 rows × 7 columns
penguins.pivot_table(
    index='species',
    columns='island',
    values='bill_length_mm',  # Choice of column here doesn't actually matter!
    aggfunc='count',
)
island | Biscoe | Dream | Torgersen |
---|---|---|---|
species | |||
Adelie | 44.0 | 55.0 | 47.0 |
Chinstrap | NaN | 68.0 | NaN |
Gentoo | 119.0 | NaN | NaN |
Note that there is a NaN at the intersection of 'Biscoe' and 'Chinstrap', because there were no Chinstrap penguins on Biscoe Island.
We can either use the fillna method afterwards or the fill_value argument to fill in NaNs.
penguins.pivot_table(
    index='species',
    columns='island',
    values='bill_length_mm',
    aggfunc='count',
    fill_value=0,
)
island | Biscoe | Dream | Torgersen |
---|---|---|---|
species | |||
Adelie | 44 | 55 | 47 |
Chinstrap | 0 | 68 | 0 |
Gentoo | 119 | 0 | 0 |
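For completeness, here's what the fillna approach mentioned above could look like. This is a minimal sketch (the name island_counts is just for illustration); note that the NaNs force the counts into a float dtype, so we cast back to int at the end.
# Build the pivot table without fill_value, then fill missing combinations with 0.
island_counts = penguins.pivot_table(
    index='species',
    columns='island',
    values='bill_length_mm',
    aggfunc='count',
)
# The NaNs made the columns floats, so cast back to int after filling.
island_counts.fillna(0).astype(int)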
Granularity, revisited¶
Take another look at the pivot table from the previous slide. Each row of the original penguins
DataFrame represented a single penguin, and each column represented features of the penguins.
What is the granularity of the DataFrame below?
penguins.pivot_table(
    index='species',
    columns='island',
    values='bill_length_mm',
    aggfunc='count',
    fill_value=0,
)
island | Biscoe | Dream | Torgersen |
---|---|---|---|
species | |||
Adelie | 44 | 55 | 47 |
Chinstrap | 0 | 68 | 0 |
Gentoo | 119 | 0 | 0 |
Reshaping¶
- pivot_table reshapes DataFrames from "long" to "wide".
- Other DataFrame reshaping methods (a small sketch follows this list):
  - melt: Un-pivots a DataFrame. Very useful in data cleaning.
  - pivot: Like pivot_table, but doesn't do aggregation.
  - stack: Pivots multi-level columns to multi-indices.
  - unstack: Pivots multi-indices to columns.
- Google and the documentation are your friends!
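For instance, melt can un-pivot the "wide" penguin counts back into one row per (species, island) pair. This is a sketch, not part of the original lecture code; the name wide_counts is just for this example.
# Start from the wide pivot table of counts.
wide_counts = penguins.pivot_table(
    index='species',
    columns='island',
    values='bill_length_mm',
    aggfunc='count',
    fill_value=0,
)
# melt un-pivots: each (species, island) combination becomes its own row again.
wide_counts.reset_index().melt(id_vars='species', var_name='island', value_name='count')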
Distributions¶
Joint distribution¶
When using aggfunc='count', a pivot table describes the joint distribution of two categorical variables. This is also called a contingency table.
counts = penguins.pivot_table(
    index='species',
    columns='sex',
    values='body_mass_g',
    aggfunc='count',
    fill_value=0
)
counts
sex | Female | Male |
---|---|---|
species | ||
Adelie | 73 | 73 |
Chinstrap | 34 | 34 |
Gentoo | 58 | 61 |
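As an aside, pandas also has a dedicated function for building contingency tables. The one-liner below is a sketch (not from the lecture) that produces the same table of counts as the pivot table above.
# pd.crosstab counts co-occurrences directly, with no values column needed.
pd.crosstab(penguins['species'], penguins['sex'])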
We can normalize the DataFrame by dividing by the total number of penguins. The resulting numbers can be interpreted as probabilities that a randomly selected penguin from the dataset belongs to a given combination of species and sex.
joint = counts / counts.sum().sum()
joint
sex | Female | Male |
---|---|---|
species | ||
Adelie | 0.22 | 0.22 |
Chinstrap | 0.10 | 0.10 |
Gentoo | 0.17 | 0.18 |
Marginal probabilities¶
If we sum over one of the axes, we can compute marginal probabilities, i.e. unconditional probabilities.
joint
sex | Female | Male |
---|---|---|
species | ||
Adelie | 0.22 | 0.22 |
Chinstrap | 0.10 | 0.10 |
Gentoo | 0.17 | 0.18 |
# Recall, joint.sum(axis=0) sums across the rows,
# which computes the sum of the **columns**.
joint.sum(axis=0)
sex Female 0.5 Male 0.5 dtype: float64
joint.sum(axis=1)
species Adelie 0.44 Chinstrap 0.20 Gentoo 0.36 dtype: float64
For instance, the second Series tells us that a randomly selected penguin has a 0.36 chance of being of species 'Gentoo'.
Conditional probabilities¶
Using counts, how might we compute conditional probabilities like $$P(\text{species} = \text{"Adelie"} \mid \text{sex} = \text{"Female"})?$$
counts
sex | Female | Male |
---|---|---|
species | ||
Adelie | 73 | 73 |
Chinstrap | 34 | 34 |
Gentoo | 58 | 61 |
Here's a derivation:
$$\begin{align*} P(\text{species} = c \mid \text{sex} = x) &= \frac{P(\text{species} = c \text{ and } \text{sex} = x)}{P(\text{sex} = x)} \\ &= \frac{\frac{\# \: (\text{species} = c \text{ and } \text{sex} = x)}{N}}{\frac{\# \: (\text{sex} = x)}{N}} \\ &= \frac{\# \: (\text{species} = c \text{ and } \text{sex} = x)}{\# \: (\text{sex} = x)} \end{align*}$$
Answer: To find conditional probabilities of 'species' given 'sex', divide by column sums. To find conditional probabilities of 'sex' given 'species', divide by row sums.
Conditional probabilities¶
To find conditional probabilities of 'species' given 'sex', divide by column sums. To find conditional probabilities of 'sex' given 'species', divide by row sums.
counts
sex | Female | Male |
---|---|---|
species | ||
Adelie | 73 | 73 |
Chinstrap | 34 | 34 |
Gentoo | 58 | 61 |
counts.sum(axis=0)
sex Female 165 Male 168 dtype: int64
The conditional distribution of 'species' given 'sex' is below. Note that in this new DataFrame, the 'Female' and 'Male' columns each sum to 1.
counts / counts.sum(axis=0)
sex | Female | Male |
---|---|---|
species | ||
Adelie | 0.44 | 0.43 |
Chinstrap | 0.21 | 0.20 |
Gentoo | 0.35 | 0.36 |
For instance, the above DataFrame tells us that the probability that a randomly selected penguin is of 'species' 'Adelie', given that they are of 'sex' 'Female', is 0.442424.
Question 🤔
Find the conditional distribution of 'sex' given 'species'.
Hint: Use .T.
# Your code goes here.
Example: Grades¶
Two students, Lisa and Bart, just finished their first year at UCSD. They each took a different number of classes in Fall, Winter, and Spring.
Each quarter, Lisa had a higher GPA than Bart.
But Bart has a higher overall GPA.
How is this possible? 🤔
Run this cell to create DataFrames that contain each student's grades.
lisa = pd.DataFrame([[20, 46], [18, 54], [5, 20]],
    columns=['Units', 'Grade Points Earned'],
    index=['Fall', 'Winter', 'Spring'],
)
lisa.columns.name = 'Lisa'  # This allows us to see the name "Lisa" in the top left of the DataFrame.

bart = pd.DataFrame([[5, 10], [5, 13.5], [22, 81.4]],
    columns=['Units', 'Grade Points Earned'],
    index=['Fall', 'Winter', 'Spring'],
)
bart.columns.name = 'Bart'
Quarter-specific vs. overall GPAs¶
Note: The number of "grade points" earned for a course is
$$\text{number of units} \cdot \text{grade (out of 4)}$$
For instance, an A- in a 4-unit course earns $3.7 \cdot 4 = 14.8$ grade points.
dfs_side_by_side(lisa, bart)
Lisa | Units | Grade Points Earned |
---|---|---|
Fall | 20 | 46 |
Winter | 18 | 54 |
Spring | 5 | 20 |
Bart | Units | Grade Points Earned |
---|---|---|
Fall | 5 | 10.0 |
Winter | 5 | 13.5 |
Spring | 22 | 81.4 |
Lisa had a higher GPA in all three quarters.
quarterly_gpas = pd.DataFrame({
    "Lisa's Quarter GPA": lisa['Grade Points Earned'] / lisa['Units'],
    "Bart's Quarter GPA": bart['Grade Points Earned'] / bart['Units'],
})
quarterly_gpas
Lisa's Quarter GPA | Bart's Quarter GPA | |
---|---|---|
Fall | 2.3 | 2.0 |
Winter | 3.0 | 2.7 |
Spring | 4.0 | 3.7 |
Question 🤔 (Answer at dsc80.com/q)
Use the DataFrame lisa to compute Lisa's overall GPA, and use the DataFrame bart to compute Bart's overall GPA.
# Helper function to show lisa and bart side-by-side to save screen space
dfs_side_by_side(lisa, bart)
Lisa | Units | Grade Points Earned |
---|---|---|
Fall | 20 | 46 |
Winter | 18 | 54 |
Spring | 5 | 20 |
Bart | Units | Grade Points Earned |
---|---|---|
Fall | 5 | 10.0 |
Winter | 5 | 13.5 |
Spring | 22 | 81.4 |
# Your code goes here.
What happened?¶
(quarterly_gpas
    .assign(Lisa_Units=lisa['Units'],
            Bart_Units=bart['Units'])
    .iloc[:, [0, 2, 1, 3]]
)
Lisa's Quarter GPA | Lisa_Units | Bart's Quarter GPA | Bart_Units | |
---|---|---|---|---|
Fall | 2.3 | 20 | 2.0 | 5 |
Winter | 3.0 | 18 | 2.7 | 5 |
Spring | 4.0 | 5 | 3.7 | 22 |
When Lisa and Bart both performed poorly, Lisa took more units than Bart. This brought down 📉 Lisa's overall average.
When Lisa and Bart both performed well, Bart took more units than Lisa. This brought up 📈 Bart's overall average.
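To see the reversal numerically, here's a quick sketch (the variable names are just for illustration) that computes each overall GPA as total grade points earned divided by total units, i.e. a units-weighted average of the quarterly GPAs.
# Overall GPA = total grade points earned / total units taken.
lisa_overall = lisa['Grade Points Earned'].sum() / lisa['Units'].sum()
bart_overall = bart['Grade Points Earned'].sum() / bart['Units'].sum()
lisa_overall, bart_overall  # Lisa's is lower, even though she led in every single quarter.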
Simpson's paradox¶
Simpson's paradox occurs when grouped data and ungrouped data show opposing trends.
- It is named after Edward H. Simpson, not Lisa or Bart Simpson.
It often happens because there is a hidden factor (i.e. a confounder) within the data that influences results.
Question: What is the "correct" way to summarize your data? What if you had to act on these results?
show_paradox_slides()
What happened?¶
The overall acceptance rate for women (30%) was lower than it was for men (45%).
However, most departments (A, B, D, F) had a higher acceptance rate for women.
Department A had a 62% acceptance rate for men and an 82% acceptance rate for women!
- 31% of men applied to Department A.
- 6% of women applied to Department A.
Department F had a 6% acceptance rate for men and a 7% acceptance rate for women!
- 14% of men applied to Department F.
- 19% of women applied to Department F.
Conclusion: Women tended to apply to departments with a lower acceptance rate; the data don't support the hypothesis that there was major gender discrimination against women.
Example: Restaurant reviews and phone types¶
You are deciding whether to eat at Dirty Birds or The Loft.
Suppose Yelp shows ratings aggregated by phone type (Android vs. iPhone).
Phone Type | Stars for Dirty Birds | Stars for The Loft |
---|---|---|
Android | 4.24 | 4.0 |
iPhone | 2.99 | 2.79 |
All | 3.32 | 3.37 |
Question: Should you choose Dirty Birds or The Loft?
Answer: The type of phone you use likely has nothing to do with your taste in food – pick the restaurant that is rated higher overall.
Rule of thumb 👍¶
Let $(X, Y)$ be a pair of variables of interest. Simpson's paradox occurs when the association between $X$ and $Y$ reverses when we condition on $Z$, a third variable.
If $Z$ has a causal connection to both $X$ and $Y$, we should condition on $Z$ and use the disaggregated data.
If not, we shouldn't condition on $Z$; we should use the aggregated data instead.
Berkeley gender discrimination: $X$ is gender, $Y$ is acceptance rate. $Z$ is the department.
- $Z$ has a plausible causal effect on both $X$ and $Y$, so we should condition on $Z$.
Yelp ratings: $X$ is the restaurant, $Y$ is the average stars. $Z$ is the phone type.
- $Z$ doesn't plausibly cause $X$ to change, so we should not condition on $Z$.
Takeaways¶
Be skeptical of...
- Aggregate statistics.
- People misusing statistics to "prove" that discrimination doesn't exist.
- Drawing conclusions from individual publications ($p$-hacking, publication bias, narrow focus, etc.).
- Everything!
We need to apply domain knowledge and make judgment calls to decide what to do when Simpson's paradox is present.
Really?¶
To handle Simpson's paradox with rigor, we need some ideas from causal inference which we don't have time to cover in DSC 80. This video has a good example of how to approach Simpson's paradox using a minimal amount of causal inference, if you're curious (not required for DSC 80).
IFrame('https://www.youtube-nocookie.com/embed/zeuW1Z2EtLs?si=l2Dl7P-5RCq3ODpo',
width=800, height=450)
Further reading¶
- What is Simpson's Paradox?
- Understanding Simpson's Paradox
- Requires more statistics background, but gives a rigorous understanding of when to use aggregated vs. unaggregated data.