Lecture 5 – Pivoting and Simpson's Paradox

DSC 80, Winter 2023

Announcements

Agenda

Grouping

Recall, last class, we started working with a dataset that involves various measurements taken of three species of penguins in Antarctica.

Discussion Question

Find the 'island' on which the heaviest penguin of each 'species' lives.

Grouping with multiple columns

When we group with multiple columns, one group is created for every unique combination of elements in the specified columns.

Grouping and indexes

Pivoting

Mean body mass for every combination of species and island

We just found the above information by grouping by both 'species' and 'island'.

But we can also create a pivot table, which contains the same information in a different orientation.

Let's visualize how the pivot table was created using Pandas Tutor.

pivot_table

The pivot_table DataFrame method aggregates a DataFrame using two columns. To use it:

df.pivot_table(index=index_col,
               columns=columns_col,
               values=values_col,
               aggfunc=func)

The resulting DataFrame will have:

Example

Find the number of penguins per island and species.

Note that there is a NaN at the intersection of 'Biscoe' and 'Chinstrap', because there were no Chinstrap penguins on Biscoe Island.

We can either use the fillna method afterwards or the fill_value argument to fill in NaNs.

Example

Find the median body mass per species and sex.

Important: In penguins, each row corresponds to an individual/observation. In the pivot table above, that is no longer true.

Distributions

Joint distribution

When using aggfunc='count', a pivot table describes the joint distribution of two categorical variables. This is also called a contingency table.

We can normalize the DataFrame by dividing by the total number of penguins. The resulting numbers can be interpreted as probabilities that a randomly selected penguin from the dataset belongs to a given combination of species and sex.

Marginal probabilities

If we sum over one of the axes, we can compute marginal probabilities, i.e. unconditional probabilities.

For instance, the second Series tells us that a randomly selected penguin has a 0.357357 chance of being of species 'Gentoo'.

Conditional probabilities

Using counts, how might we compute conditional probabilities like $$P(\text{species } = \text{"Adelie"} \mid \text{sex } = \text{"Female"})?$$

$$\begin{align*} P(\text{species} = c \mid \text{sex} = x) &= \frac{P(\text{species} = c \text{ and } \text{sex} = x)}{P(\text{sex = }x)} \\ &= \frac{\frac{\# \: (\text{species } = \: c \text{ and } \text{sex } = \: x)}{N}}{\frac{\# \: (\text{sex } = \: x)}{N}} \\ &= \frac{\# \: (\text{species} = c \text{ and } \text{sex} = x)}{\# \: (\text{sex} = x)} \end{align*}$$

Answer: To find conditional probabilities of species given sex, divide by column sums. To find conditional probabilities of sex given species, divide by row sums.

Conditional probabilities

To find conditional probabilities of species given sex, divide by column sums. To find conditional probabilities of sex given species, divide by row sums.

The conditional distribution of species given sex is below. Note that in this new DataFrame, the 'Female' and 'Male' columns each sum to 1.

For instance, the above DataFrame tells us that the probability that a randomly selected penguin is of species 'Adelie' given that they are of sex 'Female' is 0.442424.

Task: Try and find the conditional distribution of sex given species.

Reshaping DataFrames

pivot_table aggregates and reshapes

Example: Tic-tac-toe

The pivot method only reshapes a DataFrame. It does not change any of the values in it (i.e. aggfunc doesn't work with pivot).

pivot_table = groupby + pivot

aggfunc='mean' plays the same role that .mean() does.

Reshaping

Simpson's paradox

Example: Grades

Run this cell to create DataFrames that contain each students' grades.

Quarter-specific vs. overall GPAs

Note: The number of "grade points" earned for a course is

$$\text{number of units} \cdot \text{grade (out of 4)}$$

For instance, an A- in a 4 unit course earns $3.7 \cdot 4 = 14.8$ grade points.

Lisa had a higher GPA in all three quarters:

But Lisa's overall GPA was less than Bart's overall GPA:

What happened?

Simpson's paradox

Example: How Berkeley was almost sued for gender discrimination (1973)

What do you notice?

What happened?

Caution!

This doesn't mean that admissions are free from gender discrimination!

From Moss-Racusin et al., 2012, PNAS (cited 2600+ times):

In a randomized double-blind study (n = 127), science faculty from research-intensive universities rated the application materials of a student—who was randomly assigned either a male or female name—for a laboratory manager position. Faculty participants rated the male applicant as significantly more competent and hireable than the (identical) female applicant. These participants also selected a higher starting salary and offered more career mentoring to the male applicant. The gender of the faculty participants did not affect responses, such that female and male faculty were equally likely to exhibit bias against the female student.

But then...

From Williams and Ceci, 2015, PNAS:

Here we report five hiring experiments in which faculty evaluated hypothetical female and male applicants, using systematically varied profiles disguising identical scholarship, for assistant professorships in biology, engineering, economics, and psychology. Contrary to prevailing assumptions, men and women faculty members from all four fields preferred female applicants 2:1 over identically qualified males with matching lifestyles (single, married, divorced), with the exception of male economists, who showed no gender preference.

Do these conflict?

Not necessarily. One explanation, from William and Ceci:

Instead, past studies have used ratings of students’ hirability for a range of posts that do not include tenure-track jobs, such as managing laboratories or performing math assignments for a company. However, hiring tenure-track faculty differs from hiring lower-level staff: it entails selecting among highly accomplished candidates, all of whom have completed Ph.D.s and amassed publications and strong letters of support. Hiring bias may occur when applicants’ records are ambiguous, as was true in studies of hiring bias for lower-level staff posts, but such bias may not occur when records are clearly strong, as is the case with tenure-track hiring.

Do these conflict?

From Witteman, et al, 2019, in The Lancet:

Thus, evidence of scientists favouring women comes exclusively from hypothetical scenarios, whereas evidence of scientists favouring men comes from hypothetical scenarios and real behaviour. This might reflect academics' growing awareness of the social desirability of achieving gender balance, while real academic behaviour might not yet put such ideals into action.

Example: Restaurant reviews and phone types

Phone Type Stars for Dirty Birds Stars for The Loft
Android 4.24 4.0
iPhone 2.99 2.79
All 3.32 3.37

Verifying Simpson's paradox

Aggregated means:

Disaggregated means:

Takeaways

Be skeptical of...

Further reading

Aside: Working with time series data

Time series – why now?

Datetime types

When working with time data, you will see two different kinds of "times":

The datetime module

Python has an in-built datetime module, which contains datetime and timedelta types. These are much more convenient to deal with than strings that contain times.

Unix timestamps count the number of seconds since January 1st, 1970.

Times in pandas

Timestamps have time-related attributes, e.g. dayofweek, hour, min, sec.

Subtracting timestamps yields pd.Timedelta objects.

Example: Exam speeds

Below, we have the Final Exam starting and ending times for two sections of a course.

Question: Who took the longest time to finish the exam?

Summary, next time

Summary

Next time

Combining DataFrames.