Lecture 8 – Pivoting, Simpson's Paradox, and Concatenation

DSC 80, Spring 2022

Announcements

Agenda

Grouping

🐧

Discussion Question

For each species, find the island on which the heaviest penguin of that species lives.

Grouping with multiple columns

When we group with multiple columns, one group is created for every unique combination of elements in the specified columns.

Grouping and indexes

Pivoting

Average body mass for every combination of species and island

To find the above information, we can group by both 'species' and 'island'.

But we can also create a pivot table.

Note that the DataFrame above shows the same information as the Series above it, just in a different arrangement.

pivot_table

df.pivot_table(index=index_col,
               columns=columns_col,
               values=values_col,
               aggfunc=func)

Example

Find the number of penguins per island and species.

Note that there is a NaN at the intersection of 'Biscoe' and 'Chinstrap', because there were no Chinstrap penguins on Biscoe Island.

We can either use the fillna method afterwards or the fill_values argument to fill in NaNs.

Example

Find the mean body mass per species and sex.

Important: In penguins, each row corresponds to an individual/observation. In the pivot table above, that is no longer true.

Joint and conditional distributions

When using aggfunc='count', a pivot table describes the joint distribution of two categorical variables.

We can normalize the DataFrame by dividing by the total number of penguins. The resulting numbers can be interpreted as probabilities that a randomly selected penguin from the dataset belongs to a given combination of species and sex.

If we sum over one of the axes, we can compute marginal probabilities.

For instance, the first Series tells us that a randomly selected penguin has a 0.357357 chance of being of species 'Gentoo'.

If we divide counts by row or column sums, we can compute conditional probabilities.

The conditional distribution of species given sex is below.

For instance, the above DataFrame tells us that the probability that a randomly selected penguin is of species 'Adelie' given that they are of sex 'Female' is 0.442424.

The conditional distribution of sex given species is below.

pivot_table aggregates and reshapes

Example: Tic-tac-toe

The pivot method only reshapes a DataFrame. It does not change any of the values in it (i.e. aggfunc doesn't work with pivot).

pivot_table = groupby + pivot

aggfunc='mean' plays the same role that .mean() does.

Reshaping

Simpson's paradox

Example: Grades

Note: The number of "grade points" you earn for a course is

$$\text{number of units} \cdot \text{grade (out of 4)}$$

So an A- in a 4 unit course earns $3.7 \cdot 4 = 14.8$ grade points.

The following DataFrame shows that Lisa had a higher GPA in all three quarters.

But Lisa's overall GPA is less than Bart's overall GPA.

What happened?

Simpson's paradox

Example: How Berkeley was sued for gender discrimination (1973)

What do you notice?

What happened?

Caution!

This doesn't mean that admissions are free from gender discrimination!

From Moss-Racusin et al., 2012, PNAS (cited 2600+ times):

In a randomized double-blind study (n = 127), science faculty from research-intensive universities rated the application materials of a student—who was randomly assigned either a male or female name—for a laboratory manager position. Faculty participants rated the male applicant as significantly more competent and hireable than the (identical) female applicant. These participants also selected a higher starting salary and offered more career mentoring to the male applicant. The gender of the faculty participants did not affect responses, such that female and male faculty were equally likely to exhibit bias against the female student.

But then...

From Williams and Ceci, 2015, PNAS:

Here we report five hiring experiments in which faculty evaluated hypothetical female and male applicants, using systematically varied profiles disguising identical scholarship, for assistant professorships in biology, engineering, economics, and psychology. Contrary to prevailing assumptions, men and women faculty members from all four fields preferred female applicants 2:1 over identically qualified males with matching lifestyles (single, married, divorced), with the exception of male economists, who showed no gender preference.

Do these conflict?

Not necessarily. One explanation, from William and Ceci:

Instead, past studies have used ratings of students’ hirability for a range of posts that do not include tenure-track jobs, such as managing laboratories or performing math assignments for a company. However, hiring tenure-track faculty differs from hiring lower-level staff: it entails selecting among highly accomplished candidates, all of whom have completed Ph.D.s and amassed publications and strong letters of support. Hiring bias may occur when applicants’ records are ambiguous, as was true in studies of hiring bias for lower-level staff posts, but such bias may not occur when records are clearly strong, as is the case with tenure-track hiring.

Do these conflict?

From Witteman, et al, 2019, in The Lancet:

Thus, evidence of scientists favouring women comes exclusively from hypothetical scenarios, whereas evidence of scientists favouring men comes from hypothetical scenarios and real behaviour. This might reflect academics' growing awareness of the social desirability of achieving gender balance, while real academic behaviour might not yet put such ideals into action.

Example: Restaurant reviews and phone types

Phone Type Stars for Dirty Birds Stars for The Loft
Android 4.24 4.0
iPhone 2.99 2.79
All 3.32 3.37

Restaurant reviews and phone types

Verifying Simpson's paradox

Aggregated means:

Disaggregated means:

Takeaways

Be skeptical of...

Further reading

Concatenating vertically

Segue

For the rest of this week, we will look at how to combine multiple DataFrames.

Data spread across multiple files

The SSA baby names dataset from Lecture 1/2 was downloaded as multiple files – one per year.

Question: How do we combine multiple datasets?

Row-wise combination of data: pd.concat

Example: Grades

By default, pd.concat stacks DataFrames row-wise, i.e. on top of one another.

Let's use pd.concat on a list of the above two DataFrames.

Setting the optional argument ignore_index to True fixes the index (which .reset_index() also could do).

To keep track of which original DataFrame each row came from, we can use the keys optional argument.

The resulting DataFrame has a MultiIndex, though.

Missing columns?

If we concatenate two DataFrames that don't share the same column names, NaNs are added in the columns that aren't shared.

⚠️ Warning: No loops!

total = pd.DataFrame()
for df in dataframes:
    total = total.concat(df)

Aside: accessing file names programmatically

The following does something similar.

Aside: Working with time series data

Time series – why now?

Datetime types

When working with time data, you will see two different kinds of "times":

The datetime module

Python has an in-built datetime module, which contains datetime and timedelta types. These are much more convenient to deal with than strings that contain times.

Recall, Unix timestamps count the number of seconds since January 1st, 1970.

Times in pandas

Timestamps have time-related attributes, e.g. dayofweek, hour, min, sec.

Subtracting timestamps yields pd.Timedelta objects.

Timestamps in DataFrames

Example: Exam speeds

Below, we have the Final Exam starting and ending times for two sections of a course.

Question: Who finished the exam the fastest amongst all students in the course?

Approach:

  1. Concatenate the two DataFrames.
  2. Convert the time columns to pd.Timestamp.
  3. Find the difference between 'finish_exam' and 'start_exam'.
  4. Sort.
  5. Pick the fastest exam taker.

Summary, next time

Summary