Lecture 4 – Grouping

DSC 80, Spring 2023

Agenda

Data granularity

Granularity

Name Assignment Score
Billy Homework 1 94
Sally Homework 1 98
Molly Homework 1 82
Sally Homework 2 47

Levels of granularity

Each student submits CAPEs once for each course they are in.

Student Name Quarter Course Instructor Recommend? Expected Grade Hours Per Week Comments
Willy SP23 DSC 80 Tauhidur Rahman No A- 14 I hate this class
Billy WI23 DSC 80 Suraj Rampure No A- 12 Fantastic
Billy WI23 DSC 40B Justin Eldridge Yes B+ 9 go big O
Sally WI23 DSC 10 Janine Tiefenbruck Yes A 11 baby pandas are so cute
Molly WI23 DSC 80 Suraj Rampure Yes A+ 2 I wish there was music in class
Molly WI23 DSC 95 Marina Langlois No A 3 I loved DSC 30, but 95 wasn't hard enough :(

Only instructors can see individual responses. At cape.ucsd.edu, overall class statistics are visible.

Quarter Course Instructor Recommend (%) Expected Grade Hours Per Week
SP23 DSC 80 Tauhidur Rahman 6% 3.00 (B) 15.32
WI23 DSC 80 Suraj Rampure 96% 3.15 (B) 13.32
WI23 DSC 40B Justin Eldridge 89% 3.35 (B+) 8.54
WI23 DSC 10 Janine Tiefenbruck 94% 3.45 (B+) 11.49
WI23 DSC 95 Marina Langlois 91% 4.0 (A) 9.21

btw, this is not real data :) .

The university may be interested in looking at CAPEs results by department.

Quarter Department Recommend (%) Expected Grade Hours Per Week
WI23 DSC 91% 3.01 (B) 12.29
WI23 BILD 85% 2.78 (C+) 13.21

Prospective students may be interested in comparing course evaluations across different universities.

University Recommend (%) Average GPA Hours Per Week
UC San Diego 94% 3.12 (B) 42.19
UC Irvine 89% 3.15 (B) 38.44
SDSU 88% 2.99 (B-) 36.89

Collecting data

Manipulating granularity

Example: Penguins

Artwork by @allison_horst

The dataset we'll work with for the rest of the lecture involves various measurements taken of three species of penguins in Antarctica.

Discussion Question

Create a Series, indexed by 'species', that contains the mean 'body_mass_g' of each 'species'.

Naive approach: looping through unique values

Grouping

🤔

Somehow, the groupby method computes what we're looking for in just one line. How?

Aside: Pandas Tutor

Pandas Tutor is the equivalent of Python Tutor, which you saw in DSC 20, but for pandas.

You can use it by:

We'll do the latter, since it doesn't require us leaving our Jupyter Notebook (and because the website doesn't allow us to upload datasets).

"Split-apply-combine" paradigm

The groupby method involves three steps: split, apply, and combine. This is the same terminology that the pandas documentation uses.

More examples

Before we dive deep into the internals, let's look at a few more examples.

Which 'species' has the highest median 'bill_length_mm'?

What proportion of penguins of each 'species' live on 'Dream' island?

DataFrameGroupBy objects and aggregation

DataFrameGroupBy objects

We've just evaluated a few expressions of the following form.

There are two method calls in the expression above: .groupby('species') and .mean(). What happens if we remove the latter?

Peeking under the hood

If df is a DataFrame, then df.groupby(key) returns a DataFrameGroupBy object.

This object represents the "split" in "split-apply-combine".

DataFrameGroupBy objects have a groups attribute, which is a dictionary in which the keys are group names and the values are lists of row labels.

DataFrameGroupBy objects also have a get_group(key) method, which returns a DataFrame with only the values for the given key.

We usually don't use these attributes and methods, but they're useful in understanding how groupby works under the hood.

Aggregation

Let's look at some examples.

Column independence

Within each group, the aggregation method is applied to each column independently.

It is not telling us that there is an 'Adelie' penguin on 'Torgersen' island with a 'body_mass_g' of 4775.0!

Discussion Question

Find the 'island' on which the heaviest penguin of each 'species' lives.

Column selection and performance implications

To demonstrate that the former is slower than the latter, we can use %%timeit. For reference, we'll also include our earlier for-loop-based solution.

Takeaways

Beyond default aggregation methods

The aggregate method

Example

How many penguins are there of each 'species', and what is the mean 'body_mass_g' of each species?

Note what happens when we don't select a column before aggregating.

Example

What is the maximum 'bill_length_mm' of each species, and which 'island's is each 'species' found on?

Example

What is the interquartile range of the 'body_mass_g' of each 'species'?

Other DataFrameGroupBy methods

Split-apply-combine, revisited

When we introduced the split-apply-combine pattern, the "apply" step involved aggregation – our final DataFrame had one row for each group.

Instead of aggregating during the apply step, we could instead perform a:

Transformations

$$z(x_i) = \frac{x_i - \text{mean of } x}{\text{SD of } x}$$

Note that below, penguin 276 has a larger 'body_mass_g' than penguin 65, but a lower 'z_mass'.

Filtering

For example, suppose we want only the 'species' whose 'bill_length_mm' is above 39.

No more 'Adelie's!

Or, as another example, suppose we only want the 'species' with at least 100 penguins:

No more 'Chinstrap's!

Grouping with multiple columns

When we group with multiple columns, one group is created for every unique combination of elements in the specified columns.

Grouping and indexes

Summary, next time

Summary, next time