Lecture 7 – Grouping and Pivoting

DSC 80, Spring 2022

Announcements

Agenda

Data granularity

Granularity

Name Assignment Score
Billy Homework 1 94
Sally Homework 1 98
Molly Homework 1 82
Sally Homework 2 47

Levels of granularity

Each student submits CAPEs once for each course they are in.

Student Name Quarter Course Instructor Recommend? Expected Grade Hours Per Week Comments
Billy SP22 DSC 80 Suraj Rampure No A- 14 I hate this class
Billy SP22 DSC 40B Arya Mazumdar Yes B+ 9 go big O
Sally SP22 DSC 10 Janine Tiefenbruck Yes A 11 babypandas are so cute
Molly SP22 DSC 80 Suraj Rampure Yes A+ 2 I wish there was music in class

Only instructors can see individual responses. At cape.ucsd.edu, only overall class statistics are visible.

Quarter Course Instructor Recommend (%) Expected Grade Hours Per Week
SP22 DSC 80 Suraj Rampure 23% 3.15 (B) 13.32
SP22 DSC 40B Arya Mazumdar 89% 3.35 (B+) 8.54
SP22 DSC 10 Janine Tiefenbruck 94% 3.45 (B+) 11.49

The university may be interested in looking at CAPEs results by department.

Quarter Department Recommend (%) Expected Grade Hours Per Week
SP22 DSC 91% 3.01 (B) 12.29
SP22 BILD 85% 2.78 (C+) 13.21

Prospective students may be interested in comparing course evaluations across different universities.

University Recommend (%) Average GPA Hours Per Week
UC San Diego 94% 3.12 (B) 42.19
UC Irvine 89% 3.15 (B) 38.44
SDSU 88% 2.99 (B-) 36.89

Collecting data

Manipulating granularity

Discussion Question

What is the average number of 'Years' for each 'Degree'? Write code that finds the answer as a Series indexed by 'Degree'.

Approach 1: Looping through unique values

For each unique 'Degree', we make a pass through the entire dataset.

Approach 2: Single pass

Let's try and avoid passing over the dataset repeatedly.

You can iterate over the rows of a DataFrame using the iterrows method (though you should rarely need to do this):

Issues with the previous solutions

GroupBy

🤔

Aside: Pandas Tutor

Split-apply-combine

Runtime considerations

Example: Penguins 🐧

For each species...

What is the median bill length?

What proportion live on Dream Island?

Now that we understand how to use groupby, let's dive deeper into how it works.

Accessing groups

Aggregation

Column selection

Additional GroupBy methods

Aggregation methods

The aggregate method

Example

How many penguins are there of each species, and what is the mean body mass of each species?

Note what happens when we don't select a column before aggregating.

Example

What is the max bill length of each species, and how many islands is each species found on?

Example

What is the interquartile range of the body mass of each species?

The transform method

The filter method

No more Adelies!

The apply method

Discussion Question

For each species, find the island on which the heaviest penguin of that species lives.

Grouping with multiple columns

When we group with multiple columns, one group is created for every unique combination of elements in the specified columns.

Grouping and indexes

Introduction to pivot tables

Average body mass for every combination of species and island

To find the above information, we can group by both 'species' and 'island'.

But we can also create a pivot table.

Note that the DataFrame above shows the same information as the Series above it, just in a different arrangement.

Summary, next time

Summary