Lecture 10 – Grouping with Subgroups, Merging

DSC 10, Fall 2022

Announcements

Agenda

Grouping with subgroups

DSC 10 student data

Recall, last class, we extracted the first name of each student in the class.

How many students named 'Ethan' are in each section?

We discovered that 'Ethan' is the most popular first name overall.

To find the number of 'Ethan's in each lecture section, we can query for only the rows corresponding to 'Ethan's, and then group by 'section'.

But what if we want to know the number of 'Emily's and 'Yuen's per section, too?

Is there a way to do this for all first names and sections all at once?

How many students with each first name does each lecture section have?

The above DataFrame is telling us, for instance, that there is 1 student with the first name 'Adrian' in the 10AM section.

It is not saying that there is only 1 'Adrian' in the course overall – in fact, there are 2!

.groupby with subgroups


df.groupby([col_1, col_2, ..., col_k])

Notice the index... 🤔

Does order matter?

Answer: Kind of. The order of the rows and columns will be different, but the content will be the same.

Activity

Using counts, find the lecture section with the most 'Ryan's.

Activity

Using counts, find the longest first name in the class that is shared by at least two students in the same section.

Note: This was an activity in the last class. There, we had to use our shared_first_and_section function; that's not needed here!

New dataset: Sea temperatures 🌊

This dataset contains the sea surface temperature in La Jolla, on many days ranging from August 22, 1916 to December 31, 2020.

Concept Check ✅ – Answer at cc.dsc10.com

We want to find the single month (e.g. November 1998) with the highest average 'SURFACE_TEMP'.

Which of the following would help us achieve this goal?

A. sea_temp.groupby('SURFACE_TEMP').mean()

B. sea_temp.groupby('MONTH').mean()

C. sea_temp.groupby(['YEAR', 'MONTH']).mean()

D. sea_temp.groupby(['MONTH', 'DAY']).mean()

E. sea_temp.groupby(['MONTH', 'SURFACE_TEMP']).mean()

Plots of monthly and yearly average surface temperature 📈

Summary: .groupby with subgroups

Merging 🚗

Question: If I sell all of the phones in my inventory, how much will I make in revenue?

If I sell all of the phones in my inventory, how much will I make in revenue?

What just happened!? 🤯

.merge

If I sell all of the phones in my inventory, how much will I make in revenue?

Shortcut if column names are the same: on

Notice: There's only one column containing phone names now.

Does order matter? 🤔

Answer: The order of the rows and columns will be different, but the content will be the same.

What if we want to "merge on" an index?

Instead of using left_on or right_on, use left_index=True or right_index=True.

Activity setup

Concept Check ✅ – Answer at cc.dsc10.com

Without writing code, how many rows are in nice_weather_cities.merge(schools, on='city')?

A. 4    B. 5    C. 6    D. 7    E. 8

Followup activity

Without writing code, how many rows are in nice_weather_cities.merge(schools, on='state')?

Hint: It's more than you might guess at first!

Summary, next time

Summary

Next time