In [1]:
# Run this cell to set up packages for lecture.
from lec09_imports import *

Lecture 9 – Grouping on Multiple Columns, Merging¶

DSC 10, Winter 2025¶

Agenda¶

  • Grouping on multiple columns.
  • Merging.

Grouping on multiple columns¶

DSC 10 student data¶

In [6]:
roster = bpd.read_csv('data/roster-anon.csv')
roster
Out[6]:
name section
0 Shawn Hhnxoq 10AM
1 Tom Egzuaz 11AM
2 Jiahao Zvwwyb 11AM
... ... ...
237 Jason Eglntp 11AM
238 Renee Fhlaos 11AM
239 Vivek Tbedny 11AM

240 rows × 2 columns

Recall, last class, we extracted the first name of each student in the class.

In [9]:
def first_name(full_name):
    '''Returns the first name given a full name.'''
    return full_name.split(' ')[0]
In [12]:
roster = roster.assign(
    first=roster.get('name').apply(first_name)
)
roster
Out[12]:
name section first
0 Shawn Hhnxoq 10AM Shawn
1 Tom Egzuaz 11AM Tom
2 Jiahao Zvwwyb 11AM Jiahao
... ... ... ...
237 Jason Eglntp 11AM Jason
238 Renee Fhlaos 11AM Renee
239 Vivek Tbedny 11AM Vivek

240 rows × 3 columns

How many students named Ryan are in each section?¶

We discovered that Ryan and Andrew were tied for the most popular first names overall.

In [15]:
name_counts = (
    roster
    .groupby('first')
    .count()
    .sort_values('name', ascending=False)
    .get(['name'])
)
name_counts
Out[15]:
name
first
Ryan 4
Andrew 4
Nathan 3
... ...
Jiahao 1
Jimbo 1
Zora 1

212 rows × 1 columns

To find the number of 'Ryan's in each lecture section, we can query for only the rows corresponding to 'Ryan's, and then group by 'section'.

In [18]:
roster[roster.get('first') == 'Ryan'].groupby('section').count()
Out[18]:
name first
section
10AM 1 1
11AM 3 3

But what if we want to know the number of 'Vanessa's and 'Nathan's per section, too?

In [21]:
roster[roster.get('first') == 'Vanessa'].groupby('section').count()
Out[21]:
name first
section
10AM 2 2
11AM 1 1
In [23]:
roster[roster.get('first') == 'Nathan'].groupby('section').count()
Out[23]:
name first
section
10AM 2 2
11AM 1 1

Is there a way to do this for all first names and sections all at once?

How many students with each first name does each lecture section have?¶

  • Right now, we can count the number of students with each first name, by grouping roster by 'first'.
In [28]:
# One row per unique first name.
roster.groupby('first').count().get(['name'])
Out[28]:
name
first
Aarav 1
Abduboriyjon 1
Achintya 1
... ...
Zixuan 1
Ziyang 1
Zora 1

212 rows × 1 columns

  • We can also count the number of students in each lecture section, by grouping roster by 'section'.
In [31]:
# One row per unique section.
roster.groupby('section').count().get(['name'])
Out[31]:
name
section
10AM 125
11AM 115
  • However, neither of the above DataFrames give us the number of students with each first name in each section.
    • For instance, neither result tells me the number of 'Ryan's in the 9AM section or the number of 'Vanessa's in the 10AM section.
  • It would be nice if we could group by both 'first' and 'section' – and we can!

Grouping on multiple columns¶

In [36]:
roster
Out[36]:
name section first
0 Shawn Hhnxoq 10AM Shawn
1 Tom Egzuaz 11AM Tom
2 Jiahao Zvwwyb 11AM Jiahao
... ... ... ...
237 Jason Eglntp 11AM Jason
238 Renee Fhlaos 11AM Renee
239 Vivek Tbedny 11AM Vivek

240 rows × 3 columns

We can pass a list of column names to .groupby!

In [39]:
roster.groupby(['section', 'first']).count()
Out[39]:
name
section first
10AM Abduboriyjon 1
Achintya 1
Agustin 1
... ... ...
11AM Yunshan 1
Ziyang 1
Zora 1

226 rows × 1 columns

The above DataFrame is telling us, for instance, that there is 1 student with the first name 'Zora' in the 11AM section.

It is not saying that there is only one 'Zora' in the course overall. There could be more in the other sections.

Grouping on multiple columns¶

  • To group on multiple columns, pass a list of column names to .groupby:

df.groupby(['col_1', 'col_2', ..., 'col_k'])
  • Group by 'col_1' first. Within each group, group by 'col_2', and so on.
  • Important: The resulting DataFrame has one row per unique combination of entries in the specified columns.
    • On the previous slide, we had exactly one row for every combination of 'section' and 'first'.
  • Formally, when we group on multiple columns, we are creating subgroups – that is, groups within groups.
    • On the previous slide, we first grouped by 'section', and within each section, we grouped by 'first'.

Notice the index... 🤔¶

  • This is called a "MultiIndex".
    • The DataFrame is indexed by 'section' and 'first'.
  • We won't worry about the details of MultiIndexes.
  • We can use .reset_index() to "flatten" our DataFrame back to normal.
In [47]:
roster.groupby(['section', 'first']).count().reset_index()
Out[47]:
section first name
0 10AM Abduboriyjon 1
1 10AM Achintya 1
2 10AM Agustin 1
... ... ... ...
223 11AM Yunshan 1
224 11AM Ziyang 1
225 11AM Zora 1

226 rows × 3 columns

Does order matter?¶

In [49]:
roster.groupby(['section', 'first']).count().reset_index()
Out[49]:
section first name
0 10AM Abduboriyjon 1
1 10AM Achintya 1
2 10AM Agustin 1
... ... ... ...
223 11AM Yunshan 1
224 11AM Ziyang 1
225 11AM Zora 1

226 rows × 3 columns

In [50]:
roster.groupby(['first', 'section']).count().reset_index()
Out[50]:
first section name
0 Aarav 11AM 1
1 Abduboriyjon 10AM 1
2 Achintya 10AM 1
... ... ... ...
223 Zixuan 10AM 1
224 Ziyang 11AM 1
225 Zora 11AM 1

226 rows × 3 columns

Answer: Kind of. The order of the rows and columns will be different, but the content will be the same.

Activity¶

  1. Using counts, find the lecture section with the most 'Amelia's.
  2. Using counts, find the longest first name in the class that is shared by at least two students in the same section.
✅ Click here to see the solutions after you've tried both problems.

amelia_counts = counts[counts.get('first') == 'Amelia']
amelia_counts.sort_values('name', ascending=False).get('section').iloc[0]

with_len = counts.assign(length=counts.get('first').apply(len))
with_len[with_len.get('name') >= 2].sort_values('length', ascending=False).get('first').iloc[0]

In [54]:
counts = roster.groupby(['section', 'first']).count().reset_index()
counts
Out[54]:
section first name
0 10AM Abduboriyjon 1
1 10AM Achintya 1
2 10AM Agustin 1
... ... ... ...
223 11AM Yunshan 1
224 11AM Ziyang 1
225 11AM Zora 1

226 rows × 3 columns

In [ ]:
 
In [ ]:
 

Example: Sea temperatures 🌊¶

This dataset contains the sea surface temperature in La Jolla, on many days ranging from August 22, 1916 to June 30, 2024.

In [58]:
sea_temp = bpd.read_csv('data/sea_temp.csv')
sea_temp
Out[58]:
YEAR MONTH DAY SURFACE_TEMP
0 1916 8 22 19.5
1 1916 8 23 19.9
2 1916 8 24 19.7
... ... ... ... ...
38088 2024 6 27 21.3
38089 2024 6 28 20.8
38090 2024 6 30 21.5

38091 rows × 4 columns

Concept Check ✅ – Answer at cc.dsc10.com¶

We want to find the single month (e.g. November 1998) with the highest average 'SURFACE_TEMP'.

Which of the following would help us achieve this goal?

A. sea_temp.groupby('SURFACE_TEMP').mean()

B. sea_temp.groupby('MONTH').mean()

C. sea_temp.groupby(['YEAR', 'MONTH']).mean()

D. sea_temp.groupby(['MONTH', 'DAY']).mean()

E. sea_temp.groupby(['MONTH', 'SURFACE_TEMP']).mean()

In [60]:
...
Out[60]:
Ellipsis

Plots of monthly and yearly average surface temperature 📈¶

In [62]:
(sea_temp
 .groupby('MONTH') 
 .mean() 
 .plot(kind='line', y='SURFACE_TEMP')
);
No description has been provided for this image
In [66]:
# Why is there a sudden drop at the end? Look at the dates of data collection!
(sea_temp
 .groupby('YEAR') 
 .mean() 
 .plot(kind='line', y='SURFACE_TEMP')
);
No description has been provided for this image

Summary: Grouping on multiple columns¶

  • Pass a list of columns to .groupby to group on multiple columns. This creates groups within groups.
  • Use .reset_index() after grouping on multiple columns to move the MultiIndex back to the columns.

Merging 🚙¶

In [72]:
offer_percentage = bpd.DataFrame().assign(
    clothing_type=['Shirt', 'Pants', 'Dress', 'Shorts', 'Shoes'],
    offer_percentage=[20, 30, 50, 30, 50]
)

clothes = bpd.DataFrame().assign(
    item=['Dress', 'Shirt', 'Shoes', 'Pants', 'Shoes'],
    retail_price=[150, 30, 90, 50, 70]
)

Example: Clothing Resale 👕¶

In [75]:
# The percentage of retail price that I can earn for reselling my clothes.
offer_percentage
Out[75]:
clothing_type offer_percentage
0 Shirt 20
1 Pants 30
2 Dress 50
3 Shorts 30
4 Shoes 50
In [76]:
# The items I want to sell and their retail prices.
clothes
Out[76]:
item retail_price
0 Dress 150
1 Shirt 30
2 Shoes 90
3 Pants 50
4 Shoes 70
  • Question: If I sell all of the clothes in my collection, how much will I earn?
  • Issue: The information I need to answer the question is spread across multiple DataFrames.

If I sell all of the clothes in my collection, how much will I earn?¶

In [81]:
clothes_merged = offer_percentage.merge(clothes, left_on='clothing_type', right_on='item')
clothes_merged
Out[81]:
clothing_type offer_percentage item retail_price
0 Shirt 20 Shirt 30
1 Pants 30 Pants 50
2 Dress 50 Dress 150
3 Shoes 50 Shoes 90
4 Shoes 50 Shoes 70

What just happened!? 🤯¶

In [84]:
# Click through the presentation that appears.
merging_animation()

.merge¶

To "merge" two DataFrames:

  • Pick a "left" and "right" DataFrame.
  • Choose a column from each to "merge on".
left_df.merge(
    right_df, 
    left_on='left_col_name',
    right_on='right_col_name'
)
  • left_on and right_on should be column names (they don't have to be the same).
  • The resulting DataFrame contains a single row for every match between the two columns.
  • Rows in either DataFrame without a match disappear!

If I sell all of the clothes in my collection, how much will I earn?¶

In [89]:
clothes_merged = offer_percentage.merge(clothes, left_on='clothing_type', right_on='item')
clothes_merged
Out[89]:
clothing_type offer_percentage item retail_price
0 Shirt 20 Shirt 30
1 Pants 30 Pants 50
2 Dress 50 Dress 150
3 Shoes 50 Shoes 90
4 Shoes 50 Shoes 70
In [91]:
# If I sell all of the clothes in my collection, how much will I earn?
(clothes_merged.get('offer_percentage') / 100 * clothes_merged.get('retail_price')).sum() 
Out[91]:
176.0

Does it matter which DataFrame is the left or right DataFrame? 🤔¶

In [94]:
offer_percentage.merge(clothes, left_on='clothing_type', right_on='item')
Out[94]:
clothing_type offer_percentage item retail_price
0 Shirt 20 Shirt 30
1 Pants 30 Pants 50
2 Dress 50 Dress 150
3 Shoes 50 Shoes 90
4 Shoes 50 Shoes 70
In [98]:
clothes.merge(offer_percentage, left_on='item', right_on='clothing_type')
Out[98]:
item retail_price clothing_type offer_percentage
0 Dress 150 Dress 50
1 Shirt 30 Shirt 20
2 Shoes 90 Shoes 50
3 Pants 50 Pants 30
4 Shoes 70 Shoes 50

Answer: The order of the rows and columns will be different, but the content will be the same.

Special cases¶

What if the names of the columns we want to merge on are both the same?¶

Instead of using left_on='col' and right_on='col', you can just say on='col'.

In [103]:
offer_percentage
Out[103]:
clothing_type offer_percentage
0 Shirt 20
1 Pants 30
2 Dress 50
3 Shorts 30
4 Shoes 50
In [105]:
clothes_relabeled = clothes.assign(clothing_type=clothes.get('item')).drop(columns=['item'])
clothes_relabeled
Out[105]:
retail_price clothing_type
0 150 Dress
1 30 Shirt
2 90 Shoes
3 50 Pants
4 70 Shoes

In this example, the column we want to merge on in both DataFrames is named 'clothing_type', so we can just use on='clothing_type'.

In [108]:
offer_percentage.merge(clothes_relabeled, on='clothing_type')
Out[108]:
clothing_type offer_percentage retail_price
0 Shirt 20 30
1 Pants 30 50
2 Dress 50 150
3 Shoes 50 90
4 Shoes 50 70

Notice: There's only one column containing the type of clothing now.

What if we want to merge using an index instead of a column?¶

Instead of using left_on or right_on, use left_index=True or right_index=True.

In [112]:
offers_by_item = offer_percentage.set_index('clothing_type')
offers_by_item
Out[112]:
offer_percentage
clothing_type
Shirt 20
Pants 30
Dress 50
Shorts 30
Shoes 50
In [114]:
clothes
Out[114]:
item retail_price
0 Dress 150
1 Shirt 30
2 Shoes 90
3 Pants 50
4 Shoes 70

In this example, we want to merge using the index in the left DataFrame (offers_by_item) and the item column in the right DataFrame (clothes).

In [117]:
offers_by_item.merge(clothes, left_index=True, right_on='item')
Out[117]:
offer_percentage item retail_price
1 20 Shirt 30
3 30 Pants 50
0 50 Dress 150
2 50 Shoes 90
4 50 Shoes 70

Concept Check ✅ – Answer at cc.dsc10.com¶

How many rows are in weather.merge(schools, on='city')?

A. 4    B. 5    C. 6    D. 7    E. 8
In [120]:
concept_check()

Followup activity¶

How many rows are in weather.merge(schools, on='state')?

More practice!¶

Here are related exam problems to help you practice merging:

  • Problem 5 from the Fall 2021 midterm.
  • Problem 7 from the Fall 2022 midterm.

Summary, next time¶

Summary¶

  • To group on multiple columns, pass a list to .groupby.
    • The result has one row for every unique combination of elements in the specified columns.
  • To combine information from multiple DataFrames, use .merge.
    • When using .merge, Python searches for a match between a specified column in each DataFrame and combines the rows with a match.
    • If there are no matches, the row disappears!

Next time¶

  • if-statements, to execute code only when certain conditions are met.
  • for-loops, to repeat code many times.
  • Both are foundational programming tools. 🛠