# Run this cell to set up packages for lecture.
from lec09_imports import *
Announcements¶
- Lab 2 is due tonight, 11:59PM.
- The Midterm Project is coming out soon. Try to find a partner!👯♀️
- Working with a single partner is option but highly recommended. You must follow these project partner guidelines.
Agenda¶
- Grouping on multiple columns.
- Merging.
Grouping on multiple columns¶
DSC 10 student data¶
roster = bpd.read_csv('data/roster-anon.csv')
roster
name | section | |
---|---|---|
0 | Allie Sazhma | 11AM |
1 | Amina Igxazd | 10AM |
2 | Jazmine Enesxr | 9AM |
... | ... | ... |
219 | Ismayl Gwuiij | 10AM |
220 | Neil Dkaqgm | 10AM |
221 | Maggie Ldfgau | 9AM |
222 rows × 2 columns
Recall, last class, we extracted the first name of each student in the class.
def first_name(full_name):
'''Returns the first name given a full name.'''
return full_name.split(' ')[0]
roster = roster.assign(
first=roster.get('name').apply(first_name)
)
roster
name | section | first | |
---|---|---|---|
0 | Allie Sazhma | 11AM | Allie |
1 | Amina Igxazd | 10AM | Amina |
2 | Jazmine Enesxr | 9AM | Jazmine |
... | ... | ... | ... |
219 | Ismayl Gwuiij | 10AM | Ismayl |
220 | Neil Dkaqgm | 10AM | Neil |
221 | Maggie Ldfgau | 9AM | Maggie |
222 rows × 3 columns
How many students named Kevin are in each section?¶
We discovered that Kevin and Ryan are the most popular first names overall.
name_counts = (
roster
.groupby('first')
.count()
.sort_values('name', ascending=False)
.get(['name'])
)
name_counts
name | |
---|---|
first | |
Kevin | 4 |
Ryan | 4 |
Noah | 3 |
... | ... |
Hongyu | 1 |
Hriday | 1 |
Zixuan | 1 |
200 rows × 1 columns
To find the number of 'Kevin'
s in each lecture section, we can query for only the rows corresponding to 'Kevin'
s, and then group by 'section'
.
roster[roster.get('first') == 'Kevin'].groupby('section').count()
name | first | |
---|---|---|
section | ||
11AM | 3 | 3 |
9AM | 1 | 1 |
But what if we want to know the number of 'Rishabh'
s and 'Rafi'
s per section, too?
roster[roster.get('first') == 'Rishabh'].groupby('section').count()
name | first | |
---|---|---|
section | ||
10AM | 1 | 1 |
roster[roster.get('first') == 'Rafi'].groupby('section').count()
name | first | |
---|---|---|
section | ||
11AM | 1 | 1 |
Is there a way to do this for all first names and sections all at once?
How many students with each first name does each lecture section have?¶
- Right now, we can count the number of students with each first name, by grouping
roster
by'first'
.
# One row per unique first name.
roster.groupby('first').count().get('name')
first Aadya 1 Aanya 1 Aarav 1 .. Yuxun 1 Zach 1 Zixuan 1 Name: name, Length: 200, dtype: int64
- We can also count the number of students in each lecture section, by grouping
roster
by'section'
.
# One row per unique section.
roster.groupby('section').count().get('name')
section 10AM 82 11AM 93 9AM 47 Name: name, dtype: int64
- However, neither of the above Series give us the number of students with each first name in each section.
- For instance, neither result tells me the number of
'Kevin'
s in the 11AM section or the number of'Rishabh'
s in the 10AM section.
- For instance, neither result tells me the number of
- It would be nice if we could group by both
'first'
and'section'
– and we can!
Grouping on multiple columns¶
roster
name | section | first | |
---|---|---|---|
0 | Allie Sazhma | 11AM | Allie |
1 | Amina Igxazd | 10AM | Amina |
2 | Jazmine Enesxr | 9AM | Jazmine |
... | ... | ... | ... |
219 | Ismayl Gwuiij | 10AM | Ismayl |
220 | Neil Dkaqgm | 10AM | Neil |
221 | Maggie Ldfgau | 9AM | Maggie |
222 rows × 3 columns
We can pass a list of column names to .groupby
!
roster.groupby(['section', 'first']).count()
name | ||
---|---|---|
section | first | |
10AM | Aadya | 1 |
Abigail | 1 | |
Achintya | 1 | |
... | ... | ... |
9AM | Yisi | 1 |
Yujie | 1 | |
Zixuan | 1 |
210 rows × 1 columns
The above DataFrame is telling us, for instance, that there is 1 student with the first name 'Aadya'
in the 10AM section.
It is not saying that there is only one 'Aadya'
in the course overall. There could be more in the other sections.
Grouping on multiple columns¶
- To group on multiple columns, pass a list of column names to
.groupby
:
df.groupby(['col_1', 'col_2', ..., 'col_k'])
- Group by
'col_1'
first. Within each group, group by'col_2'
, and so on.
- Important: The resulting DataFrame has one row per unique combination of entries in the specified columns.
- On the previous slide, we had exactly one row for every combination of
'section'
and'first'
.
- On the previous slide, we had exactly one row for every combination of
- Formally, when we group on multiple columns, we are creating subgroups – that is, groups within groups.
- On the previous slide, we first grouped by
'section'
, and within each section, we grouped by'first'
.
- On the previous slide, we first grouped by
Notice the index... 🤔¶
- This is called a "MultiIndex".
- The DataFrame is indexed by
'section'
and'first'
.
- The DataFrame is indexed by
- We won't worry about the details of MultiIndexes.
- We can use
.reset_index()
to "flatten" our DataFrame back to normal.
roster.groupby(['section', 'first']).count().reset_index()
section | first | name | |
---|---|---|---|
0 | 10AM | Aadya | 1 |
1 | 10AM | Abigail | 1 |
2 | 10AM | Achintya | 1 |
... | ... | ... | ... |
207 | 9AM | Yisi | 1 |
208 | 9AM | Yujie | 1 |
209 | 9AM | Zixuan | 1 |
210 rows × 3 columns
Does order matter?¶
roster.groupby(['section', 'first']).count().reset_index()
section | first | name | |
---|---|---|---|
0 | 10AM | Aadya | 1 |
1 | 10AM | Abigail | 1 |
2 | 10AM | Achintya | 1 |
... | ... | ... | ... |
207 | 9AM | Yisi | 1 |
208 | 9AM | Yujie | 1 |
209 | 9AM | Zixuan | 1 |
210 rows × 3 columns
roster.groupby(['first', 'section']).count().reset_index()
first | section | name | |
---|---|---|---|
0 | Aadya | 10AM | 1 |
1 | Aanya | 11AM | 1 |
2 | Aarav | 11AM | 1 |
... | ... | ... | ... |
207 | Yuxun | 10AM | 1 |
208 | Zach | 11AM | 1 |
209 | Zixuan | 9AM | 1 |
210 rows × 3 columns
Answer: Kind of. The order of the rows and columns will be different, but the content will be the same.
Activity¶
Using counts
, find the lecture section with the most 'Kevin'
s.
✅ Click here to see the solution after you've tried it yourself.
kevin_counts = counts[counts.get('first') == 'Kevin'] kevin_counts.sort_values('name', ascending=False).get('section').iloc[0]
counts = roster.groupby(['section', 'first']).count().reset_index()
counts
section | first | name | |
---|---|---|---|
0 | 10AM | Aadya | 1 |
1 | 10AM | Abigail | 1 |
2 | 10AM | Achintya | 1 |
... | ... | ... | ... |
207 | 9AM | Yisi | 1 |
208 | 9AM | Yujie | 1 |
209 | 9AM | Zixuan | 1 |
210 rows × 3 columns
...
Ellipsis
Activity¶
Using counts
, find the longest first name in the class that is shared by at least two students in the same section.
✅ Click here to see the solution after you've tried it yourself.
with_len = counts.assign(length=counts.get('first').apply(len)) with_len[with_len.get('name') >= 2].sort_values('length', ascending=False).get('first').iloc[0]
...
Ellipsis
Example: Sea temperatures 🌊¶
This dataset contains the sea surface temperature in La Jolla, on many days ranging from August 22, 1916 to June 30, 2023.
sea_temp = bpd.read_csv('data/sea_temp.csv')
sea_temp
YEAR | MONTH | DAY | SURFACE_TEMP | |
---|---|---|---|---|
0 | 1916 | 8 | 22 | 19.5 |
1 | 1916 | 8 | 23 | 19.9 |
2 | 1916 | 8 | 24 | 19.7 |
... | ... | ... | ... | ... |
37738 | 2023 | 6 | 28 | 19.7 |
37739 | 2023 | 6 | 29 | 19.3 |
37740 | 2023 | 6 | 30 | 20.6 |
37741 rows × 4 columns
Concept Check ✅ – Answer at cc.dsc10.com¶
We want to find the single month (e.g. November 1998) with the highest average 'SURFACE_TEMP'
.
Which of the following would help us achieve this goal?
A. sea_temp.groupby('SURFACE_TEMP').mean()
B. sea_temp.groupby('MONTH').mean()
C. sea_temp.groupby(['YEAR', 'MONTH']).mean()
D. sea_temp.groupby(['MONTH', 'DAY']).mean()
E. sea_temp.groupby(['MONTH', 'SURFACE_TEMP']).mean()
...
Ellipsis
Plots of monthly and yearly average surface temperature 📈¶
(sea_temp
.groupby('MONTH')
.mean()
.plot(kind='line', y='SURFACE_TEMP')
);
# Why is there a sudden drop at the end? Look at the dates of data collection!
(sea_temp
.groupby('YEAR')
.mean()
.plot(kind='line', y='SURFACE_TEMP')
);
Summary: Grouping on multiple columns¶
- Pass a list of columns to
.groupby
to group on multiple columns. This creates groups within groups. - Use
.reset_index()
after grouping on multiple columns to move the MultiIndex back to the columns.
Merging 🚙¶
offer_percentage = bpd.DataFrame().assign(
clothing_type=['Shirt', 'Pants', 'Dress', 'Shorts', 'Shoes'],
offer_percentage=[20, 30, 50, 30, 50]
)
clothes = bpd.DataFrame().assign(
item=['Dress', 'Shirt', 'Shoes', 'Pants', 'Shoes'],
retail_price=[150, 30, 90, 50, 70]
)
Example: Clothing Resale 👕¶
# The percentage of retail price that I can earn for reselling my clothes.
offer_percentage
clothing_type | offer_percentage | |
---|---|---|
0 | Shirt | 20 |
1 | Pants | 30 |
2 | Dress | 50 |
3 | Shorts | 30 |
4 | Shoes | 50 |
# The items I want to sell and their retail prices.
clothes
item | retail_price | |
---|---|---|
0 | Dress | 150 |
1 | Shirt | 30 |
2 | Shoes | 90 |
3 | Pants | 50 |
4 | Shoes | 70 |
- Question: If I sell all of the clothes in my collection, how much will I earn?
- Issue: The information I need to answer the question is spread across multiple DataFrames.
If I sell all of the clothes in my collection, how much will I earn?¶
clothes_merged = offer_percentage.merge(clothes, left_on='clothing_type', right_on='item')
clothes_merged
clothing_type | offer_percentage | item | retail_price | |
---|---|---|---|---|
0 | Shirt | 20 | Shirt | 30 |
1 | Pants | 30 | Pants | 50 |
2 | Dress | 50 | Dress | 150 |
3 | Shoes | 50 | Shoes | 90 |
4 | Shoes | 50 | Shoes | 70 |
What just happened!? 🤯¶
# Click through the presentation that appears.
merging_animation()
.merge
¶
To "merge" two DataFrames:
- Pick a "left" and "right" DataFrame.
- Choose a column from each to "merge on".
left_df.merge(
right_df,
left_on='left_col_name',
right_on='right_col_name'
)
left_on
andright_on
should be column names (they don't have to be the same).- The resulting DataFrame contains a single row for every match between the two columns.
- Rows in either DataFrame without a match disappear!
If I sell all of the clothes in my collection, how much will I earn?¶
clothes_merged = offer_percentage.merge(clothes, left_on='clothing_type', right_on='item')
clothes_merged
clothing_type | offer_percentage | item | retail_price | |
---|---|---|---|---|
0 | Shirt | 20 | Shirt | 30 |
1 | Pants | 30 | Pants | 50 |
2 | Dress | 50 | Dress | 150 |
3 | Shoes | 50 | Shoes | 90 |
4 | Shoes | 50 | Shoes | 70 |
# If I sell all of the clothes in my collection, how much will I earn?
(clothes_merged.get('offer_percentage') / 100 * clothes_merged.get('retail_price')).sum()
176.0
Does it matter which DataFrame is the left or right DataFrame? 🤔¶
offer_percentage.merge(clothes, left_on='clothing_type', right_on='item')
clothing_type | offer_percentage | item | retail_price | |
---|---|---|---|---|
0 | Shirt | 20 | Shirt | 30 |
1 | Pants | 30 | Pants | 50 |
2 | Dress | 50 | Dress | 150 |
3 | Shoes | 50 | Shoes | 90 |
4 | Shoes | 50 | Shoes | 70 |
clothes.merge(offer_percentage, left_on='item', right_on='clothing_type')
item | retail_price | clothing_type | offer_percentage | |
---|---|---|---|---|
0 | Dress | 150 | Dress | 50 |
1 | Shirt | 30 | Shirt | 20 |
2 | Shoes | 90 | Shoes | 50 |
3 | Shoes | 70 | Shoes | 50 |
4 | Pants | 50 | Pants | 30 |
Answer: The order of the rows and columns will be different, but the content will be the same.
Special cases¶
What if the names of the columns we want to merge on are both the same?¶
Instead of using left_on='col'
and right_on='col'
, you can just say on='col'
.
offer_percentage
clothing_type | offer_percentage | |
---|---|---|
0 | Shirt | 20 |
1 | Pants | 30 |
2 | Dress | 50 |
3 | Shorts | 30 |
4 | Shoes | 50 |
clothes_relabeled = clothes.assign(clothing_type=clothes.get('item')).drop(columns=['item'])
clothes_relabeled
retail_price | clothing_type | |
---|---|---|
0 | 150 | Dress |
1 | 30 | Shirt |
2 | 90 | Shoes |
3 | 50 | Pants |
4 | 70 | Shoes |
In this example, the column we want to merge on in both DataFrames is named 'clothing_type'
, so we can just use on='clothing_type'
.
offer_percentage.merge(clothes_relabeled, on='clothing_type')
clothing_type | offer_percentage | retail_price | |
---|---|---|---|
0 | Shirt | 20 | 30 |
1 | Pants | 30 | 50 |
2 | Dress | 50 | 150 |
3 | Shoes | 50 | 90 |
4 | Shoes | 50 | 70 |
Notice: There's only one column containing the type of clothing now.
What if we want to merge using an index instead of a column?¶
Instead of using left_on
or right_on
, use left_index=True
or right_index=True
.
offers_by_item = offer_percentage.set_index('clothing_type')
offers_by_item
offer_percentage | |
---|---|
clothing_type | |
Shirt | 20 |
Pants | 30 |
Dress | 50 |
Shorts | 30 |
Shoes | 50 |
clothes
item | retail_price | |
---|---|---|
0 | Dress | 150 |
1 | Shirt | 30 |
2 | Shoes | 90 |
3 | Pants | 50 |
4 | Shoes | 70 |
In this example, we want to merge using the index in the left DataFrame (offers_by_item
) and the item
column in the right DataFrame (clothes
).
offers_by_item.merge(clothes, left_index=True, right_on='item')
offer_percentage | item | retail_price | |
---|---|---|---|
1 | 20 | Shirt | 30 |
3 | 30 | Pants | 50 |
0 | 50 | Dress | 150 |
2 | 50 | Shoes | 90 |
4 | 50 | Shoes | 70 |
Activity setup¶
nice_weather_cities = bpd.DataFrame().assign(
city=['La Jolla', 'San Diego', 'Austin', 'Los Angeles'],
state=['California', 'California', 'Texas', 'California'],
today_high_temp=['79', '83', '97', '87']
)
schools = bpd.DataFrame().assign(
name=['UCSD', 'University of Chicago', 'University of San Diego','Johns Hopkins University', 'UT Austin', 'SDSU', 'UCLA'],
city=['La Jolla', 'Chicago', 'San Diego', 'Baltimore', 'Austin', 'San Diego', 'Los Angeles'],
state=['California', 'Illinois', 'California', 'Maryland', 'Texas', 'California', 'California'],
graduation_rate=[0.87, 0.94, 0.78, 0.92, 0.81, 0.83, 0.91 ]
)
Concept Check ✅ – Answer at cc.dsc10.com¶
Without writing code, how many rows are in nice_weather_cities.merge(schools, on='city')
?
A. 4 B. 5 C. 6 D. 7 E. 8
nice_weather_cities
city | state | today_high_temp | |
---|---|---|---|
0 | La Jolla | California | 79 |
1 | San Diego | California | 83 |
2 | Austin | Texas | 97 |
3 | Los Angeles | California | 87 |
schools
name | city | state | graduation_rate | |
---|---|---|---|---|
0 | UCSD | La Jolla | California | 0.87 |
1 | University of Chicago | Chicago | Illinois | 0.94 |
2 | University of San Diego | San Diego | California | 0.78 |
3 | Johns Hopkins University | Baltimore | Maryland | 0.92 |
4 | UT Austin | Austin | Texas | 0.81 |
5 | SDSU | San Diego | California | 0.83 |
6 | UCLA | Los Angeles | California | 0.91 |
Followup activity¶
Without writing code, how many rows are in nice_weather_cities.merge(schools, on='state')
?
nice_weather_cities
city | state | today_high_temp | |
---|---|---|---|
0 | La Jolla | California | 79 |
1 | San Diego | California | 83 |
2 | Austin | Texas | 97 |
3 | Los Angeles | California | 87 |
schools
name | city | state | graduation_rate | |
---|---|---|---|---|
0 | UCSD | La Jolla | California | 0.87 |
1 | University of Chicago | Chicago | Illinois | 0.94 |
2 | University of San Diego | San Diego | California | 0.78 |
3 | Johns Hopkins University | Baltimore | Maryland | 0.92 |
4 | UT Austin | Austin | Texas | 0.81 |
5 | SDSU | San Diego | California | 0.83 |
6 | UCLA | Los Angeles | California | 0.91 |
nice_weather_cities.merge(schools, on='state')
city_x | state | today_high_temp | name | city_y | graduation_rate | |
---|---|---|---|---|---|---|
0 | La Jolla | California | 79 | UCSD | La Jolla | 0.87 |
1 | La Jolla | California | 79 | University of San Diego | San Diego | 0.78 |
2 | La Jolla | California | 79 | SDSU | San Diego | 0.83 |
... | ... | ... | ... | ... | ... | ... |
10 | Los Angeles | California | 87 | SDSU | San Diego | 0.83 |
11 | Los Angeles | California | 87 | UCLA | Los Angeles | 0.91 |
12 | Austin | Texas | 97 | UT Austin | Austin | 0.81 |
13 rows × 6 columns
nice_weather_cities.merge(schools, on='state').shape[0]
13
More practice!¶
Here are related exam problems to help you practice merging:
Summary, next time¶
Summary¶
- To group on multiple columns, pass a list to
.groupby
.- The result has one row for every unique combination of elements in the specified columns.
- To combine information from multiple DataFrames, use
.merge
.- When using
.merge
, Python searches for a match between a specified column in each DataFrame and combines the rows with a match. - If there are no matches, the row disappears!
- When using
Next time¶
if
-statements, to execute code only when certain conditions are met.for
-loops, to repeat code many times.- Both are foundational programming tools. 🛠