In [1]:
# Set up packages for lecture. Don't worry about understanding this code,
# but make sure to run it if you're following along.
import numpy as np
import babypandas as bpd
import pandas as pd
from matplotlib_inline.backend_inline import set_matplotlib_formats
import matplotlib.pyplot as plt
set_matplotlib_formats("svg")
plt.style.use('ggplot')

np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option("display.max_rows", 7)
pd.set_option("display.max_columns", 8)
pd.set_option("display.precision", 2)

from IPython.display import display, IFrame

def merging_animation():
    src="https://docs.google.com/presentation/d/e/2PACX-1vSk2FfJ4K_An_CQwcN_Yu5unpJckOZjVQDFqZ78ZTTMmowUsCQKKVnum0_m6TaiGquQ44E3FiS9g2Y4/embed?start=false&loop=false&delayms=60000&rm=minimal"
    width=825
    height=500
    display(IFrame(src, width, height))

Lecture 9 – Grouping on Multiple Columns, Merging¶

DSC 10, Fall 2023¶

Announcements¶

  • Lab 2 is due on Thursday at 11:59PM.
  • Homework 2 is due on Saturday at 11:59PM.
  • Come to office hours for help on assignments or to talk about your quiz performance! The schedule is here.
  • The Midterm Project will be released this weekend.
    • You can work with a partner from any lecture section, but if you do, you must follow these project partner guidelines.
    • We will hold a mixer on Friday from 2-3PM on the HDSI Patio to help you find a partner if you don't have one.

Agenda¶

  • Grouping on multiple columns.
  • Merging.

Grouping on multiple columns¶

DSC 10 student data¶

In [2]:
roster = bpd.read_csv('data/roster-anon.csv')
roster
Out[2]:
name section
0 Kavya Fquroe 10AM
1 Victoria Yppmzx 10AM
2 An-Chi Tmbqlr 8AM
... ... ...
522 Mehri Osrvjq 9AM
523 Noah Byphhr 9AM
524 Emily Hchqii 9AM

525 rows × 2 columns

Recall, last class, we extracted the first name of each student in the class.

In [3]:
def first_name(full_name):
    '''Returns the first name given a full name.'''
    return full_name.split(' ')[0]
In [4]:
roster = roster.assign(
    first=roster.get('name').apply(first_name)
)
roster
Out[4]:
name section first
0 Kavya Fquroe 10AM Kavya
1 Victoria Yppmzx 10AM Victoria
2 An-Chi Tmbqlr 8AM An-Chi
... ... ... ...
522 Mehri Osrvjq 9AM Mehri
523 Noah Byphhr 9AM Noah
524 Emily Hchqii 9AM Emily

525 rows × 3 columns

How many students named Kevin are in each section?¶

We discovered that Kevin is the most popular first name overall.

In [5]:
name_counts = (
    roster
    .groupby('first')
    .count()
    .sort_values('name', ascending=False)
    .get(['name'])
)
name_counts
Out[5]:
name
first
Kevin 7
Daniel 6
Ryan 6
... ...
Hengyu 1
Heeju 1
Zubin 1

438 rows × 1 columns

To find the number of 'Kevin's in each lecture section, we can query for only the rows corresponding to 'Kevin's, and then group by 'section'.

In [6]:
roster[roster.get('first') == 'Kevin'].groupby('section').count()
Out[6]:
name first
section
10AM 1 1
1PM 1 1
8AM 1 1
9AM 4 4

But what if we want to know the number of 'Victoria's and 'Ethan's per section, too?

In [7]:
roster[roster.get('first') == 'Victoria'].groupby('section').count()
Out[7]:
name first
section
10AM 1 1
9AM 1 1
In [8]:
roster[roster.get('first') == 'Ethan'].groupby('section').count()
Out[8]:
name first
section
10AM 3 3

Is there a way to do this for all first names and sections all at once?

How many students with each first name does each lecture section have?¶

  • Right now, we can count the number of students with each first name, by grouping roster by 'first'.
In [9]:
# One row per unique first name.
roster.groupby('first').count().get('name')
Out[9]:
first
Aadit      1
Aaron      1
Aarshia    1
          ..
Zhijie     1
Ziyan      1
Zubin      1
Name: name, Length: 438, dtype: int64
  • We can also count the number of students in each lecture section, by grouping roster by 'section'.
In [10]:
# One row per unique section.
roster.groupby('section').count().get('name')
Out[10]:
section
10AM    150
1PM     150
8AM      75
9AM     150
Name: name, dtype: int64
  • However, neither of the above Series give us the number of students with each first name in each section.
    • For instance, neither result tells me the number of 'Kevin's in the 9AM section or the number of 'Victoria's in the 1PM section.
  • It would be nice if we could group by both 'first' and 'section' – and we can!

Grouping on multiple columns¶

In [11]:
roster
Out[11]:
name section first
0 Kavya Fquroe 10AM Kavya
1 Victoria Yppmzx 10AM Victoria
2 An-Chi Tmbqlr 8AM An-Chi
... ... ... ...
522 Mehri Osrvjq 9AM Mehri
523 Noah Byphhr 9AM Noah
524 Emily Hchqii 9AM Emily

525 rows × 3 columns

We can pass a list of column names to .groupby!

In [12]:
roster.groupby(['section', 'first']).count()
Out[12]:
name
section first
10AM Aadit 1
Aiden 1
Akshay 1
... ... ...
9AM Yun 1
Zhaocheng 1
Ziyan 1

497 rows × 1 columns

The above DataFrame is telling us, for instance, that there is 1 student with the first name 'Aadit' in the 10AM section.

It is not saying that there is only one 'Aadit's in the course overall. There could be more in the other sections.

Grouping on multiple columns¶

  • To group on multiple columns, pass a list of column names to .groupby:


df.groupby(['col_1', 'col_2', ..., 'col_k'])
  • Group by 'col_1' first. Within each group, group by 'col_2', and so on.
  • Important: The resulting DataFrame has one row per unique combination of entries in the specified columns.
    • On the previous slide, we had exactly one row for every combination of 'section' and 'first'.
  • Formally, when we group on multiple columns, we are creating subgroups – that is, groups within groups.
    • On the previous slide, we first grouped by 'section', and within each section, we grouped by 'first'.

Notice the index... 🤔¶

  • This is called a "MultiIndex".
    • The DataFrame is indexed by 'section' and 'first'.
  • We won't worry about the details of MultiIndexes.
  • We can use .reset_index() to "flatten" our DataFrame back to normal.
In [13]:
roster.groupby(['section', 'first']).count().reset_index()
Out[13]:
section first name
0 10AM Aadit 1
1 10AM Aiden 1
2 10AM Akshay 1
... ... ... ...
494 9AM Yun 1
495 9AM Zhaocheng 1
496 9AM Ziyan 1

497 rows × 3 columns

Does order matter?¶

In [14]:
roster.groupby(['section', 'first']).count().reset_index()
Out[14]:
section first name
0 10AM Aadit 1
1 10AM Aiden 1
2 10AM Akshay 1
... ... ... ...
494 9AM Yun 1
495 9AM Zhaocheng 1
496 9AM Ziyan 1

497 rows × 3 columns

In [15]:
roster.groupby(['first', 'section']).count().reset_index()
Out[15]:
first section name
0 Aadit 10AM 1
1 Aaron 8AM 1
2 Aarshia 1PM 1
... ... ... ...
494 Zhijie 1PM 1
495 Ziyan 9AM 1
496 Zubin 10AM 1

497 rows × 3 columns

Answer: Kind of. The order of the rows and columns will be different, but the content will be the same.

Activity¶

Using counts, find the lecture section with the most 'Kevin's.

✅ Click here to see the solution after you've tried it yourself.

kevin_counts = counts[counts.get('first') == 'Kevin']
kevin_counts.sort_values('name', ascending=False).get('section').iloc[0]

In [16]:
counts = roster.groupby(['section', 'first']).count().reset_index()
counts
Out[16]:
section first name
0 10AM Aadit 1
1 10AM Aiden 1
2 10AM Akshay 1
... ... ... ...
494 9AM Yun 1
495 9AM Zhaocheng 1
496 9AM Ziyan 1

497 rows × 3 columns

In [17]:
...
Out[17]:
Ellipsis

Activity¶

Using counts, find the shortest first name in the class that is shared by at least two students in the same section.

✅ Click here to see the solution after you've tried it yourself.

with_len = counts.assign(length=counts.get('first').apply(len))
with_len[with_len.get('name') >= 2].sort_values('length')

# Turns out there are many possible answers: Eric, Ryan, Evan and John.

In [18]:
...
Out[18]:
Ellipsis

Example: Sea temperatures 🌊¶

This dataset contains the sea surface temperature in La Jolla, on many days ranging from August 22, 1916 to March 31, 2023.

In [19]:
sea_temp = bpd.read_csv('data/sea_temp.csv')
sea_temp
Out[19]:
YEAR MONTH DAY SURFACE_TEMP
0 1916 8 22 19.5
1 1916 8 23 19.9
2 1916 8 24 19.7
... ... ... ... ...
37648 2023 3 29 14.4
37649 2023 3 30 14.8
37650 2023 3 31 15.1

37651 rows × 4 columns

Concept Check ✅ – Answer at cc.dsc10.com¶

We want to find the single month (e.g. November 1998) with the highest average 'SURFACE_TEMP'.

Which of the following would help us achieve this goal?

A. sea_temp.groupby('SURFACE_TEMP').mean()

B. sea_temp.groupby('MONTH').mean()

C. sea_temp.groupby(['YEAR', 'MONTH']).mean()

D. sea_temp.groupby(['MONTH', 'DAY']).mean()

E. sea_temp.groupby(['MONTH', 'SURFACE_TEMP']).mean()

In [20]:
...
Out[20]:
Ellipsis

Plots of monthly and yearly average surface temperature 📈¶

In [21]:
(sea_temp
 .groupby('MONTH') 
 .mean() 
 .plot(kind='line', y='SURFACE_TEMP')
);
2023-10-17T14:27:40.626468 image/svg+xml Matplotlib v3.5.2, https://matplotlib.org/
In [22]:
# Why is there a sudden drop at the end? Look at the dates of data collection!
(sea_temp
 .groupby('YEAR') 
 .mean() 
 .plot(kind='line', y='SURFACE_TEMP')
);
2023-10-17T14:27:40.756332 image/svg+xml Matplotlib v3.5.2, https://matplotlib.org/

Summary: Grouping on multiple columns¶

  • Pass a list of columns to .groupby to group on multiple columns. This creates groups within groups.
  • Use .reset_index() after grouping on multiple columns to move the MultiIndex back to the columns.

Merging 🚙¶

In [23]:
phones = bpd.DataFrame().assign(
    Model=['iPhone 13', 'iPhone 13 Pro Max', 'Samsung Galaxy Z Flip', 'Pixel 5a'],
    Price=[799, 1099, 999, 449],
    Screen=[6.1, 6.7, 6.7, 6.3]
)

inventory = bpd.DataFrame().assign(
    Handset=['iPhone 13 Pro Max', 'iPhone 13', 'Pixel 5a', 'iPhone 13'],
    Units=[50, 40, 10, 100],
    Store=['Westfield UTC', 'Westfield UTC', 'Fashion Valley', 'Downtown']
)

Example: Phone sales 📱¶

In [24]:
# Phones on the market right now.
phones
Out[24]:
Model Price Screen
0 iPhone 13 799 6.1
1 iPhone 13 Pro Max 1099 6.7
2 Samsung Galaxy Z Flip 999 6.7
3 Pixel 5a 449 6.3
In [25]:
# Which phones my stores have in stock in the area.
inventory
Out[25]:
Handset Units Store
0 iPhone 13 Pro Max 50 Westfield UTC
1 iPhone 13 40 Westfield UTC
2 Pixel 5a 10 Fashion Valley
3 iPhone 13 100 Downtown
  • Question: If I sell all of the phones in my inventory, how much will I make in revenue?
  • Issue: The information I need to answer the question is spread across multiple DataFrames.

If I sell all of the phones in my inventory, how much will I make in revenue?¶

In [26]:
phones.merge(inventory, left_on='Model', right_on='Handset')
Out[26]:
Model Price Screen Handset Units Store
0 iPhone 13 799 6.1 iPhone 13 40 Westfield UTC
1 iPhone 13 799 6.1 iPhone 13 100 Downtown
2 iPhone 13 Pro Max 1099 6.7 iPhone 13 Pro Max 50 Westfield UTC
3 Pixel 5a 449 6.3 Pixel 5a 10 Fashion Valley

What just happened!? 🤯¶

In [27]:
# Click through the presentation that appears.
merging_animation()

.merge¶

To "merge" two DataFrames:

  • Pick a "left" and "right" DataFrame.
  • Choose a column from each to "merge on".
    left_df.merge(
      right_df, 
      left_on='left_col_name',
      right_on='right_col_name'
    )
    
  • left_on and right_on should be column names (they don't have to be the same).
  • The resulting DataFrame contains a single row for every match between the two columns.
  • Rows in either DataFrame without a match disappear!

If I sell all of the phones in my inventory, how much will I make in revenue?¶

In [28]:
phones
Out[28]:
Model Price Screen
0 iPhone 13 799 6.1
1 iPhone 13 Pro Max 1099 6.7
2 Samsung Galaxy Z Flip 999 6.7
3 Pixel 5a 449 6.3
In [29]:
inventory
Out[29]:
Handset Units Store
0 iPhone 13 Pro Max 50 Westfield UTC
1 iPhone 13 40 Westfield UTC
2 Pixel 5a 10 Fashion Valley
3 iPhone 13 100 Downtown
In [30]:
# Notice there's no Samsung Galaxy Z Flip in phones_merged!
phones_merged = phones.merge(inventory, left_on='Model', right_on='Handset')
phones_merged
Out[30]:
Model Price Screen Handset Units Store
0 iPhone 13 799 6.1 iPhone 13 40 Westfield UTC
1 iPhone 13 799 6.1 iPhone 13 100 Downtown
2 iPhone 13 Pro Max 1099 6.7 iPhone 13 Pro Max 50 Westfield UTC
3 Pixel 5a 449 6.3 Pixel 5a 10 Fashion Valley
In [31]:
(phones_merged.get('Price') * phones_merged.get('Units')).sum()
Out[31]:
171300

Does it matter which DataFrame is the left or right DataFrame? 🤔¶

In [32]:
phones.merge(inventory, left_on='Model', right_on='Handset')
Out[32]:
Model Price Screen Handset Units Store
0 iPhone 13 799 6.1 iPhone 13 40 Westfield UTC
1 iPhone 13 799 6.1 iPhone 13 100 Downtown
2 iPhone 13 Pro Max 1099 6.7 iPhone 13 Pro Max 50 Westfield UTC
3 Pixel 5a 449 6.3 Pixel 5a 10 Fashion Valley
In [33]:
inventory.merge(phones, left_on='Handset', right_on='Model')
Out[33]:
Handset Units Store Model Price Screen
0 iPhone 13 Pro Max 50 Westfield UTC iPhone 13 Pro Max 1099 6.7
1 iPhone 13 40 Westfield UTC iPhone 13 799 6.1
2 iPhone 13 100 Downtown iPhone 13 799 6.1
3 Pixel 5a 10 Fashion Valley Pixel 5a 449 6.3

Answer: The order of the rows and columns will be different, but the content will be the same.

Special cases¶

What if the names of the columns we want to merge on are both the same?¶

Instead of using left_on='col' and right_on='col', you can just say on='col'.

In [34]:
phones
Out[34]:
Model Price Screen
0 iPhone 13 799 6.1
1 iPhone 13 Pro Max 1099 6.7
2 Samsung Galaxy Z Flip 999 6.7
3 Pixel 5a 449 6.3
In [35]:
inventory_relabeled = inventory.assign(Model=inventory.get('Handset')).drop(columns=['Handset'])
inventory_relabeled
Out[35]:
Units Store Model
0 50 Westfield UTC iPhone 13 Pro Max
1 40 Westfield UTC iPhone 13
2 10 Fashion Valley Pixel 5a
3 100 Downtown iPhone 13

In this example, the column we want to merge on in both DataFrames is named 'Model', so we can just use on='Model'.

In [36]:
phones.merge(inventory_relabeled, on='Model')
Out[36]:
Model Price Screen Units Store
0 iPhone 13 799 6.1 40 Westfield UTC
1 iPhone 13 799 6.1 100 Downtown
2 iPhone 13 Pro Max 1099 6.7 50 Westfield UTC
3 Pixel 5a 449 6.3 10 Fashion Valley

Notice: There's only one column containing phone names now.

What if we want to merge using an index instead of a column?¶

Instead of using left_on or right_on, use left_index=True or right_index=True.

In [37]:
phones
Out[37]:
Model Price Screen
0 iPhone 13 799 6.1
1 iPhone 13 Pro Max 1099 6.7
2 Samsung Galaxy Z Flip 999 6.7
3 Pixel 5a 449 6.3
In [38]:
inventory_by_handset = inventory.set_index('Handset')
inventory_by_handset
Out[38]:
Units Store
Handset
iPhone 13 Pro Max 50 Westfield UTC
iPhone 13 40 Westfield UTC
Pixel 5a 10 Fashion Valley
iPhone 13 100 Downtown

In this example, we want to merge using the 'Model' column in the left DataFrame (phones) and the index in the right DataFrame (inventory_by_handset); note that 'Handset' is not a column in the right DataFrame anymore.

In [39]:
phones.merge(inventory_by_handset, left_on='Model', right_index=True)
Out[39]:
Model Price Screen Units Store
0 iPhone 13 799 6.1 40 Westfield UTC
0 iPhone 13 799 6.1 100 Downtown
1 iPhone 13 Pro Max 1099 6.7 50 Westfield UTC
3 Pixel 5a 449 6.3 10 Fashion Valley

Activity setup¶

In [40]:
nice_weather_cities = bpd.DataFrame().assign(
    city=['La Jolla', 'San Diego', 'Austin', 'Los Angeles'],
    state=['California', 'California', 'Texas', 'California'],
    today_high_temp=['79', '83', '87', '87']
    
)

schools = bpd.DataFrame().assign(
    name=['UCSD', 'University of Chicago', 'University of San Diego','Johns Hopkins University', 'UT Austin', 'SDSU', 'UCLA'], 
    city=['La Jolla', 'Chicago', 'San Diego', 'Baltimore', 'Austin', 'San Diego', 'Los Angeles'],
    state=['California', 'Illinois', 'California', 'Maryland', 'Texas', 'California', 'California'],
    graduation_rate=[0.87, 0.94, 0.78, 0.92, 0.81, 0.83, 0.91 ]
)

Concept Check ✅ – Answer at cc.dsc10.com¶

Without writing code, how many rows are in nice_weather_cities.merge(schools, on='city')?

A. 4    B. 5    C. 6    D. 7    E. 8
In [41]:
nice_weather_cities
Out[41]:
city state today_high_temp
0 La Jolla California 79
1 San Diego California 83
2 Austin Texas 87
3 Los Angeles California 87
In [42]:
schools
Out[42]:
name city state graduation_rate
0 UCSD La Jolla California 0.87
1 University of Chicago Chicago Illinois 0.94
2 University of San Diego San Diego California 0.78
3 Johns Hopkins University Baltimore Maryland 0.92
4 UT Austin Austin Texas 0.81
5 SDSU San Diego California 0.83
6 UCLA Los Angeles California 0.91

Followup activity¶

Without writing code, how many rows are in nice_weather_cities.merge(schools, on='state')?

In [43]:
nice_weather_cities
Out[43]:
city state today_high_temp
0 La Jolla California 79
1 San Diego California 83
2 Austin Texas 87
3 Los Angeles California 87
In [44]:
schools
Out[44]:
name city state graduation_rate
0 UCSD La Jolla California 0.87
1 University of Chicago Chicago Illinois 0.94
2 University of San Diego San Diego California 0.78
3 Johns Hopkins University Baltimore Maryland 0.92
4 UT Austin Austin Texas 0.81
5 SDSU San Diego California 0.83
6 UCLA Los Angeles California 0.91
In [45]:
nice_weather_cities.merge(schools, on='state')
Out[45]:
city_x state today_high_temp name city_y graduation_rate
0 La Jolla California 79 UCSD La Jolla 0.87
1 La Jolla California 79 University of San Diego San Diego 0.78
2 La Jolla California 79 SDSU San Diego 0.83
... ... ... ... ... ... ...
10 Los Angeles California 87 SDSU San Diego 0.83
11 Los Angeles California 87 UCLA Los Angeles 0.91
12 Austin Texas 87 UT Austin Austin 0.81

13 rows × 6 columns

In [46]:
nice_weather_cities.merge(schools, on='state').shape[0]
Out[46]:
13

More practice!¶

Here are related exam problems to help you practice merging:

  • Problem 5 from the Fall 2021 midterm.
  • Problem 7 from the Fall 2022 midterm.

Summary, next time¶

Summary¶

  • To group on multiple columns, pass a list to .groupby.
    • The result has one row for every unique combination of elements in the specified columns.
  • To combine information from multiple DataFrames, use .merge.
    • When using .merge, Python searches for a match between a specified column in each DataFrame and combines the rows with a match.
    • If there are no matches, the row disappears!

Next time¶

  • if-statements, to execute code only when certain conditions are met.
  • for-loops, to repeat code many times.
  • Both are foundational programming tools. 🛠