In [1]:
# Run this cell to set up packages for lecture.
from lec05_imports import *

Lecture 5 – Querying and Grouping¶

DSC 10, Spring 2024¶

Announcements¶

  • Quiz 1 is coming up on Friday in your assigned quiz session.
    • You should have been assigned a quiz time in an email with subject "DSC 10 Quiz Time".
    • This will be a 20 minute paper-based quiz consisting of short answer and multiple choice questions.
    • No aids are allowed (no notes, no calculators, no computers, no reference sheet). Questions are designed with this in mind.
    • The quiz covers Lectures 1 through 4, or BPD 1-9 in the babypandas notes.
      • This includes "Example 4: What is the population density of Pennsylvania?" which is in the Lecture 4 slides, though we'll cover it today.
      • Attend discussion section today to work on practice problems.
      • Homework 1 is due Tuesday, but working on the homework is excellent preparation for the quiz.
  • Come to office hours (see the schedule here) and post on Ed for help!

Agenda¶

  • Recap: accessing using row labels.
  • Querying.
  • Querying with multiple conditions.
  • Grouping.
  • After class: challenge problems.

Don't forget about these resources!¶

  • DSC 10 Reference Sheet 📌.
  • babypandas notes.
  • babypandas documentation.
  • The Resources tab of the course website.

You belong here! 🤝¶

  • We're moving very quickly in this class.
  • This may be the first time you're ever writing code, and you may question whether or not you belong in this class, or if data science is for you.
  • We promise, no matter what your prior experience is, the answer is yes, you belong!
    • Watch: 🎥 Developing a Growth Mindset with Carol Dweck.
  • Please come to office hours (see the schedule here) and post on Ed for help – we're here to make sure you succeed in this course.

The data: US states 🗽¶

We'll continue working with the same data from last time.

In [2]:
states = bpd.read_csv('data/states.csv')
states = states.assign(Density=states.get('Population') / states.get('Land Area'))
states
Out[2]:
State Region Capital City Population Land Area Party Density
0 Alabama South Montgomery 5024279 50645 Republican 99.21
1 Alaska West Juneau 733391 570641 Republican 1.29
2 Arizona West Phoenix 7151502 113594 Republican 62.96
... ... ... ... ... ... ... ...
47 West Virginia South Charleston 1793716 24038 Republican 74.62
48 Wisconsin Midwest Madison 5893718 54158 Republican 108.82
49 Wyoming West Cheyenne 576851 97093 Republican 5.94

50 rows × 7 columns

Example 4: What is the population density of Pennsylvania?¶

Key concepts: Setting the index. Accessing using row labels.

Population density of Pennsylvania¶

We know how to get the 'Density' of all states. How do we find the one that corresponds to Pennsylvania?

In [3]:
states
Out[3]:
State Region Capital City Population Land Area Party Density
0 Alabama South Montgomery 5024279 50645 Republican 99.21
1 Alaska West Juneau 733391 570641 Republican 1.29
2 Arizona West Phoenix 7151502 113594 Republican 62.96
... ... ... ... ... ... ... ...
47 West Virginia South Charleston 1793716 24038 Republican 74.62
48 Wisconsin Midwest Madison 5893718 54158 Republican 108.82
49 Wyoming West Cheyenne 576851 97093 Republican 5.94

50 rows × 7 columns

In [4]:
# Which one is Pennsylvania?
states.get('Density')
Out[4]:
0      99.21
1       1.29
2      62.96
       ...  
47     74.62
48    108.82
49      5.94
Name: Density, Length: 50, dtype: float64

Utilizing the index¶

  • When we load in a DataFrame from a CSV, columns have meaningful names, but rows do not.
In [5]:
bpd.read_csv('data/states.csv')
Out[5]:
State Region Capital City Population Land Area Party
0 Alabama South Montgomery 5024279 50645 Republican
1 Alaska West Juneau 733391 570641 Republican
2 Arizona West Phoenix 7151502 113594 Republican
... ... ... ... ... ... ...
47 West Virginia South Charleston 1793716 24038 Republican
48 Wisconsin Midwest Madison 5893718 54158 Republican
49 Wyoming West Cheyenne 576851 97093 Republican

50 rows × 6 columns

  • The row labels (or the index) are how we refer to specific rows. Instead of using numbers, let's refer to these rows by the names of the states they correspond to.
  • This way, we can easily identify, for example, which row corresponds to Pennsylvania.

Setting the index¶

  • To change the index, use .set_index(column_name).
  • Row labels should be unique identifiers.
    • Each row should has a different, descriptive name that corresponds to the contents of that row's data.
In [6]:
states
Out[6]:
State Region Capital City Population Land Area Party Density
0 Alabama South Montgomery 5024279 50645 Republican 99.21
1 Alaska West Juneau 733391 570641 Republican 1.29
2 Arizona West Phoenix 7151502 113594 Republican 62.96
... ... ... ... ... ... ... ...
47 West Virginia South Charleston 1793716 24038 Republican 74.62
48 Wisconsin Midwest Madison 5893718 54158 Republican 108.82
49 Wyoming West Cheyenne 576851 97093 Republican 5.94

50 rows × 7 columns

In [7]:
states.set_index('State')
Out[7]:
Region Capital City Population Land Area Party Density
State
Alabama South Montgomery 5024279 50645 Republican 99.21
Alaska West Juneau 733391 570641 Republican 1.29
Arizona West Phoenix 7151502 113594 Republican 62.96
... ... ... ... ... ... ...
West Virginia South Charleston 1793716 24038 Republican 74.62
Wisconsin Midwest Madison 5893718 54158 Republican 108.82
Wyoming West Cheyenne 576851 97093 Republican 5.94

50 rows × 6 columns

  • Now there is one fewer column. When you set the index, a column becomes the index, and the old index disappears.
  • 🚨 Like most DataFrame methods, .set_index returns a new DataFrame; it does not modify the original DataFrame.
In [8]:
states
Out[8]:
State Region Capital City Population Land Area Party Density
0 Alabama South Montgomery 5024279 50645 Republican 99.21
1 Alaska West Juneau 733391 570641 Republican 1.29
2 Arizona West Phoenix 7151502 113594 Republican 62.96
... ... ... ... ... ... ... ...
47 West Virginia South Charleston 1793716 24038 Republican 74.62
48 Wisconsin Midwest Madison 5893718 54158 Republican 108.82
49 Wyoming West Cheyenne 576851 97093 Republican 5.94

50 rows × 7 columns

In [9]:
states = states.set_index('State')
states
Out[9]:
Region Capital City Population Land Area Party Density
State
Alabama South Montgomery 5024279 50645 Republican 99.21
Alaska West Juneau 733391 570641 Republican 1.29
Arizona West Phoenix 7151502 113594 Republican 62.96
... ... ... ... ... ... ...
West Virginia South Charleston 1793716 24038 Republican 74.62
Wisconsin Midwest Madison 5893718 54158 Republican 108.82
Wyoming West Cheyenne 576851 97093 Republican 5.94

50 rows × 6 columns

In [10]:
# Which one is Pennsylvania? The one whose row label is "Pennsylvania"!
states.get('Density')
Out[10]:
State
Alabama           99.21
Alaska             1.29
Arizona           62.96
                  ...  
West Virginia     74.62
Wisconsin        108.82
Wyoming            5.94
Name: Density, Length: 50, dtype: float64

Accessing using the row label¶

To pull out one particular entry of a DataFrame corresponding to a row and column with certain labels:

  1. Use .get(column_name) to extract the entire column as a Series.
  2. Use .loc[] to access the element of a Series with a particular row label.

In this class, we'll always first access a column, then a row (but row, then column is also possible).

In [11]:
states.get('Density')
Out[11]:
State
Alabama           99.21
Alaska             1.29
Arizona           62.96
                  ...  
West Virginia     74.62
Wisconsin        108.82
Wyoming            5.94
Name: Density, Length: 50, dtype: float64
In [12]:
states.get('Density').loc['Pennsylvania']
Out[12]:
290.60858681804973

Summary: Accessing elements of a DataFrame¶

  • First, .get the appropriate column as a Series.
  • Then, use one of two ways to access an element of a Series:
    • .iloc[] uses the integer position.
    • .loc[] uses the row label.
    • Each is best for different scenarios.
In [13]:
states.get('Density')
Out[13]:
State
Alabama           99.21
Alaska             1.29
Arizona           62.96
                  ...  
West Virginia     74.62
Wisconsin        108.82
Wyoming            5.94
Name: Density, Length: 50, dtype: float64
In [14]:
states.get('Density').iloc[2]
Out[14]:
62.956687853231685
In [15]:
states.get('Density').loc['Arizona']
Out[15]:
62.956687853231685

Note¶

  • Sometimes the integer position and row label are the same.
  • This happens by default with bpd.read_csv.
In [16]:
bpd.read_csv('data/states.csv')
Out[16]:
State Region Capital City Population Land Area Party
0 Alabama South Montgomery 5024279 50645 Republican
1 Alaska West Juneau 733391 570641 Republican
2 Arizona West Phoenix 7151502 113594 Republican
... ... ... ... ... ... ...
47 West Virginia South Charleston 1793716 24038 Republican
48 Wisconsin Midwest Madison 5893718 54158 Republican
49 Wyoming West Cheyenne 576851 97093 Republican

50 rows × 6 columns

In [17]:
bpd.read_csv('data/states.csv').get('Capital City').loc[35]
Out[17]:
'Oklahoma City'
In [18]:
bpd.read_csv('data/states.csv').get('Capital City').iloc[35]
Out[18]:
'Oklahoma City'

Quiz 1 material ends here!¶

Example 5: Which states are in the West?¶

Key concept: Querying.

The problem¶

We want to create a DataFrame consisting of only the states whose 'Region' is 'West'. How do we do that?

The solution¶

In [19]:
# This DataFrame only contains rows where the 'Region' is 'West'!
only_west = states[states.get('Region') == 'West']
only_west
Out[19]:
Region Capital City Population Land Area Party Density
State
Alaska West Juneau 733391 570641 Republican 1.29
Arizona West Phoenix 7151502 113594 Republican 62.96
California West Sacramento 39538223 155779 Democratic 253.81
... ... ... ... ... ... ...
Utah West Salt Lake City 3271616 82170 Republican 39.82
Washington West Olympia 7705281 66456 Democratic 115.95
Wyoming West Cheyenne 576851 97093 Republican 5.94

13 rows × 6 columns

🤯 What just happened?

Aside: Booleans¶

  • When we compare two values, the result is either True or False.
    • Notice, these words are not in quotes.
  • bool is a data type in Python, just like int, float, and str.
    • It stands for "Boolean", named after George Boole, an early mathematician.
  • There are only two possible Boolean values: True or False.
    • Yes or no.
    • On or off.
    • 1 or 0.
In [20]:
5 == 6
Out[20]:
False
In [21]:
type(5 == 6)
Out[21]:
bool
In [22]:
9 + 10 < 21
Out[22]:
True

Comparison operators¶

There are several types of comparisons we can make.

symbol meaning
== equal to
!= not equal to
< less than
<= less than or equal to
> greater than
>= greater than or equal to

When comparing an entire Series to a single value, the result is a Series of bools (via broadcasting).

In [23]:
states
Out[23]:
Region Capital City Population Land Area Party Density
State
Alabama South Montgomery 5024279 50645 Republican 99.21
Alaska West Juneau 733391 570641 Republican 1.29
Arizona West Phoenix 7151502 113594 Republican 62.96
... ... ... ... ... ... ...
West Virginia South Charleston 1793716 24038 Republican 74.62
Wisconsin Midwest Madison 5893718 54158 Republican 108.82
Wyoming West Cheyenne 576851 97093 Republican 5.94

50 rows × 6 columns

In [24]:
states.get('Region') == 'West'
Out[24]:
State
Alabama          False
Alaska            True
Arizona           True
                 ...  
West Virginia    False
Wisconsin        False
Wyoming           True
Name: Region, Length: 50, dtype: bool

What is a query? 🤔¶

  • A query is code that extracts rows from a DataFrame for which certain condition(s) are true.
  • We use queries to filter DataFrames to contain only the rows that satisfy given conditions.

How do we query a DataFrame?¶

To select only certain rows of states:

  1. Make a sequence (list/array/Series) of Trues (keep) and Falses (toss), usually by making a comparison.
  2. Then pass it into states[sequence_goes_here].
In [25]:
states[states.get('Region') == 'West']
Out[25]:
Region Capital City Population Land Area Party Density
State
Alaska West Juneau 733391 570641 Republican 1.29
Arizona West Phoenix 7151502 113594 Republican 62.96
California West Sacramento 39538223 155779 Democratic 253.81
... ... ... ... ... ... ...
Utah West Salt Lake City 3271616 82170 Republican 39.82
Washington West Olympia 7705281 66456 Democratic 115.95
Wyoming West Cheyenne 576851 97093 Republican 5.94

13 rows × 6 columns

What if the condition isn't satisfied?¶

In [26]:
states[states.get('Region') == 'Pacific Northwest']
Out[26]:
Region Capital City Population Land Area Party Density
State

Example 6: What proportion of US states are Republican?¶

Key concept: Shape of a DataFrame.

Strategy¶
  1. Query to extract a DataFrame of just the states where the 'Party' is 'Republican'.
  2. Count the number of such states.
  3. Divide by the total number of states.
In [27]:
only_rep = states[states.get('Party') == 'Republican']
only_rep
Out[27]:
Region Capital City Population Land Area Party Density
State
Alabama South Montgomery 5024279 50645 Republican 99.21
Alaska West Juneau 733391 570641 Republican 1.29
Arizona West Phoenix 7151502 113594 Republican 62.96
... ... ... ... ... ... ...
West Virginia South Charleston 1793716 24038 Republican 74.62
Wisconsin Midwest Madison 5893718 54158 Republican 108.82
Wyoming West Cheyenne 576851 97093 Republican 5.94

31 rows × 6 columns

Shape of a DataFrame¶

  • .shape returns the number of rows and columns in a given DataFrame.
    • .shape is not a method, so we don't use parentheses.
    • .shape is an attribute, as it describes the DataFrame.
  • Access each with []:
    • .shape[0] for rows.
    • .shape[1] for columns.
In [28]:
only_rep.shape
Out[28]:
(31, 6)
In [29]:
# Number of rows.
only_rep.shape[0]
Out[29]:
31
In [30]:
# Number of columns.
only_rep.shape[1]
Out[30]:
6
In [31]:
# What proportion of US states are Republican?
only_rep.shape[0] / states.shape[0]
Out[31]:
0.62

Example 7: Which Midwestern state has the most land area?¶

Key concepts: Working with the index. Combining multiple steps.

Strategy¶
  1. Query to extract a DataFrame of just the states in the 'Midwest'.
  2. Sort by 'Land Area' in descending order.
  3. Extract the first element from the index.
In [32]:
midwest = states[states.get('Region') == 'Midwest']
midwest
Out[32]:
Region Capital City Population Land Area Party Density
State
Illinois Midwest Springfield 12812508 55519 Democratic 230.78
Indiana Midwest Indianapolis 6785528 35826 Republican 189.40
Iowa Midwest Des Moines 3190369 55857 Republican 57.12
... ... ... ... ... ... ...
Ohio Midwest Columbus 11799448 40861 Republican 288.77
South Dakota Midwest Pierre 886667 75811 Republican 11.70
Wisconsin Midwest Madison 5893718 54158 Republican 108.82

12 rows × 6 columns

In [33]:
midwest_sorted = midwest.sort_values(by='Land Area', ascending=False)
midwest_sorted
Out[33]:
Region Capital City Population Land Area Party Density
State
Kansas Midwest Topeka 2937880 81759 Republican 35.93
Minnesota Midwest Saint Paul 5706494 79627 Democratic 71.67
Nebraska Midwest Lincoln 1961504 76824 Republican 25.53
... ... ... ... ... ... ...
Wisconsin Midwest Madison 5893718 54158 Republican 108.82
Ohio Midwest Columbus 11799448 40861 Republican 288.77
Indiana Midwest Indianapolis 6785528 35826 Republican 189.40

12 rows × 6 columns

  • The answer is Kansas, but how do we get it in code?
In [34]:
midwest_sorted.get('State').iloc[0]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/var/folders/28/vs8cp38n1r1520g8bhzr4v5h0000gn/T/ipykernel_26705/3899766623.py in <module>
----> 1 midwest_sorted.get('State').iloc[0]

~/opt/anaconda3/lib/python3.9/site-packages/babypandas/utils.py in wrapper(*args, **kwargs)
     18         with warnings.catch_warnings():
     19             warnings.simplefilter("ignore")
---> 20             return func(*args, **kwargs)
     21 
     22     return wrapper

~/opt/anaconda3/lib/python3.9/site-packages/babypandas/bpd.py in get(self, key)
    325         if any(mask):
    326             k = [key] if isinstance(key, str) else key
--> 327             raise KeyError("{} not found in columns".format(np.array(k)[mask]))
    328 
    329         f = _lift_to_pd(self._pd.get)

KeyError: "['State'] not found in columns"

Working with the index¶

  • We can't use .get because .get is only for columns, and there is no column called 'State'.
    • Instead, 'State' is the index of the DataFrame.
  • To extract the index of a DataFrame, use .index.
    • Like .shape, this is an attribute of the DataFrame, not a method. Don't use parentheses.
  • Access particular elements in the index with [].
In [ ]:
midwest_sorted.index
In [35]:
midwest_sorted.index[0]
Out[35]:
'Kansas'

Combining multiple steps¶

  • It is not necessary to define the intermediate variables midwest and midwest_sorted. We can do everything in one line of code.

  • When solving a multi-step problem, develop your solution incrementally. Write one piece of code at a time and run it.

In [36]:
# Final solution, which you should build up one step at a time.
states[states.get('Region') == 'Midwest'].sort_values(by='Land Area', ascending=False).index[0]
Out[36]:
'Kansas'
  • If a line of code gets too long, enclose it in parentheses to split it over multiple lines.
In [37]:
# You can space your code out like this if needed.
(
    states[states.get('Region') == 'Midwest']
    .sort_values(by='Land Area', ascending=False)
    .index[0]
)
Out[37]:
'Kansas'

Concept Check ✅ – Answer at cc.dsc10.com¶

Which expression below evaluates to the total population of the 'West'?

A. states[states.get('Region') == 'West'].get('Population').sum()

B. states.get('Population').sum()[states.get('Region') == 'West']

C. states['West'].get('Population').sum()

D. More than one of the above.

In [38]:
...
Out[38]:
Ellipsis

Example 8: What are the top three most-populated Republican states in the South?¶

Key concepts: Queries with multiple conditions. Selecting rows by position.

Multiple conditions¶

  • To write a query with multiple conditions, use & for "and" and | for "or".
    • &: All conditions must be true.
    • |: At least one condition must be true.
  • You must use (parentheses) around each condition!
  • 🚨 Don't use the Python keywords and and or here! They do not behave as you'd want.
    • See BPD 10.3 for an explanation.
In [39]:
states[(states.get('Party') == 'Republican') & (states.get('Region') == 'South')]
Out[39]:
Region Capital City Population Land Area Party Density
State
Alabama South Montgomery 5024279 50645 Republican 99.21
Arkansas South Little Rock 3011524 52035 Republican 57.87
Florida South Tallahassee 21538187 53625 Republican 401.64
... ... ... ... ... ... ...
Tennessee South Nashville 6910840 41235 Republican 167.60
Texas South Austin 29145505 261232 Republican 111.57
West Virginia South Charleston 1793716 24038 Republican 74.62

13 rows × 6 columns

In [40]:
# You can also add line breaks within brackets.
states[(states.get('Party') == 'Republican') & 
       (states.get('Region') == 'South')]
Out[40]:
Region Capital City Population Land Area Party Density
State
Alabama South Montgomery 5024279 50645 Republican 99.21
Arkansas South Little Rock 3011524 52035 Republican 57.87
Florida South Tallahassee 21538187 53625 Republican 401.64
... ... ... ... ... ... ...
Tennessee South Nashville 6910840 41235 Republican 167.60
Texas South Austin 29145505 261232 Republican 111.57
West Virginia South Charleston 1793716 24038 Republican 74.62

13 rows × 6 columns

The & and | operators work element-wise!¶

In [41]:
(states.get('Party') == 'Republican')
Out[41]:
State
Alabama          True
Alaska           True
Arizona          True
                 ... 
West Virginia    True
Wisconsin        True
Wyoming          True
Name: Party, Length: 50, dtype: bool
In [42]:
(states.get('Region') == 'South')
Out[42]:
State
Alabama           True
Alaska           False
Arizona          False
                 ...  
West Virginia     True
Wisconsin        False
Wyoming          False
Name: Region, Length: 50, dtype: bool
In [43]:
(states.get('Party') == 'Republican') & (states.get('Region') == 'South')
Out[43]:
State
Alabama           True
Alaska           False
Arizona          False
                 ...  
West Virginia     True
Wisconsin        False
Wyoming          False
Length: 50, dtype: bool

Original Question: What are the top three most-populated Republican states in the South?¶

In [44]:
(
    states[(states.get('Party') == 'Republican') & 
       (states.get('Region') == 'South')]
    .sort_values(by='Population', ascending=False)
)
Out[44]:
Region Capital City Population Land Area Party Density
State
Texas South Austin 29145505 261232 Republican 111.57
Florida South Tallahassee 21538187 53625 Republican 401.64
Georgia South Atlanta 10711908 57513 Republican 186.25
... ... ... ... ... ... ...
Arkansas South Little Rock 3011524 52035 Republican 57.87
Mississippi South Jackson 2961279 46923 Republican 63.11
West Virginia South Charleston 1793716 24038 Republican 74.62

13 rows × 6 columns

How do we extract the first three rows of this DataFrame?

Using .take to select rows by position¶

  • Querying allows us to select rows that satisfy a certain condition.
  • We can also select rows in specific positions with .take(seqence_of_integer_positions). This keeps only the rows whose positions are in the specified sequence (list/array).
    • This is analogous to using .iloc[] on a Series.
    • It's rare to need to select rows by integer position. Querying is far more useful.
In [45]:
(
    states[(states.get('Party') == 'Republican') & 
       (states.get('Region')=='South')]
    .sort_values(by='Population', ascending=False)
    .take([0, 1, 2])
)
Out[45]:
Region Capital City Population Land Area Party Density
State
Texas South Austin 29145505 261232 Republican 111.57
Florida South Tallahassee 21538187 53625 Republican 401.64
Georgia South Atlanta 10711908 57513 Republican 186.25
  • .take(np.arange(3)) could equivalently be used in place of .take([0, 1, 2]).

Extra Practice¶

Write code to answer each question below.

  1. What is the capital city of the state in the 'West' with the largest land area?
  2. How many states in the 'Northeast' have more land area than an average US state?
  3. What is the total population of the 'Midwest', 'South', and 'Northeast?
✅ Click here to see the answers after you've attempted the problems on your own. 1. What is the capital city of the state in the West with the largest land area?
states[states.get('Region') == 'West'].sort_values(by='Land Area', ascending=False).get('Capital City').iloc[0]
2. How many states in the Northeast have more land area than an average US state?
states[(states.get('Region') == 'Northeast') & 
       (states.get('Land Area') > states.get('Land Area').mean())].shape[0]
3. What is the total population of the Midwest, South, and Northeast?
states[(states.get('Region') == 'Midwest') | 
       (states.get('Region') == 'South') | 
       (states.get('Region') == 'Northeast')].get('Population').sum()
     Alternate solution to 3:
states.get('Population').sum() - states[states.get('Region') == 'West'].get('Population').sum()
In [46]:
...
Out[46]:
Ellipsis

Example 9: Which region is most populated?¶

Key concept: Grouping by one column.

Organizing states by region¶

We can find the total population of any one region using the tools we already have.

In [47]:
states[states.get('Region') == 'West'].get('Population').sum()
Out[47]:
78588572
In [48]:
states[states.get('Region') == 'Midwest'].get('Population').sum()
Out[48]:
68985454

But can we find the total population of every region all at the same time, without writing very similar code multiple times? Yes, there is a better way!

A new method: .groupby¶

Observe what happens when we use the .groupby method on states with the argument 'Region'.

In [49]:
states.groupby('Region').sum()
Out[49]:
Population Land Area Density
Region
Midwest 68985454 750524 1298.78
Northeast 57609148 161912 4957.49
South 125576562 868356 3189.37
West 78588572 1751054 881.62

These populations (for the 'West' and 'Midwest') match the ones we found on the previous slide, except now we get the populations for all regions at the same time. What just happened? 🤯

An illustrative example: Pets 🐱 🐶🐹¶

Consider the DataFrame pets, shown below.

Species Color Weight Age
0 dog black 40 5.0
1 cat golden 15 8.0
2 cat black 20 9.0
3 dog white 80 2.0
4 dog golden 25 0.5
5 hamster golden 1 3.0

Let's see what happens under the hood when we run pets.groupby('Species').mean().

In [50]:
show_grouping_animation()

Let's try it out!¶

In [51]:
pets = bpd.DataFrame().assign(
    Species=['dog', 'cat', 'cat', 'dog', 'dog', 'hamster'],
    Color=['black', 'golden', 'black', 'white', 'golden', 'golden'],
    Weight=[40, 15, 20, 80, 25, 1],
    Age=[5, 8, 9, 2, 0.5, 3]
)
pets
Out[51]:
Species Color Weight Age
0 dog black 40 5.0
1 cat golden 15 8.0
2 cat black 20 9.0
3 dog white 80 2.0
4 dog golden 25 0.5
5 hamster golden 1 3.0
In [52]:
pets.groupby('Species').mean()
Out[52]:
Weight Age
Species
cat 17.50 8.5
dog 48.33 2.5
hamster 1.00 3.0

It takes several steps to go from the original pets DataFrame to this grouped DataFrame, but we don't get to see any of Python's inner workings, just the final output.

Back to states: which region is most populated?¶

In [53]:
states
Out[53]:
Region Capital City Population Land Area Party Density
State
Alabama South Montgomery 5024279 50645 Republican 99.21
Alaska West Juneau 733391 570641 Republican 1.29
Arizona West Phoenix 7151502 113594 Republican 62.96
... ... ... ... ... ... ...
West Virginia South Charleston 1793716 24038 Republican 74.62
Wisconsin Midwest Madison 5893718 54158 Republican 108.82
Wyoming West Cheyenne 576851 97093 Republican 5.94

50 rows × 6 columns

In [54]:
states.groupby('Region').sum()
Out[54]:
Population Land Area Density
Region
Midwest 68985454 750524 1298.78
Northeast 57609148 161912 4957.49
South 125576562 868356 3189.37
West 78588572 1751054 881.62
In [55]:
# Note the use of .index – remember, the index isn't a column!
(
    states
    .groupby('Region')
    .sum()
    .sort_values(by='Population', ascending=False)
    .index[0]
)
Out[55]:
'South'

Using .groupby in general¶

In short, .groupby aggregates (collects) all rows with the same value in a specified column (e.g. 'Region') into a single row in the resulting DataFrame, using an aggregation method (e.g. .sum()) to combine values from different rows with the same value in the specified column.

To use .groupby:

  1. Choose a column to group by.
    • .groupby(column_name) will gather rows which have the same value in the specified column (column_name).
    • In the resulting DataFrame, there will be one row for every unique value in that column.
  1. Choose an aggregation method.
    • The aggregation method will be applied within each group.
    • The aggregation method is applied individually to each column.
      • If it doesn't make sense to use the aggregation method on a column, the column is dropped from the output.
    • Common aggregation methods include .count(), .sum(), .mean(), .median(), .max(), and .min().

Observations on grouping¶

  1. After grouping, the index changes. The new row labels are the group labels (i.e., the unique values in the column that we grouped on), sorted in ascending order.
In [56]:
states
Out[56]:
Region Capital City Population Land Area Party Density
State
Alabama South Montgomery 5024279 50645 Republican 99.21
Alaska West Juneau 733391 570641 Republican 1.29
Arizona West Phoenix 7151502 113594 Republican 62.96
... ... ... ... ... ... ...
West Virginia South Charleston 1793716 24038 Republican 74.62
Wisconsin Midwest Madison 5893718 54158 Republican 108.82
Wyoming West Cheyenne 576851 97093 Republican 5.94

50 rows × 6 columns

In [57]:
states.groupby('Region').sum()
Out[57]:
Population Land Area Density
Region
Midwest 68985454 750524 1298.78
Northeast 57609148 161912 4957.49
South 125576562 868356 3189.37
West 78588572 1751054 881.62

Tip: look for keywords "per," "for each," and "indexed by" when solving problems.

  1. The aggregation method is applied separately to each column. If it does not make sense to apply the aggregation method to a certain column, the column will disappear. 🐇🎩
  1. Since the aggregation method is applied to each column separately, the rows of the resulting DataFrame need to be interpreted with care.
In [58]:
states.groupby('Region').max()
Out[58]:
Capital City Population Land Area Party Density
Region
Midwest Topeka 12812508 81759 Republican 288.77
Northeast Trenton 20201249 47126 Republican 1263.12
South Tallahassee 29145505 261232 Republican 636.37
West Santa Fe 39538223 570641 Republican 253.81
In [59]:
12812508 / 81759 == 288.77
Out[59]:
False
  1. The column names don't make sense after grouping with the .count() aggregation method.
In [60]:
states.groupby('Region').count()
Out[60]:
Capital City Population Land Area Party Density
Region
Midwest 12 12 12 12 12
Northeast 9 9 9 9 9
South 16 16 16 16 16
West 13 13 13 13 13

Dropping, renaming, and reordering columns¶

Consider dropping unneeded columns and renaming columns as follows:

  1. Use .assign to create a new column containing the same values as the old column(s).
  2. Use .drop(columns=list_of_column_labels) to drop the old column(s).
    • Alternatively, use .get(list_of_column_labels) to keep only the columns in the given list. The columns will appear in the order you specify, so this is also useful for reordering columns!
In [61]:
states_by_region = states.groupby('Region').count()
states_by_region = states_by_region.assign(
                    States=states_by_region.get('Capital City')
                    ).get(['States'])
states_by_region
Out[61]:
States
Region
Midwest 12
Northeast 9
South 16
West 13

Challenge problems: IMDb dataset 🎞️¶

Extra practice¶

We won't cover this section in class. Instead, it's here for you to practice with some harder examples.

The video below walks through the solutions (it's also linked here). You can also see the solutions by clicking the "✅ Click here to see the answer." button below each question.

In [62]:
from IPython.display import YouTubeVideo
YouTubeVideo('xg7rnjWnZ48')
Out[62]:

Before watching the video or looking at the solutions, make sure to try these problems on your own – they're great prep for homeworks, projects, and exams! Feel free to ask about them in office hours or on Ed.

In [63]:
imdb = bpd.read_csv('data/imdb.csv').set_index('Title').sort_values(by='Rating')
imdb
Out[63]:
Votes Rating Year Decade
Title
Akira 91652 8.0 1988 1980
Per un pugno di dollari 124671 8.0 1964 1960
Guardians of the Galaxy 527349 8.0 2014 2010
... ... ... ... ...
The Godfather: Part II 692753 9.0 1974 1970
The Shawshank Redemption 1498733 9.2 1994 1990
The Godfather 1027398 9.2 1972 1970

250 rows × 4 columns

Question: How many movies appear from each decade?¶

In [64]:
imdb.groupby('Decade').count()
Out[64]:
Votes Rating Year
Decade
1920 4 4 4
1930 7 7 7
1940 14 14 14
... ... ... ...
1990 42 42 42
2000 50 50 50
2010 29 29 29

10 rows × 3 columns

In [65]:
# We'll learn how to make plots like this in the next lecture!
imdb.groupby('Decade').count().plot(y='Year');

Question: What was the highest rated movie of the 1990s?¶

Let's try to do this two different ways.

Without grouping¶

In [66]:
imdb[imdb.get('Decade') == 1990].sort_values('Rating', ascending=False).index[0]
Out[66]:
'The Shawshank Redemption'

Note: The command to extract the index of a DataFrame is .index - no parentheses! This is different than the way we extract columns, with .get(), because the index is not a column.

With grouping¶

In [67]:
imdb.reset_index().groupby('Decade').max()
Out[67]:
Title Votes Rating Year
Decade
1920 The Kid 98794 8.3 1927
1930 The Wizard of Oz 259235 8.5 1939
1940 The Treasure of the Sierra Madre 350551 8.6 1949
... ... ... ... ...
1990 Unforgiven 1498733 9.2 1999
2000 Yip Man 1473049 8.9 2009
2010 X-Men: Days of Future Past 1271949 8.7 2015

10 rows × 4 columns

  • It turns out that this method does not yield the correct answer.
  • When we use an aggregation method (e.g. .max()), aggregation is done to each column individually.
  • While it's true that the highest rated movie from the 1990s has a rating of 9.2, that movie is not Unforgiven – instead, Unforgiven is the movie that's the latest in the alphabet among all movies from the 1990s.
  • Taking the max is not helpful here.

Question: How many years have more than 3 movies rated above 8.5?¶

In [ ]:
 
✅ Click here to see the answer.
good_movies_per_year = imdb[imdb.get('Rating') > 8.5].groupby('Year').count()
good_movies_per_year[good_movies_per_year.get('Votes') > 3].shape[0]    
As mentioned below, you can also use:
(good_movies_per_year.get('Votes') > 3).sum() 

Aside: Using .sum() on a boolean array¶

  • Summing a boolean array gives a count of the number of True elements because Python treats True as 1 and False as 0.
  • Can you use that fact here?
In [ ]:
 

Question: Out of the years with more than 3 movies, which had the highest average rating?¶

In [ ]:
 
✅ Click here to see the answer.
# A Series of Trues and Falses; True when there were at least 3 movies on the list from that year
more_than_3_ix = imdb.groupby('Year').count().get('Votes') > 3

# The year with the highest average rating, among the years with more than 3 movies
imdb.groupby('Year').mean()[more_than_3_ix].sort_values(by='Rating').index[-1] 

Question: Which year had the longest movie titles, on average?¶

Hint: Use .str.len() on the column or index that contains the names of the movies.

In [ ]:
 
✅ Click here to see the answer.
(
    imdb.assign(title_length=imdb.index.str.len())
    .groupby('Year').mean()
    .sort_values(by='title_length')
    .index[-1]
)
The year is 1964 – take a look at the movies from 1964 by querying!

Question: What is the average rating of movies from years that had at least 3 movies in the Top 250?¶

In [ ]:
 
✅ Click here to see the answer.
# A Series of Trues and Falses; True when there were at least 3 movies on the list from that year
more_than_3_ix = imdb.groupby('Year').count().get('Votes') > 3

# The sum of the ratings of movies from years that had at least 3 movies on the list
total_rating = imdb.groupby('Year').sum()[more_than_3_ix].get('Rating').sum()

# The total number of movies from years that had at least 3 movies on the list
count = imdb.groupby('Year').count()[more_than_3_ix].get('Rating').sum()

# The correct answer
average_rating = total_rating / count

# Close, but incorrect: 
# Doesn't account for the fact that different years have different numbers of movies on the list
close_but_wrong = imdb.groupby('Year').mean()[more_than_3_ix].get('Rating').mean()

Summary, next time¶

Summary¶

  • To write queries that involve multiple conditions:
    • Put parentheses around all conditions.
    • Separate conditions using & if you require all to be true, or | if you require at least one to be true.
  • df.groupby(column_name).agg_method() aggregates all rows with the same value for column_name into a single row in the resulting DataFrame, using agg_method() to combine values.
    • Common aggregation methods include .count(), .sum(), .mean(), .median(), .max(), and .min().
    • Tip: look for keywords "per," "for each," and "indexed by" when solving problems.

Next time¶

A picture is worth a 1000 words – it's time to visualize!