# Run this cell to set up packages for lecture.
from lec05_imports import *
Announcements¶
- Quiz 1 is coming up on Wednesday in your assigned quiz session.
- This will be a 20 minute paper-based quiz consisting of short answer and multiple choice questions.
- No aids are allowed (no notes, no calculators, no computers, no reference sheet). Questions are designed with this in mind.
- The quiz covers Lectures 1 through 4, or BPD 1-9 in the
babypandas
notes.- Attend discussion section today to work on practice problems.
- Homework 1 is due Wednesday also; note: working on the homework is excellent preparation for the quiz.
- Come to office hours (see the schedule here) and post on Ed for help!
Agenda¶
- Recap: accessing using row labels.
- Querying.
- Querying with multiple conditions.
- Grouping.
- After class: challenge problems.
Don't forget about these resources!¶
You belong here! 🤝¶
- We're moving very quickly in this class.
- This may be the first time you're ever writing code, and you may question whether or not you belong in this class, or if data science is for you.
- We promise, no matter what your prior experience is, the answer is yes, you belong!
- Please come to office hours (see the schedule here) and post on Ed for help – we're here to make sure you succeed in this course.
The data: US states 🗽¶
We'll continue working with the same data from last time.
states = bpd.read_csv('data/states.csv')
states = states.assign(Density=states.get('Population') / states.get('Land Area'))
states
State | Region | Capital City | Population | Land Area | Party | Density | |
---|---|---|---|---|---|---|---|
0 | Alabama | South | Montgomery | 5024279 | 50645 | Republican | 99.21 |
1 | Alaska | West | Juneau | 733391 | 570641 | Republican | 1.29 |
2 | Arizona | West | Phoenix | 7151502 | 113594 | Republican | 62.96 |
... | ... | ... | ... | ... | ... | ... | ... |
47 | West Virginia | South | Charleston | 1793716 | 24038 | Republican | 74.62 |
48 | Wisconsin | Midwest | Madison | 5893718 | 54158 | Republican | 108.82 |
49 | Wyoming | West | Cheyenne | 576851 | 97093 | Republican | 5.94 |
50 rows × 7 columns
Index by state, not row numbers.
states = states.set_index('State')
The problem¶
We want to create a DataFrame consisting of only the states whose 'Region'
is 'West'
. How do we do that?
The solution¶
# This DataFrame only contains rows where the 'Region' is 'West'!
only_west = states[states.get('Region') == 'West']
only_west
Region | Capital City | Population | Land Area | Party | Density | |
---|---|---|---|---|---|---|
State | ||||||
Alaska | West | Juneau | 733391 | 570641 | Republican | 1.29 |
Arizona | West | Phoenix | 7151502 | 113594 | Republican | 62.96 |
California | West | Sacramento | 39538223 | 155779 | Democratic | 253.81 |
... | ... | ... | ... | ... | ... | ... |
Utah | West | Salt Lake City | 3271616 | 82170 | Republican | 39.82 |
Washington | West | Olympia | 7705281 | 66456 | Democratic | 115.95 |
Wyoming | West | Cheyenne | 576851 | 97093 | Republican | 5.94 |
13 rows × 6 columns
🤯 What just happened?
Aside: Booleans (another type)¶
- When we compare two values, the result is either
True
orFalse
.- Notice, these words are not in quotes.
bool
is a data type in Python, just likeint
,float
, andstr
.- It stands for "Boolean", named after George Boole, an early mathematician.
- There are only two possible Boolean values:
True
orFalse
.- Yes or no.
- On or off.
- 1 or 0.
5 == 6
False
type(5 == 6)
bool
9 + 10 < 21
True
Comparison operators¶
There are several types of comparisons we can make.
symbol | meaning |
---|---|
== |
equal to |
!= |
not equal to |
< |
less than |
<= |
less than or equal to |
> |
greater than |
>= |
greater than or equal to |
When comparing an entire Series to a single value, the result is a Series of bool
s (via broadcasting).
states
Region | Capital City | Population | Land Area | Party | Density | |
---|---|---|---|---|---|---|
State | ||||||
Alabama | South | Montgomery | 5024279 | 50645 | Republican | 99.21 |
Alaska | West | Juneau | 733391 | 570641 | Republican | 1.29 |
Arizona | West | Phoenix | 7151502 | 113594 | Republican | 62.96 |
... | ... | ... | ... | ... | ... | ... |
West Virginia | South | Charleston | 1793716 | 24038 | Republican | 74.62 |
Wisconsin | Midwest | Madison | 5893718 | 54158 | Republican | 108.82 |
Wyoming | West | Cheyenne | 576851 | 97093 | Republican | 5.94 |
50 rows × 6 columns
states.get('Region') == 'West'
State Alabama False Alaska True Arizona True ... West Virginia False Wisconsin False Wyoming True Name: Region, Length: 50, dtype: bool
What is a query? 🤔¶
- A query is code that extracts rows from a DataFrame for which certain condition(s) are true.
- We use queries to filter DataFrames to contain only the rows that satisfy given conditions.
How do we query a DataFrame?¶
To select only certain rows of states
:
- Make a sequence (list/array/Series) of
True
s (keep) andFalse
s (discard), usually by making a comparison. - Then pass it into
states[sequence_goes_here]
.
states[states.get('Region') == 'West']
Region | Capital City | Population | Land Area | Party | Density | |
---|---|---|---|---|---|---|
State | ||||||
Alaska | West | Juneau | 733391 | 570641 | Republican | 1.29 |
Arizona | West | Phoenix | 7151502 | 113594 | Republican | 62.96 |
California | West | Sacramento | 39538223 | 155779 | Democratic | 253.81 |
... | ... | ... | ... | ... | ... | ... |
Utah | West | Salt Lake City | 3271616 | 82170 | Republican | 39.82 |
Washington | West | Olympia | 7705281 | 66456 | Democratic | 115.95 |
Wyoming | West | Cheyenne | 576851 | 97093 | Republican | 5.94 |
13 rows × 6 columns
What if the condition isn't satisfied?¶
states[states.get('Region') == 'Pacific Northwest']
Region | Capital City | Population | Land Area | Party | Density | |
---|---|---|---|---|---|---|
State |
states.get('Region') == 'Pacific Northwest'
State Alabama False Alaska False Arizona False ... West Virginia False Wisconsin False Wyoming False Name: Region, Length: 50, dtype: bool
Example 6: What proportion of US states are Republican?¶
Key concept: Shape of a DataFrame.
Strategy¶
- Query to extract a DataFrame of just the states where the
'Party'
is'Republican'
. - Count the number of such states.
- Divide by the total number of states.
only_rep = states[states.get('Party') == 'Republican']
only_rep
Region | Capital City | Population | Land Area | Party | Density | |
---|---|---|---|---|---|---|
State | ||||||
Alabama | South | Montgomery | 5024279 | 50645 | Republican | 99.21 |
Alaska | West | Juneau | 733391 | 570641 | Republican | 1.29 |
Arizona | West | Phoenix | 7151502 | 113594 | Republican | 62.96 |
... | ... | ... | ... | ... | ... | ... |
West Virginia | South | Charleston | 1793716 | 24038 | Republican | 74.62 |
Wisconsin | Midwest | Madison | 5893718 | 54158 | Republican | 108.82 |
Wyoming | West | Cheyenne | 576851 | 97093 | Republican | 5.94 |
31 rows × 6 columns
Shape of a DataFrame¶
.shape
returns the number of rows and columns in a given DataFrame..shape
is not a method, so we don't use parentheses..shape
is an attribute, as it describes the DataFrame.
- Access each with
[]
:.shape[0]
for rows..shape[1]
for columns.
only_rep.shape
(31, 6)
# Number of rows.
only_rep.shape[0]
31
# Number of columns.
only_rep.shape[1]
6
# What proportion of US states are Republican?
only_rep.shape[0] / states.shape[0]
0.62
Example 7: Which Midwestern state has the most land area?¶
Key concepts: Working with the index. Combining multiple steps.
Strategy¶
- Query to extract a DataFrame of just the states in the
'Midwest'
. - Sort by
'Land Area'
in descending order. - Extract the first element from the index.
midwest = states[states.get('Region') == 'Midwest']
midwest
Region | Capital City | Population | Land Area | Party | Density | |
---|---|---|---|---|---|---|
State | ||||||
Illinois | Midwest | Springfield | 12812508 | 55519 | Democratic | 230.78 |
Indiana | Midwest | Indianapolis | 6785528 | 35826 | Republican | 189.40 |
Iowa | Midwest | Des Moines | 3190369 | 55857 | Republican | 57.12 |
... | ... | ... | ... | ... | ... | ... |
Ohio | Midwest | Columbus | 11799448 | 40861 | Republican | 288.77 |
South Dakota | Midwest | Pierre | 886667 | 75811 | Republican | 11.70 |
Wisconsin | Midwest | Madison | 5893718 | 54158 | Republican | 108.82 |
12 rows × 6 columns
midwest_sorted = midwest.sort_values(by='Land Area', ascending=False)
midwest_sorted
Region | Capital City | Population | Land Area | Party | Density | |
---|---|---|---|---|---|---|
State | ||||||
Kansas | Midwest | Topeka | 2937880 | 81759 | Republican | 35.93 |
Minnesota | Midwest | Saint Paul | 5706494 | 79627 | Democratic | 71.67 |
Nebraska | Midwest | Lincoln | 1961504 | 76824 | Republican | 25.53 |
... | ... | ... | ... | ... | ... | ... |
Wisconsin | Midwest | Madison | 5893718 | 54158 | Republican | 108.82 |
Ohio | Midwest | Columbus | 11799448 | 40861 | Republican | 288.77 |
Indiana | Midwest | Indianapolis | 6785528 | 35826 | Republican | 189.40 |
12 rows × 6 columns
- The answer is Kansas, but how do we get it in code?
midwest_sorted.get('State').iloc[0]
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) Cell In[21], line 1 ----> 1 midwest_sorted.get('State').iloc[0] File ~/miniforge3/lib/python3.11/site-packages/babypandas/utils.py:20, in suppress_warnings.<locals>.wrapper(*args, **kwargs) 18 with warnings.catch_warnings(): 19 warnings.simplefilter("ignore") ---> 20 return func(*args, **kwargs) File ~/miniforge3/lib/python3.11/site-packages/babypandas/bpd.py:327, in DataFrame.get(self, key) 325 if any(mask): 326 k = [key] if isinstance(key, str) else key --> 327 raise KeyError("{} not found in columns".format(np.array(k)[mask])) 329 f = _lift_to_pd(self._pd.get) 330 return f(key=key) KeyError: "['State'] not found in columns"
Working with the index¶
- We can't use
.get
because.get
is only for columns, and there is no column called'State'
.- Instead,
'State'
is the index of the DataFrame.
- Instead,
- To extract the index of a DataFrame, use
.index
.- Like
.shape
, this is an attribute of the DataFrame, not a method. Don't use parentheses.
- Like
- Access particular elements in the index with
[]
.
midwest_sorted.index
Index(['Kansas', 'Minnesota', 'Nebraska', 'South Dakota', 'North Dakota', 'Missouri', 'Michigan', 'Iowa', 'Illinois', 'Wisconsin', 'Ohio', 'Indiana'], dtype='object', name='State')
midwest_sorted.index[0]
'Kansas'
Combining multiple steps¶
It is not necessary to define the intermediate variables
midwest
andmidwest_sorted
. We can do everything in one line of code.When solving a multi-step problem, develop your solution incrementally. Write one piece of code at a time and run it.
# Full answer, which you should build up one step at a time.
states[states.get('Region') == 'Midwest'].sort_values(by='Land Area', ascending=False).index[0]
'Kansas'
- If a line of code gets too long, enclose it in parentheses to split it over multiple lines.
# You can space your code out like this if needed.
(
states[states.get('Region') == 'Midwest']
.sort_values(by='Land Area', ascending=False)
.index[0]
)
'Kansas'
Concept Check ✅ – Answer at cc.dsc10.com¶
Which expression below evaluates to the total population of the 'West'
?
A. states[states.get('Region') == 'West'].get('Population').sum()
B. states.get('Population').sum()[states.get('Region') == 'West']
C. states['West'].get('Population').sum()
D. More than one of the above.
...
Ellipsis
Example 8: What are the top three most-populated Republican states in the South?¶
Key concepts: Queries with multiple conditions. Selecting rows by position.
Multiple conditions¶
- To write a query with multiple conditions, use
&
for "and" and|
for "or".&
: All conditions must be true.|
: At least one condition must be true.
- You must use
(
parentheses)
around each condition! - 🚨 Don't use the Python keywords
and
andor
here! They do not behave as you'd want.- See BPD 10.3 for an explanation.
states[(states.get('Party') == 'Republican') & (states.get('Region') == 'South')]
Region | Capital City | Population | Land Area | Party | Density | |
---|---|---|---|---|---|---|
State | ||||||
Alabama | South | Montgomery | 5024279 | 50645 | Republican | 99.21 |
Arkansas | South | Little Rock | 3011524 | 52035 | Republican | 57.87 |
Florida | South | Tallahassee | 21538187 | 53625 | Republican | 401.64 |
... | ... | ... | ... | ... | ... | ... |
Tennessee | South | Nashville | 6910840 | 41235 | Republican | 167.60 |
Texas | South | Austin | 29145505 | 261232 | Republican | 111.57 |
West Virginia | South | Charleston | 1793716 | 24038 | Republican | 74.62 |
13 rows × 6 columns
# You can also add line breaks within brackets.
states[(states.get('Party') == 'Republican') &
(states.get('Region') == 'South')]
Region | Capital City | Population | Land Area | Party | Density | |
---|---|---|---|---|---|---|
State | ||||||
Alabama | South | Montgomery | 5024279 | 50645 | Republican | 99.21 |
Arkansas | South | Little Rock | 3011524 | 52035 | Republican | 57.87 |
Florida | South | Tallahassee | 21538187 | 53625 | Republican | 401.64 |
... | ... | ... | ... | ... | ... | ... |
Tennessee | South | Nashville | 6910840 | 41235 | Republican | 167.60 |
Texas | South | Austin | 29145505 | 261232 | Republican | 111.57 |
West Virginia | South | Charleston | 1793716 | 24038 | Republican | 74.62 |
13 rows × 6 columns
The &
and |
operators work element-wise!¶
(states.get('Party') == 'Republican')
State Alabama True Alaska True Arizona True ... West Virginia True Wisconsin True Wyoming True Name: Party, Length: 50, dtype: bool
(states.get('Region') == 'South')
State Alabama True Alaska False Arizona False ... West Virginia True Wisconsin False Wyoming False Name: Region, Length: 50, dtype: bool
(states.get('Party') == 'Republican') & (states.get('Region') == 'South')
State Alabama True Alaska False Arizona False ... West Virginia True Wisconsin False Wyoming False Length: 50, dtype: bool
Original Question: What are the top three most-populated Republican states in the South?¶
(
states[(states.get('Party') == 'Republican') &
(states.get('Region') == 'South')]
.sort_values(by='Population', ascending=False)
)
Region | Capital City | Population | Land Area | Party | Density | |
---|---|---|---|---|---|---|
State | ||||||
Texas | South | Austin | 29145505 | 261232 | Republican | 111.57 |
Florida | South | Tallahassee | 21538187 | 53625 | Republican | 401.64 |
Georgia | South | Atlanta | 10711908 | 57513 | Republican | 186.25 |
... | ... | ... | ... | ... | ... | ... |
Arkansas | South | Little Rock | 3011524 | 52035 | Republican | 57.87 |
Mississippi | South | Jackson | 2961279 | 46923 | Republican | 63.11 |
West Virginia | South | Charleston | 1793716 | 24038 | Republican | 74.62 |
13 rows × 6 columns
How do we extract the first three rows of this DataFrame?
Using .take
to select rows by position¶
- Querying allows us to select rows that satisfy a certain condition.
- We can also select rows in specific positions with
.take(sequence_of_integer_positions)
. This keeps only the rows whose positions are in the specified sequence (list/array).- This is analogous to using
.iloc[]
on a Series. - It's rare to need to select rows by integer position. Querying is far more useful.
- This is analogous to using
(
states[(states.get('Party') == 'Republican') &
(states.get('Region')=='South')]
.sort_values(by='Population', ascending=False)
.take([0, 1, 2])
)
Region | Capital City | Population | Land Area | Party | Density | |
---|---|---|---|---|---|---|
State | ||||||
Texas | South | Austin | 29145505 | 261232 | Republican | 111.57 |
Florida | South | Tallahassee | 21538187 | 53625 | Republican | 401.64 |
Georgia | South | Atlanta | 10711908 | 57513 | Republican | 186.25 |
.take(np.arange(3))
could equivalently be used in place of.take([0, 1, 2])
.
Extra Practice¶
Write code to answer each question below.
- What is the capital city of the state in the
'West'
with the largest land area? - How many states in the
'Northeast'
have more land area than an average US state? - What is the total population of the
'Midwest'
,'South'
, and'Northeast
?
✅ Click here to see the answers after you've attempted the problems on your own.
- What is the capital city of the state in the West with the largest land area?
states[states.get('Region') == 'West'].sort_values(by='Land Area', ascending=False).get('Capital City').iloc[0]
- How many states in the Northeast have more land area than an average US state?
states[(states.get('Region') == 'Northeast') & (states.get('Land Area') > states.get('Land Area').mean())].shape[0]
- What is the total population of the Midwest, South, and Northeast?
states[(states.get('Region') == 'Midwest') | (states.get('Region') == 'South') | (states.get('Region') == 'Northeast')].get('Population').sum()
Alternate solution to 3:
states.get('Population').sum() - states[states.get('Region') == 'West'].get('Population').sum()
...
Ellipsis
Example 9: Which region is most populated?¶
Key concept: Grouping by one column.
Organizing states by region¶
We can find the total population of any one region using the tools we already have.
states[states.get('Region') == 'West'].get('Population').sum()
78588572
states[states.get('Region') == 'Midwest'].get('Population').sum()
68985454
But can we find the total population of every region all at the same time, without writing very similar code multiple times? Yes, there is a better way!
A new method: .groupby
¶
Observe what happens when we use the .groupby
method on states
with the argument 'Region'
.
states.groupby('Region').sum()
Population | Land Area | Density | |
---|---|---|---|
Region | |||
Midwest | 68985454 | 750524 | 1298.78 |
Northeast | 57609148 | 161912 | 4957.49 |
South | 125576562 | 868356 | 3189.37 |
West | 78588572 | 1751054 | 881.62 |
These populations (for the 'West'
and 'Midwest'
) match the ones we found on the previous slide, except now we get the populations for all regions at the same time. What just happened? 🤯
An illustrative example: Pets 🐱 🐶🐹¶
Consider the DataFrame pets
, shown below.
Species | Color | Weight | Age | |
---|---|---|---|---|
0 | dog | black | 40 | 5.0 |
1 | cat | golden | 15 | 8.0 |
2 | cat | black | 20 | 9.0 |
3 | dog | white | 80 | 2.0 |
4 | dog | golden | 25 | 0.5 |
5 | hamster | golden | 1 | 3.0 |
Let's see what happens under the hood when we run pets.groupby('Species').mean()
.
show_grouping_animation()
Let's try it out!¶
pets = bpd.DataFrame().assign(
Species=['dog', 'cat', 'cat', 'dog', 'dog', 'hamster'],
Color=['black', 'golden', 'black', 'white', 'golden', 'golden'],
Weight=[40, 15, 20, 80, 25, 1],
Age=[5, 8, 9, 2, 0.5, 3]
)
pets
Species | Color | Weight | Age | |
---|---|---|---|---|
0 | dog | black | 40 | 5.0 |
1 | cat | golden | 15 | 8.0 |
2 | cat | black | 20 | 9.0 |
3 | dog | white | 80 | 2.0 |
4 | dog | golden | 25 | 0.5 |
5 | hamster | golden | 1 | 3.0 |
pets.groupby('Species').mean()
Weight | Age | |
---|---|---|
Species | ||
cat | 17.50 | 8.5 |
dog | 48.33 | 2.5 |
hamster | 1.00 | 3.0 |
It takes several steps to go from the original pets
DataFrame to this grouped DataFrame, but we don't get to see any of Python's inner workings, just the final output.
Back to states: which region is most populated?¶
states
Region | Capital City | Population | Land Area | Party | Density | |
---|---|---|---|---|---|---|
State | ||||||
Alabama | South | Montgomery | 5024279 | 50645 | Republican | 99.21 |
Alaska | West | Juneau | 733391 | 570641 | Republican | 1.29 |
Arizona | West | Phoenix | 7151502 | 113594 | Republican | 62.96 |
... | ... | ... | ... | ... | ... | ... |
West Virginia | South | Charleston | 1793716 | 24038 | Republican | 74.62 |
Wisconsin | Midwest | Madison | 5893718 | 54158 | Republican | 108.82 |
Wyoming | West | Cheyenne | 576851 | 97093 | Republican | 5.94 |
50 rows × 6 columns
states.groupby('Region').sum()
Population | Land Area | Density | |
---|---|---|---|
Region | |||
Midwest | 68985454 | 750524 | 1298.78 |
Northeast | 57609148 | 161912 | 4957.49 |
South | 125576562 | 868356 | 3189.37 |
West | 78588572 | 1751054 | 881.62 |
# Note the use of .index – remember, the index isn't a column!
(
states
.groupby('Region')
.sum()
.sort_values(by='Population', ascending=False)
.index[0]
)
'South'
Using .groupby
in general¶
In short, .groupby
aggregates (collects) all rows with the same value in a specified column (e.g. 'Region'
) into a single row in the resulting DataFrame, using an aggregation method (e.g. .sum()
) to combine values from different rows with the same value in the specified column.
To use .groupby
:
- Choose a column to group by.
.groupby(column_name)
will gather rows which have the same value in the specified column (column_name
).- In the resulting DataFrame, there will be one row for every unique value in that column.
- Choose an aggregation method.
- The aggregation method will be applied within each group.
- The aggregation method is applied individually to each column.
- If it doesn't make sense to use the aggregation method on a column, the column is dropped from the output.
- Common aggregation methods include
.count()
,.sum()
,.mean()
,.median()
,.max()
, and.min()
.
Observations on grouping¶
- After grouping, the index changes. The new row labels are the group labels (i.e., the unique values in the column that we grouped on), sorted in ascending order.
states
Region | Capital City | Population | Land Area | Party | Density | |
---|---|---|---|---|---|---|
State | ||||||
Alabama | South | Montgomery | 5024279 | 50645 | Republican | 99.21 |
Alaska | West | Juneau | 733391 | 570641 | Republican | 1.29 |
Arizona | West | Phoenix | 7151502 | 113594 | Republican | 62.96 |
... | ... | ... | ... | ... | ... | ... |
West Virginia | South | Charleston | 1793716 | 24038 | Republican | 74.62 |
Wisconsin | Midwest | Madison | 5893718 | 54158 | Republican | 108.82 |
Wyoming | West | Cheyenne | 576851 | 97093 | Republican | 5.94 |
50 rows × 6 columns
states.groupby('Region').sum()
Population | Land Area | Density | |
---|---|---|---|
Region | |||
Midwest | 68985454 | 750524 | 1298.78 |
Northeast | 57609148 | 161912 | 4957.49 |
South | 125576562 | 868356 | 3189.37 |
West | 78588572 | 1751054 | 881.62 |
Tip: look for keywords "per," "for each," and "indexed by" when solving problems.
- The aggregation method is applied separately to each column. If it does not make sense to apply the aggregation method to a certain column, the column will disappear. 🐇🎩
- Since the aggregation method is applied to each column separately, the rows of the resulting DataFrame need to be interpreted with care.
states.groupby('Region').max()
Capital City | Population | Land Area | Party | Density | |
---|---|---|---|---|---|
Region | |||||
Midwest | Topeka | 12812508 | 81759 | Republican | 288.77 |
Northeast | Trenton | 20201249 | 47126 | Republican | 1263.12 |
South | Tallahassee | 29145505 | 261232 | Republican | 636.37 |
West | Santa Fe | 39538223 | 570641 | Republican | 253.81 |
12812508 / 81759 == 288.77
False
- The column names don't make sense after grouping with the
.count()
aggregation method.
states.groupby('Region').count()
Capital City | Population | Land Area | Party | Density | |
---|---|---|---|---|---|
Region | |||||
Midwest | 12 | 12 | 12 | 12 | 12 |
Northeast | 9 | 9 | 9 | 9 | 9 |
South | 16 | 16 | 16 | 16 | 16 |
West | 13 | 13 | 13 | 13 | 13 |
Dropping, renaming, and reordering columns¶
Consider dropping unneeded columns and renaming columns as follows:
- Use
.assign
to create a new column containing the same values as the old column(s). - Use
.drop(columns=list_of_column_labels)
to drop the old column(s).- Alternatively, use
.get(list_of_column_labels)
to keep only the columns in the given list. The columns will appear in the order you specify, so this is also useful for reordering columns!
- Alternatively, use
states_by_region = states.groupby('Region').count()
states_by_region = states_by_region.assign(
States=states_by_region.get('Capital City')
).get(['States'])
states_by_region
States | |
---|---|
Region | |
Midwest | 12 |
Northeast | 9 |
South | 16 |
West | 13 |
Challenge problems: IMDb dataset 🎞️¶
from IPython.display import YouTubeVideo
YouTubeVideo('xg7rnjWnZ48')
Before watching the video or looking at the solutions, make sure to try these problems on your own – they're great prep for homeworks, projects, and exams! Feel free to ask about them in office hours or on Ed.
imdb = bpd.read_csv('data/imdb.csv').set_index('Title').sort_values(by='Rating')
imdb
Votes | Rating | Year | Decade | |
---|---|---|---|---|
Title | ||||
Akira | 91652 | 8.0 | 1988 | 1980 |
Per un pugno di dollari | 124671 | 8.0 | 1964 | 1960 |
Guardians of the Galaxy | 527349 | 8.0 | 2014 | 2010 |
... | ... | ... | ... | ... |
The Godfather: Part II | 692753 | 9.0 | 1974 | 1970 |
The Shawshank Redemption | 1498733 | 9.2 | 1994 | 1990 |
The Godfather | 1027398 | 9.2 | 1972 | 1970 |
250 rows × 4 columns
Question: How many movies appear from each decade?¶
imdb.groupby('Decade').count()
Votes | Rating | Year | |
---|---|---|---|
Decade | |||
1920 | 4 | 4 | 4 |
1930 | 7 | 7 | 7 |
1940 | 14 | 14 | 14 |
... | ... | ... | ... |
1990 | 42 | 42 | 42 |
2000 | 50 | 50 | 50 |
2010 | 29 | 29 | 29 |
10 rows × 3 columns
# We'll learn how to make plots like this in the next lecture!
imdb.groupby('Decade').count().plot(y='Year');
Question: What was the highest rated movie of the 1990s?¶
Let's try to do this two different ways.
Without grouping¶
imdb[imdb.get('Decade') == 1990].sort_values('Rating', ascending=False).index[0]
'The Shawshank Redemption'
Note: The command to extract the index of a DataFrame is .index
- no parentheses! This is different than the way we extract columns, with .get()
, because the index is not a column.
With grouping¶
imdb.reset_index().groupby('Decade').max()
Title | Votes | Rating | Year | |
---|---|---|---|---|
Decade | ||||
1920 | The Kid | 98794 | 8.3 | 1927 |
1930 | The Wizard of Oz | 259235 | 8.5 | 1939 |
1940 | The Treasure of the Sierra Madre | 350551 | 8.6 | 1949 |
... | ... | ... | ... | ... |
1990 | Unforgiven | 1498733 | 9.2 | 1999 |
2000 | Yip Man | 1473049 | 8.9 | 2009 |
2010 | X-Men: Days of Future Past | 1271949 | 8.7 | 2015 |
10 rows × 4 columns
- It turns out that this method does not yield the correct answer.
- When we use an aggregation method (e.g.
.max()
), aggregation is done to each column individually. - While it's true that the highest rated movie from the 1990s has a rating of 9.2, that movie is not Unforgiven – instead, Unforgiven is the movie that's the latest in the alphabet among all movies from the 1990s.
- Taking the
max
is not helpful here.
Question: How many years have more than 3 movies rated above 8.5?¶
✅ Click here to see the answer.
good_movies_per_year = imdb[imdb.get('Rating') > 8.5].groupby('Year').count() good_movies_per_year[good_movies_per_year.get('Votes') > 3].shape[0]
As mentioned below, you can also use:
(good_movies_per_year.get('Votes') > 3).sum()
Aside: Using .sum()
on a boolean array¶
- Summing a boolean array gives a count of the number of
True
elements because Python treatsTrue
as 1 andFalse
as 0. - Can you use that fact here?
Question: Out of the years with more than 3 movies, which had the highest average rating?¶
✅ Click here to see the answer.
# A Series of Trues and Falses; True when there were at least 3 movies on the list from that year more_than_3_ix = imdb.groupby('Year').count().get('Votes') > 3 # The year with the highest average rating, among the years with more than 3 movies imdb.groupby('Year').mean()[more_than_3_ix].sort_values(by='Rating').index[-1]
Question: Which year had the longest movie titles, on average?¶
Hint: Use .str.len()
on the column or index that contains the names of the movies.
✅ Click here to see the answer.
( imdb.assign(title_length=imdb.index.str.len()) .groupby('Year').mean() .sort_values(by='title_length') .index[-1] )
The year is 1964 – take a look at the movies from 1964 by querying!
Question: What is the average rating of movies from years that had at least 3 movies in the Top 250?¶
✅ Click here to see the answer.
# A Series of Trues and Falses; True when there were at least 3 movies on the list from that year more_than_3_ix = imdb.groupby('Year').count().get('Votes') > 3 # The sum of the ratings of movies from years that had at least 3 movies on the list total_rating = imdb.groupby('Year').sum()[more_than_3_ix].get('Rating').sum() # The total number of movies from years that had at least 3 movies on the list count = imdb.groupby('Year').count()[more_than_3_ix].get('Rating').sum() # The correct answer average_rating = total_rating / count # Close, but incorrect: # Doesn't account for the fact that different years have different numbers of movies on the list close_but_wrong = imdb.groupby('Year').mean()[more_than_3_ix].get('Rating').mean()
Summary, next time¶
Summary¶
- To write queries that involve multiple conditions:
- Put parentheses around all conditions.
- Separate conditions using
&
if you require all to be true, or|
if you require at least one to be true.
df.groupby(column_name).agg_method()
aggregates all rows with the same value forcolumn_name
into a single row in the resulting DataFrame, usingagg_method()
to combine values.- Common aggregation methods include
.count()
,.sum()
,.mean()
,.median()
,.max()
, and.min()
. - Tip: look for keywords "per," "for each," and "indexed by" when solving problems.
- Common aggregation methods include
Next time¶
A picture is worth a 1000 words – it's time to visualize!