In [1]:
# Set up packages for lecture. Don't worry about understanding this code,
# but make sure to run it if you're following along.
import numpy as np
import babypandas as bpd

import matplotlib.pyplot as plt
plt.style.use('ggplot')

np.set_printoptions(threshold=20, precision=2, suppress=True)
import pandas as pd
pd.set_option("display.max_rows", 7)
pd.set_option("display.max_columns", 8)
pd.set_option("display.precision", 2)

Lecture 5 – More Querying and GroupBy¶

DSC 10, Spring 2023¶

Announcements¶

  • Lab 1 is due on Saturday at 11:59PM.
  • Homework 1 is due on Tuesday at 11:59PM.
    • Do Lab 1 before Homework 1.
    • Avoid submission errors.
  • Discussion 2 is today, and we'll be covering old exam problems on this week's material.
    • You must be present when attendance is taken in discussion to get credit, even if you have a conflicting class.

Agenda¶

  • Recap: Queries.
  • Queries with multiple conditions.
  • GroupBy.
  • Extra practice, including challenge problems.

Don't forget about these resources!¶

  • DSC 10 Reference Sheet 📌.
  • babypandas notes.
  • babypandas documentation.
  • The Resources tab of the course website.

You belong here! 🫂¶

  • We're moving very quickly in this class.
  • This may be the first time you're ever writing code, and you may question whether or not you belong in this class, or if data science is for you.
  • We promise, no matter what your prior experience is, the answer is yes, you belong!
    • Watch: 🎥 Developing a Growth Mindset with Carol Dweck.
  • Please come to office hours (see the schedule here) and post on Ed for help – we're here to make sure you succeed in this course.

About the Data: Get It Done service requests 👷¶

Recall, the requests DataFrame contains a summary of all service requests so far this year, broken down by neighborhood and service.

In [2]:
requests = bpd.read_csv('data/get-it-done-requests.csv')
requests = requests.assign(total=requests.get('closed') + requests.get('open'))
requests
Out[2]:
neighborhood service closed open total
0 Balboa Park Dead Animal 11 0 11
1 Balboa Park Development Services - Code Enforcement 2 0 2
2 Balboa Park Encampment 215 20 235
... ... ... ... ... ...
1418 Via De La Valle Pothole 11 7 18
1419 Via De La Valle Sidewalk Repair Issue 0 1 1
1420 Via De La Valle Street Sweeping 1 0 1

1421 rows × 5 columns

Recap: Queries¶

What is a query? 🤔¶

  • A "query" is code that extracts rows from a DataFrame for which certain condition(s) are true.
  • We often use queries to filter DataFrames so that they only contain the rows that satisfy the conditions stated in our questions.

Comparison operators¶

There are several types of comparisons we can make.

symbol meaning
== equal to
!= not equal to
< less than
<= less than or equal to
> greater than
>= greater than or equal to
In [3]:
5 == 6
Out[3]:
False
In [4]:
type(5 == 6)
Out[4]:
bool
In [5]:
9 + 10 < 21
Out[5]:
True
In [6]:
'zebra' == 'zeb' + 'ra'
Out[6]:
True

How do we query a DataFrame?¶

To select only certain rows of requests:

  1. Make a sequence (list/array/Series) of Trues (keep) and Falses (toss), usually by making a comparison.
  2. Then pass it into requests[sequence_goes_here].
In [7]:
requests
Out[7]:
neighborhood service closed open total
0 Balboa Park Dead Animal 11 0 11
1 Balboa Park Development Services - Code Enforcement 2 0 2
2 Balboa Park Encampment 215 20 235
... ... ... ... ... ...
1418 Via De La Valle Pothole 11 7 18
1419 Via De La Valle Sidewalk Repair Issue 0 1 1
1420 Via De La Valle Street Sweeping 1 0 1

1421 rows × 5 columns

In [8]:
# A Boolean Series.
requests.get('closed') > 5
Out[8]:
0        True
1       False
2        True
        ...  
1418     True
1419    False
1420    False
Name: closed, Length: 1421, dtype: bool
In [9]:
# A query.
requests[requests.get('closed') > 5]
Out[9]:
neighborhood service closed open total
0 Balboa Park Dead Animal 11 0 11
2 Balboa Park Encampment 215 20 235
3 Balboa Park Environmental Services Code Compliance 8 1 9
... ... ... ... ... ...
1413 Uptown Waste on Private Property 40 2 42
1414 Uptown Weed Cleanup 21 3 24
1418 Via De La Valle Pothole 11 7 18

920 rows × 5 columns

Example 5: Which neighborhood has the most 'Pothole' requests? 🕳¶

Key concept: Querying.

Strategy¶

  1. Query to extract a DataFrame of just the 'Pothole' requests.
  2. Sort by 'total' in descending order.
  3. Extract the first element from the 'neighborhood' column.
In [10]:
# This DataFrame only contains rows where the 'service' is 'Pothole'!
only_potholes = requests[requests.get('service') == 'Pothole']
only_potholes
Out[10]:
neighborhood service closed open total
12 Balboa Park Pothole 70 8 78
40 Barrio Logan Pothole 111 9 120
67 Black Mountain Ranch Pothole 19 1 20
... ... ... ... ... ...
1366 University Pothole 712 249 961
1396 Uptown Pothole 513 87 600
1418 Via De La Valle Pothole 11 7 18

56 rows × 5 columns

In [11]:
# You can space your code out like this if needed.
(
    only_potholes
    .sort_values('total', ascending=False)
    .get('neighborhood')
    .iloc[0]
)
Out[11]:
'Clairemont Mesa'

What if the condition isn't satisfied?¶

In [12]:
requests[requests.get('service') == 'Car Maintenance']
Out[12]:
neighborhood service closed open total

Concept Check ✅ – Answer at cc.dsc10.com¶

Which expression below evaluates to the total number of service requests in the 'Downtown' neighborhood?

A. requests[requests.get('neighborhood') == 'Downtown'].get('total').sum()

B. requests.get('total').sum()[requests.get('neighborhood') == 'Downtown']

C. requests['Downtown'].get('total').sum()

D. More than one of the above.

In [13]:
...
Out[13]:
Ellipsis

Activity 🚘¶

Question: What is the most commonly requested service in the 'University' neighborhood (near UCSD)?

Write one line of code that evaluates to the answer.

In [14]:
...
Out[14]:
Ellipsis

Example 6: How many service requests were for 'Pothole' or 'Dead Animal'?¶

Key concept: Queries with multiple conditions.

Multiple conditions¶

  • To write a query with multiple conditions, use & for "and" and | for "or".
    • &: All conditions must be true.
    • |: At least one condition must be true.
  • You must use (parentheses) around each condition!
  • 🚨 Don't use the Python keywords and and or here! They do not behave as you'd want – see the sketch after this list.
    • See BPD 10.3 for an explanation.
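For example, here's a minimal sketch (not from the original notebook) of the difference. With a pandas-backed Series, the keyword version typically raises an error about an ambiguous truth value instead of comparing element-wise.

# Element-wise "or": produces a Boolean Series with one True/False per row.
(requests.get('service') == 'Pothole') | (requests.get('service') == 'Dead Animal')

# Keyword "or": Python tries to reduce each Series to a single True/False,
# which is ambiguous, so this raises an error rather than comparing row by row.
# (requests.get('service') == 'Pothole') or (requests.get('service') == 'Dead Animal')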
In [15]:
requests[(requests.get('service') == 'Pothole') | (requests.get('service') == 'Dead Animal')]
Out[15]:
neighborhood service closed open total
0 Balboa Park Dead Animal 11 0 11
12 Balboa Park Pothole 70 8 78
28 Barrio Logan Dead Animal 2 0 2
... ... ... ... ... ...
1396 Uptown Pothole 513 87 600
1415 Via De La Valle Dead Animal 1 0 1
1418 Via De La Valle Pothole 11 7 18

109 rows × 5 columns

In [16]:
# You can add line breaks within brackets or parentheses.
requests[(requests.get('service') == 'Pothole') | 
         (requests.get('service') == 'Dead Animal')]
Out[16]:
neighborhood service closed open total
0 Balboa Park Dead Animal 11 0 11
12 Balboa Park Pothole 70 8 78
28 Barrio Logan Dead Animal 2 0 2
... ... ... ... ... ...
1396 Uptown Pothole 513 87 600
1415 Via De La Valle Dead Animal 1 0 1
1418 Via De La Valle Pothole 11 7 18

109 rows × 5 columns

The & and | operators work element-wise!¶

In [17]:
(requests.get('service') == 'Pothole')
Out[17]:
0       False
1       False
2       False
        ...  
1418     True
1419    False
1420    False
Name: service, Length: 1421, dtype: bool
In [18]:
(requests.get('service') == 'Dead Animal')
Out[18]:
0        True
1       False
2       False
        ...  
1418    False
1419    False
1420    False
Name: service, Length: 1421, dtype: bool
In [19]:
(requests.get('service') == 'Pothole') | (requests.get('service') == 'Dead Animal')
Out[19]:
0        True
1       False
2       False
        ...  
1418     True
1419    False
1420    False
Name: service, Length: 1421, dtype: bool

Original Question: How many service requests were for 'Pothole' or 'Dead Animal'?¶

In [20]:
requests[(requests.get('service') == 'Pothole') | 
         (requests.get('service') == 'Dead Animal')].get('total').sum()
Out[20]:
18512

Concept Check ✅ – Answer at cc.dsc10.com¶

Each of the following questions can be answered by querying the requests DataFrame.

  1. Which neighborhood had the most 'Street Flooded' requests?
  2. In the 'Kearny Mesa' neighborhood, how many different types of services have open requests?
  3. How many requests have been closed in the 'La Jolla' neighborhood?

How many of the questions above require the query to have multiple conditions?

A. 0              B. 1              C. 2              D. 3

Bonus: Try to write code to answer each question.

In [21]:
...
Out[21]:
Ellipsis

Aside: Using .take to select rows by position¶

  • Querying allows us to select rows that satisfy a certain condition.
  • We can also select rows in specific positions with .take([list_of_integer_positions]). This keeps only the rows whose positions are in the specified list.
    • This is analogous to using .iloc[] on a Series.
    • It's rare to need to select rows by integer position. Querying is far more useful.
In [22]:
requests
Out[22]:
neighborhood service closed open total
0 Balboa Park Dead Animal 11 0 11
1 Balboa Park Development Services - Code Enforcement 2 0 2
2 Balboa Park Encampment 215 20 235
... ... ... ... ... ...
1418 Via De La Valle Pothole 11 7 18
1419 Via De La Valle Sidewalk Repair Issue 0 1 1
1420 Via De La Valle Street Sweeping 1 0 1

1421 rows × 5 columns

In [23]:
requests.take([1, 3, 5])
Out[23]:
neighborhood service closed open total
1 Balboa Park Development Services - Code Enforcement 2 0 2
3 Balboa Park Environmental Services Code Compliance 8 1 9
5 Balboa Park Graffiti - Public 343 25 368
In [24]:
requests.get('service').iloc[[1, 3, 5]]
Out[24]:
1    Development Services - Code Enforcement
3     Environmental Services Code Compliance
5                          Graffiti - Public
Name: service, dtype: object
In [25]:
requests.take(np.arange(5))
Out[25]:
neighborhood service closed open total
0 Balboa Park Dead Animal 11 0 11
1 Balboa Park Development Services - Code Enforcement 2 0 2
2 Balboa Park Encampment 215 20 235
3 Balboa Park Environmental Services Code Compliance 8 1 9
4 Balboa Park Graffiti - Code Enforcement 2 1 3

Example 7: Which neighborhood had the most requests?¶

Key concept: Grouping by one column.

Organizing requests by neighborhood¶

We can find the total number of Get It Done requests in any one neighborhood using the tools we already have.

In [26]:
requests[requests.get('neighborhood') == 'Black Mountain Ranch'].get('total').sum()
Out[26]:
195
In [27]:
requests[requests.get('neighborhood') == 'Uptown'].get('total').sum()
Out[27]:
7345

If we wanted to find the total number of requests in every neighborhood, this would be quite inconvenient... there has to be a better way!

A new method: .groupby¶

Observe what happens when we use the .groupby method on requests with the argument 'neighborhood'.

In [28]:
requests.groupby('neighborhood').sum()
Out[28]:
closed open total
neighborhood
Balboa Park 1173 261 1434
Barrio Logan 1185 201 1386
Black Mountain Ranch 151 44 195
... ... ... ...
University 1614 620 2234
Uptown 5654 1691 7345
Via De La Valle 13 12 25

57 rows × 3 columns

Note that the 'total' counts for Black Mountain Ranch and Uptown are the same as we saw on the previous slide. What just happened? 🤯

An illustrative example: Pets 🐱 🐶🐹¶

Consider the DataFrame pets, shown below.

Species Color Weight Age
0 dog black 40 5.0
1 cat golden 15 8.0
2 cat black 20 9.0
3 dog white 80 2.0
4 dog golden 25 0.5
5 hamster golden 1 3.0

When we run pets.groupby('Species').mean(), babypandas does three things under the hood.

Step 1: Split¶

First, it splits the rows of pets into "groups" according to their values in the 'Species' column.

🐶
Species Color Weight Age
0 dog black 40 5.0
3 dog white 80 2.0
4 dog golden 25 0.5


🐱

Species Color Weight Age
1 cat golden 15 8.0
2 cat black 20 9.0


🐹

Species Color Weight Age
5 hamster golden 1 3.0

Step 2: Aggregate¶

Then, it aggregates the rows with the same value of 'Species' by taking the mean of all numerical columns.

Weight Age
dog 48.33 2.5


Weight Age
cat 17.5 8.5


Weight Age
hamster 1.0 3.0


Step 3: Combine¶

Finally, it combines these means into a new DataFrame that is indexed by 'Species' and sorted by 'Species' in ascending order.

Weight Age
Species
cat 17.50 8.5
dog 48.33 2.5
hamster 1.00 3.0

Let's try it out!¶

In [29]:
pets = bpd.DataFrame().assign(
    Species=['dog', 'cat', 'cat', 'dog', 'dog', 'hamster'],
    Color=['black', 'golden', 'black', 'white', 'golden', 'golden'],
    Weight=[40, 15, 20, 80, 25, 1],
    Age=[5, 8, 9, 2, 0.5, 3]
)
pets
Out[29]:
Species Color Weight Age
0 dog black 40 5.0
1 cat golden 15 8.0
2 cat black 20 9.0
3 dog white 80 2.0
4 dog golden 25 0.5
5 hamster golden 1 3.0
In [30]:
pets.groupby('Species').mean()
Out[30]:
Weight Age
Species
cat 17.50 8.5
dog 48.33 2.5
hamster 1.00 3.0
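As a quick variation (not in the original lecture), the same split-aggregate-combine steps work with any grouping column and aggregation method. For instance, grouping the same pets DataFrame by 'Color':

# One row per unique 'Color' (black, golden, white), containing the mean
# 'Weight' and 'Age' of the pets of that color. The 'Species' column is
# dropped, since strings can't be averaged.
pets.groupby('Color').mean()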

Back to Get It Done service requests 👷¶

In [31]:
requests
Out[31]:
neighborhood service closed open total
0 Balboa Park Dead Animal 11 0 11
1 Balboa Park Development Services - Code Enforcement 2 0 2
2 Balboa Park Encampment 215 20 235
... ... ... ... ... ...
1418 Via De La Valle Pothole 11 7 18
1419 Via De La Valle Sidewalk Repair Issue 0 1 1
1420 Via De La Valle Street Sweeping 1 0 1

1421 rows × 5 columns

In [32]:
requests.groupby('neighborhood').sum()
Out[32]:
closed open total
neighborhood
Balboa Park 1173 261 1434
Barrio Logan 1185 201 1386
Black Mountain Ranch 151 44 195
... ... ... ...
University 1614 620 2234
Uptown 5654 1691 7345
Via De La Valle 13 12 25

57 rows × 3 columns

Our original goal was to find the neighborhood with the most total requests, so after grouping, we need to sort:

In [33]:
# Note the use of .index – remember, the index isn't a column!
(
    requests
    .groupby('neighborhood')
    .sum()
    .sort_values(by='total', ascending=False)
    .index[0]
)
Out[33]:
'Downtown'

Using .groupby in general¶

In short, .groupby aggregates all rows with the same value in a specified column (e.g. 'neighborhood') into a single row in the resulting DataFrame, using an aggregation method (e.g. .sum()) to combine values.

  1. Choose a column to group by.
    • .groupby(column_name) will gather rows which have the same value in the specified column (column_name).
    • On the previous slide, we grouped by 'neighborhood'.
    • In the resulting DataFrame, there was one row for every unique value of 'neighborhood'.
  2. Choose an aggregation method.
    • The aggregation method will be applied within each group.
    • On the previous slide, we applied the .sum() method to every 'neighborhood'.
    • The aggregation method is applied individually to each column (e.g. the sums were computed separately for 'closed', 'open', and 'total').
      • If it doesn't make sense to use the aggregation method on a column, the column is dropped from the output – we'll look at this in more detail shortly.
    • Common aggregation methods include .count(), .sum(), .mean(), .median(), .max(), and .min().
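For instance, here's a sketch of the same two-step pattern with a different aggregation method (this cell isn't in the original lecture):

# Step 1: group by 'neighborhood'. Step 2: aggregate each group with .mean().
# The result has one row per neighborhood, containing the average number of
# 'closed', 'open', and 'total' requests per service type in that neighborhood.
requests.groupby('neighborhood').mean()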

Observation #1¶

  • The index has changed to neighborhood names.
  • In general, the new row labels are the group labels (i.e., the unique values in the column that we grouped on), sorted in ascending order.
In [34]:
requests
Out[34]:
neighborhood service closed open total
0 Balboa Park Dead Animal 11 0 11
1 Balboa Park Development Services - Code Enforcement 2 0 2
2 Balboa Park Encampment 215 20 235
... ... ... ... ... ...
1418 Via De La Valle Pothole 11 7 18
1419 Via De La Valle Sidewalk Repair Issue 0 1 1
1420 Via De La Valle Street Sweeping 1 0 1

1421 rows × 5 columns

In [35]:
requests.groupby('neighborhood').sum()
Out[35]:
closed open total
neighborhood
Balboa Park 1173 261 1434
Barrio Logan 1185 201 1386
Black Mountain Ranch 151 44 195
... ... ... ...
University 1614 620 2234
Uptown 5654 1691 7345
Via De La Valle 13 12 25

57 rows × 3 columns

Observation #2¶

The 'service' column has disappeared. Why?

In [36]:
requests
Out[36]:
neighborhood service closed open total
0 Balboa Park Dead Animal 11 0 11
1 Balboa Park Development Services - Code Enforcement 2 0 2
2 Balboa Park Encampment 215 20 235
... ... ... ... ... ...
1418 Via De La Valle Pothole 11 7 18
1419 Via De La Valle Sidewalk Repair Issue 0 1 1
1420 Via De La Valle Street Sweeping 1 0 1

1421 rows × 5 columns

In [37]:
requests.groupby('neighborhood').sum()
Out[37]:
closed open total
neighborhood
Balboa Park 1173 261 1434
Barrio Logan 1185 201 1386
Black Mountain Ranch 151 44 195
... ... ... ...
University 1614 620 2234
Uptown 5654 1691 7345
Via De La Valle 13 12 25

57 rows × 3 columns

Disappearing columns ✨🐇🎩¶

  • The aggregation method – .sum(), in this case – is applied to each column.
  • If it doesn't make sense to apply it to a particular column, that column will disappear.
  • For instance, we can't sum strings, like in the 'service' column.
  • However, we can compute the max of several strings. How?
In [38]:
# Can you guess how the max is determined?
requests.groupby('neighborhood').max() 
Out[38]:
service closed open total
neighborhood
Balboa Park Weed Cleanup 343 57 368
Barrio Logan Weed Cleanup 286 48 317
Black Mountain Ranch Tree Maintenance 41 14 44
... ... ... ... ...
University Weed Cleanup 712 249 961
Uptown Weed Cleanup 1422 384 1806
Via De La Valle Street Sweeping 11 7 18

57 rows × 4 columns

Observation #3¶

  • The aggregation method is applied to each column separately.
  • The rows of the resulting DataFrame need to be interpreted with care.
In [39]:
requests.groupby('neighborhood').max()
Out[39]:
service closed open total
neighborhood
Balboa Park Weed Cleanup 343 57 368
Barrio Logan Weed Cleanup 286 48 317
Black Mountain Ranch Tree Maintenance 41 14 44
... ... ... ... ...
University Weed Cleanup 712 249 961
Uptown Weed Cleanup 1422 384 1806
Via De La Valle Street Sweeping 11 7 18

57 rows × 4 columns

Why isn't the 'total' column equal to the sum of the 'closed' and 'open' columns, as it originally was?

In [40]:
# Why don't these numbers match those in the grouped DataFrame?
requests[(requests.get('neighborhood') == 'Balboa Park') & (requests.get('service') == 'Weed Cleanup')]
Out[40]:
neighborhood service closed open total
27 Balboa Park Weed Cleanup 1 0 1

Example: Number of different services¶

How do we find the number of different services requested in each neighborhood?

As always when using groupby, there are two steps:

  1. Choose a column to group by.
    • Here, 'neighborhood' seems like a good choice.
  2. Choose an aggregation method.
    • Common aggregation methods include .count(), .sum(), .mean(), .median(), .max(), and .min().
In [41]:
# How many different services have been requested in the 'University' neighborhood?
requests[requests.get('neighborhood') == 'University']
Out[41]:
neighborhood service closed open total
1354 University Dead Animal 25 0 25
1355 University Development Services - Code Enforcement 7 2 9
1356 University Encampment 55 27 82
... ... ... ... ... ...
1381 University Tree Maintenance 47 8 55
1382 University Waste on Private Property 5 1 6
1383 University Weed Cleanup 1 4 5

30 rows × 5 columns

In [42]:
# How do we find this result for every neighborhood?

Observation #4¶

The column names of the output of .groupby don't make sense when using the .count() aggregation method.

In [43]:
num_diff_services = requests.groupby('neighborhood').count()
num_diff_services
Out[43]:
service closed open total
neighborhood
Balboa Park 28 28 28 28
Barrio Logan 28 28 28 28
Black Mountain Ranch 24 24 24 24
... ... ... ... ...
University 30 30 30 30
Uptown 31 31 31 31
Via De La Valle 6 6 6 6

57 rows × 4 columns

Consider dropping unneeded columns and renaming columns as follows:

  1. Use .assign to create a new column containing the same values as the old column(s).
  2. Use .drop(columns=list_of_column_labels) to drop the old column(s). Alternatively, use .get(list_of_column_labels) to keep only the columns in the given list. The columns will appear in the order you specify, so this is also useful for reordering columns! (A sketch of this alternative appears after the next cell.)
In [44]:
num_diff_services = num_diff_services.assign(
                    count_of_services=num_diff_services.get('open')
                    ).drop(columns=['service', 'closed', 'open', 'total'])
num_diff_services
Out[44]:
count_of_services
neighborhood
Balboa Park 28
Barrio Logan 28
Black Mountain Ranch 24
... ...
University 30
Uptown 31
Via De La Valle 6

57 rows × 1 columns
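The .get alternative mentioned in step 2 could look like this – a sketch that should be equivalent in effect to the .assign/.drop approach above:

# Recompute the per-neighborhood counts, label one of the (identical) count
# columns 'count_of_services', and keep only that column. Passing a list to
# .get returns a DataFrame with just the listed columns, in the given order.
counts = requests.groupby('neighborhood').count()
counts.assign(count_of_services=counts.get('open')).get(['count_of_services'])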

More practice: IMDb dataset 🎞️¶

Challenge problems!¶

We won't cover this section in class. Instead, it's here for you to practice with some harder examples.

The video below walks through the solutions (it's also linked here). You can also see the solutions by clicking the "✅ Click here to see the answer." button below each question.

In [45]:
from IPython.display import YouTubeVideo
YouTubeVideo('xg7rnjWnZ48')
Out[45]:

Before watching the video or looking at the solutions, make sure to try these problems on your own – they're great prep for homeworks, projects, and exams! Feel free to ask about them in office hours or on Ed.

In [46]:
imdb = bpd.read_csv('data/imdb.csv').set_index('Title').sort_values(by='Rating')
imdb
Out[46]:
Votes Rating Year Decade
Title
Akira 91652 8.0 1988 1980
Per un pugno di dollari 124671 8.0 1964 1960
Guardians of the Galaxy 527349 8.0 2014 2010
... ... ... ... ...
The Godfather: Part II 692753 9.0 1974 1970
The Shawshank Redemption 1498733 9.2 1994 1990
The Godfather 1027398 9.2 1972 1970

250 rows × 4 columns

Question: How many movies appear from each decade?¶

In [47]:
imdb.groupby('Decade').count()
Out[47]:
Votes Rating Year
Decade
1920 4 4 4
1930 7 7 7
1940 14 14 14
... ... ... ...
1990 42 42 42
2000 50 50 50
2010 29 29 29

10 rows × 3 columns

In [48]:
# We'll learn how to make plots like this in the next lecture!
imdb.groupby('Decade').count().plot(y='Year');

Question: What was the highest rated movie of the 1990s?¶

Let's try to do this two different ways.

Without grouping¶

In [49]:
imdb[imdb.get('Decade') == 1990].sort_values('Rating', ascending=False).index[0]
Out[49]:
'The Shawshank Redemption'

Note: The command to extract the index of a DataFrame is .index – no parentheses! This is different from the way we extract columns, with .get(), because the index is not a column.
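A minimal illustration of the difference, using the imdb DataFrame from above:

imdb.index          # the row labels (movie titles) – an attribute, no parentheses
imdb.get('Rating')  # a column – retrieved with the .get() method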

With grouping¶

In [50]:
imdb.reset_index().groupby('Decade').max()
Out[50]:
Title Votes Rating Year
Decade
1920 The Kid 98794 8.3 1927
1930 The Wizard of Oz 259235 8.5 1939
1940 The Treasure of the Sierra Madre 350551 8.6 1949
... ... ... ... ...
1990 Unforgiven 1498733 9.2 1999
2000 Yip Man 1473049 8.9 2009
2010 X-Men: Days of Future Past 1271949 8.7 2015

10 rows × 4 columns

  • It turns out that this method does not yield the correct answer.
  • When we use an aggregation method (e.g. .max()), aggregation is done to each column individually.
  • While it's true that the highest rated movie from the 1990s has a rating of 9.2, that movie is not Unforgiven – instead, Unforgiven is the movie that's the latest in the alphabet among all movies from the 1990s.
  • Taking the max is not helpful here.
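As a quick check (a sketch, not in the original notebook), the 1990s movie with the 9.2 rating is The Shawshank Redemption, not Unforgiven:

# Query for the 1990s movie whose rating matches the decade's maximum of 9.2.
imdb[(imdb.get('Decade') == 1990) & (imdb.get('Rating') == 9.2)]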

Question: How many years have more than 3 movies rated above 8.5?¶

In [ ]:
 
✅ Click here to see the answer.
good_movies_per_year = imdb[imdb.get('Rating') > 8.5].groupby('Year').count()
good_movies_per_year[good_movies_per_year.get('Votes') > 3].shape[0]    
As explained in the aside below, you can also use:
(good_movies_per_year.get('Votes') > 3).sum() 

Aside: Using .sum() on a boolean array¶

  • Summing a boolean array gives a count of the number of True elements because Python treats True as 1 and False as 0.
  • Can you use that fact here?
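A minimal illustration of this fact:

# True counts as 1 and False counts as 0, so the sum is the number of Trues.
np.array([True, False, True, True]).sum()   # evaluates to 3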
In [ ]:
 

Question: Out of the years with more than 3 movies, which had the highest average rating?¶

In [ ]:
 
✅ Click here to see the answer.
more_than_3_ix = imdb.groupby('Year').count().get('Votes') > 3
imdb.groupby('Year').mean()[more_than_3_ix].sort_values(by='Rating').index[-1]

Question: Which year had the longest movie titles, on average?¶

Hint: Use .str.len() on the column or index that contains the names of the movies.

In [ ]:
 
✅ Click here to see the answer.
(
    imdb.assign(title_length=imdb.index.str.len())
    .groupby('Year').mean()
    .sort_values(by='title_length')
    .index[-1]
)
The year is 1964 – take a look at the movies from 1964 by querying!

Question: What is the average rating of movies from years that had at least 3 movies in the Top 250?¶

In [ ]:
 
✅ Click here to see the answer.
# A Series of Trues and Falses; True when there were at least 3 movies on the list from that year
at_least_3_ix = imdb.groupby('Year').count().get('Votes') >= 3

# The sum of the ratings of movies from years that had at least 3 movies on the list
total_rating = imdb.groupby('Year').sum()[at_least_3_ix].get('Rating').sum()

# The total number of movies from years that had at least 3 movies on the list
count = imdb.groupby('Year').count()[at_least_3_ix].get('Rating').sum()

# The correct answer
average_rating = total_rating / count

# Close, but incorrect:
# Doesn't account for the fact that different years have different numbers of movies on the list
close_but_wrong = imdb.groupby('Year').mean()[at_least_3_ix].get('Rating').mean()

Summary, next time¶

Summary¶

  • We can write queries that involve multiple conditions, as long as we:
    • Put parentheses around all conditions.
    • Separate conditions using & if you require all to be true, or | if you require at least one to be true.
  • The method call df.groupby(column_name).agg_method() aggregates all rows with the same value for column_name into a single row in the resulting DataFrame, using agg_method() to combine values.
    • Common aggregation methods include .count(), .sum(), .mean(), .median(), .max(), and .min().
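As a final sketch combining both ideas (not from the lecture itself) – counting the 'Pothole' or 'Dead Animal' requests in each neighborhood:

# Query with two conditions, then group the matching rows by 'neighborhood'
# and sum the numeric columns within each group.
requests[(requests.get('service') == 'Pothole') |
         (requests.get('service') == 'Dead Animal')].groupby('neighborhood').sum()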

Next time¶

A picture is worth a 1000 words – it's time to visualize!