import pandas as pd
import numpy as np
import warnings
warnings.simplefilter('ignore')
DataFrameGroupBy
objects and aggregation.DataFrameGroupBy
methods.Name | Assignment | Score |
---|---|---|
Billy | Homework 1 | 94 |
Sally | Homework 1 | 98 |
Molly | Homework 1 | 82 |
Sally | Homework 2 | 47 |
Student Name | Quarter | Course | Instructor | Recommend? | Expected Grade | Hours Per Week | Comments |
---|---|---|---|---|---|---|---|
Billy | WI23 | DSC 80 | Suraj Rampure | No | A- | 14 | I hate this class |
Billy | WI23 | DSC 40B | Justin Eldridge | Yes | B+ | 9 | go big O |
Sally | WI23 | DSC 10 | Janine Tiefenbruck | Yes | A | 11 | baby pandas are so cute |
Molly | WI23 | DSC 80 | Suraj Rampure | Yes | A+ | 2 | I wish there was music in class |
Molly | WI23 | DSC 95 | Marina Langlois | No | A | 3 | I loved DSC 30, but 95 wasn't hard enough :( |
Only instructors can see individual responses. At cape.ucsd.edu, overall class statistics are visible.
Quarter | Course | Instructor | Recommend (%) | Expected Grade | Hours Per Week |
---|---|---|---|---|---|
WI23 | DSC 80 | Suraj Rampure | 6% | 3.15 (B) | 13.32 |
WI23 | DSC 40B | Justin Eldridge | 89% | 3.35 (B+) | 8.54 |
WI23 | DSC 10 | Janine Tiefenbruck | 94% | 3.45 (B+) | 11.49 |
WI23 | DSC 95 | Marina Langlois | 91% | 4.0 (A) | 9.21 |
The university may be interested in looking at CAPEs results by department.
Quarter | Department | Recommend (%) | Expected Grade | Hours Per Week |
---|---|---|---|---|
WI23 | DSC | 91% | 3.01 (B) | 12.29 |
WI23 | BILD | 85% | 2.78 (C+) | 13.21 |
Prospective students may be interested in comparing course evaluations across different universities.
University | Recommend (%) | Average GPA | Hours Per Week |
---|---|---|---|
UC San Diego | 94% | 3.12 (B) | 42.19 |
UC Irvine | 89% | 3.15 (B) | 38.44 |
SDSU | 88% | 2.99 (B-) | 36.89 |
The dataset we'll work with for the rest of the lecture involves various measurements taken of three species of penguins in Antarctica.
import seaborn as sns
penguins = sns.load_dataset('penguins').dropna()
penguins.head()
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |
penguins['species'].value_counts()
Adelie 146 Gentoo 119 Chinstrap 68 Name: species, dtype: int64
penguins['island'].value_counts()
Biscoe 163 Dream 123 Torgersen 47 Name: island, dtype: int64
Create a Series, indexed by 'species'
, that contains the mean 'body_mass_g'
of each 'species'
.
penguins.head()
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |
species_map = pd.Series([], dtype=float)
for species in penguins['species'].unique():
species_only = penguins.loc[penguins['species'] == species]
species_map.loc[species] = species_only['body_mass_g'].mean()
species_map
Adelie 3706.164384 Chinstrap 3733.088235 Gentoo 5092.436975 dtype: float64
'species'
, we make a pass through the entire dataset.penguins.head()
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |
penguins.groupby('species').mean()
bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | |
---|---|---|---|---|
species | ||||
Adelie | 38.823973 | 18.347260 | 190.102740 | 3706.164384 |
Chinstrap | 48.833824 | 18.420588 | 195.823529 | 3733.088235 |
Gentoo | 47.568067 | 14.996639 | 217.235294 | 5092.436975 |
Somehow, the groupby
method computes what we're looking for in just one line. How?
Pandas Tutor is the equivalent of Python Tutor, which you saw in DSC 20, but for pandas
.
You can use it by:
pandas
code directly.!pip install pandas-tutor
), enabling it in your notebook (%reload_ext pandas_tutor
) and using the %%pt
cell magic to visualize the last expression in a cell. We'll do the latter, since it doesn't require us leaving our Jupyter Notebook (and because the website doesn't allow us to upload datasets).
%reload_ext pandas_tutor
%%pt
penguins.groupby('species').mean()
The groupby
method involves three steps: split, apply, and combine. This is the same terminology that the pandas
documentation uses.
Before we dive deep into the internals, let's look at a few more examples.
penguins.head()
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |
Which 'species'
has the highest median 'bill_length_mm'
?
penguins.groupby('species').median()['bill_length_mm'].idxmax()
'Chinstrap'
What proportion of penguins of each 'species'
live on 'Dream'
island?
(
penguins.assign(on_Dream=penguins['island'] == 'Dream')
.groupby('species')
.mean()['on_Dream']
)
species Adelie 0.376712 Chinstrap 1.000000 Gentoo 0.000000 Name: on_Dream, dtype: float64
Note that groupby
is a declarative operation – the user just specifies what computation needs to be done, and pandas
figures out how to do it under the hood.
DataFrameGroupBy
objects and aggregation¶DataFrameGroupBy
objects¶We've just evaluated a few expressions of the following form.
penguins.groupby('species').mean()
bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | |
---|---|---|---|---|
species | ||||
Adelie | 38.823973 | 18.347260 | 190.102740 | 3706.164384 |
Chinstrap | 48.833824 | 18.420588 | 195.823529 | 3733.088235 |
Gentoo | 47.568067 | 14.996639 | 217.235294 | 5092.436975 |
There are two method calls in the expression above: .groupby('species')
and .mean()
. What happens if we remove the latter?
penguins.groupby('species')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7ff2a92b2e80>
If df
is a DataFrame, then df.groupby(key)
returns a DataFrameGroupBy
object.
This object represents the "split" in "split-apply-combine".
# Creates one group for each unique value in the species column.
penguin_groups = penguins.groupby('species')
penguin_groups
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7ff2a932b6a0>
DataFrameGroupBy
objects have a groups
attribute, which is a dictionary in which the keys are group names and the values are lists of row labels.
penguin_groups.groups
{'Adelie': [0, 1, 2, 4, 5, 6, 7, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, ...], 'Chinstrap': [152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219], 'Gentoo': [220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, ...]}
DataFrameGroupBy
objects also have a get_group(key)
method, which returns a DataFrame with only the values for the given key.
penguin_groups.get_group('Chinstrap')
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
152 | Chinstrap | Dream | 46.5 | 17.9 | 192.0 | 3500.0 | Female |
153 | Chinstrap | Dream | 50.0 | 19.5 | 196.0 | 3900.0 | Male |
154 | Chinstrap | Dream | 51.3 | 19.2 | 193.0 | 3650.0 | Male |
155 | Chinstrap | Dream | 45.4 | 18.7 | 188.0 | 3525.0 | Female |
156 | Chinstrap | Dream | 52.7 | 19.8 | 197.0 | 3725.0 | Male |
... | ... | ... | ... | ... | ... | ... | ... |
215 | Chinstrap | Dream | 55.8 | 19.8 | 207.0 | 4000.0 | Male |
216 | Chinstrap | Dream | 43.5 | 18.1 | 202.0 | 3400.0 | Female |
217 | Chinstrap | Dream | 49.6 | 18.2 | 193.0 | 3775.0 | Male |
218 | Chinstrap | Dream | 50.8 | 19.0 | 210.0 | 4100.0 | Male |
219 | Chinstrap | Dream | 50.2 | 18.7 | 198.0 | 3775.0 | Female |
68 rows × 7 columns
# Same as the above!
penguins[penguins['species'] == 'Chinstrap']
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
152 | Chinstrap | Dream | 46.5 | 17.9 | 192.0 | 3500.0 | Female |
153 | Chinstrap | Dream | 50.0 | 19.5 | 196.0 | 3900.0 | Male |
154 | Chinstrap | Dream | 51.3 | 19.2 | 193.0 | 3650.0 | Male |
155 | Chinstrap | Dream | 45.4 | 18.7 | 188.0 | 3525.0 | Female |
156 | Chinstrap | Dream | 52.7 | 19.8 | 197.0 | 3725.0 | Male |
... | ... | ... | ... | ... | ... | ... | ... |
215 | Chinstrap | Dream | 55.8 | 19.8 | 207.0 | 4000.0 | Male |
216 | Chinstrap | Dream | 43.5 | 18.1 | 202.0 | 3400.0 | Female |
217 | Chinstrap | Dream | 49.6 | 18.2 | 193.0 | 3775.0 | Male |
218 | Chinstrap | Dream | 50.8 | 19.0 | 210.0 | 4100.0 | Male |
219 | Chinstrap | Dream | 50.2 | 18.7 | 198.0 | 3775.0 | Female |
68 rows × 7 columns
We usually don't use these attributes and methods, but they're useful in understanding how groupby
works under the hood.
DataFrameGroupBy
object, we need to apply some function to each group, and combine the results.DataFrameGroupBy
object, e.g. .mean()
, .max()
, or .median()
.Let's look at some examples.
penguins.head()
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |
penguins.groupby('species').mean()
bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | |
---|---|---|---|---|
species | ||||
Adelie | 38.823973 | 18.347260 | 190.102740 | 3706.164384 |
Chinstrap | 48.833824 | 18.420588 | 195.823529 | 3733.088235 |
Gentoo | 47.568067 | 14.996639 | 217.235294 | 5092.436975 |
penguins.groupby('species').sum()
bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | |
---|---|---|---|---|
species | ||||
Adelie | 5668.3 | 2678.7 | 27755.0 | 541100.0 |
Chinstrap | 3320.7 | 1252.6 | 13316.0 | 253850.0 |
Gentoo | 5660.6 | 1784.6 | 25851.0 | 606000.0 |
penguins.groupby('species').last()
island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|
species | ||||||
Adelie | Dream | 41.5 | 18.5 | 201.0 | 4000.0 | Male |
Chinstrap | Dream | 50.2 | 18.7 | 198.0 | 3775.0 | Female |
Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
penguins.groupby('species').max()
island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|
species | ||||||
Adelie | Torgersen | 46.0 | 21.5 | 210.0 | 4775.0 | Male |
Chinstrap | Dream | 58.0 | 20.8 | 212.0 | 4800.0 | Male |
Gentoo | Biscoe | 59.6 | 17.3 | 231.0 | 6300.0 | Male |
Within each group, the aggregation method is applied to each column independently.
penguins.groupby('species').max()
island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|
species | ||||||
Adelie | Torgersen | 46.0 | 21.5 | 210.0 | 4775.0 | Male |
Chinstrap | Dream | 58.0 | 20.8 | 212.0 | 4800.0 | Male |
Gentoo | Biscoe | 59.6 | 17.3 | 231.0 | 6300.0 | Male |
The above result tells us that:
'island'
among 'Adelie'
penguins is 'Torgersen'
.'body_mass_g'
among 'Adelie'
penguins is 4775.0
.It is not telling us that there is an 'Adelie'
penguin on 'Torgersen'
island with a 'body_mass_g'
of 4775.0
!
# This penguin is on Biscoe island, not Torgersen!
penguins.loc[(penguins['species'] == 'Adelie') & (penguins['body_mass_g'] == 4775.0)]
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
109 | Adelie | Biscoe | 43.2 | 19.0 | 197.0 | 4775.0 | Male |
Find the 'island'
on which the heaviest penguin of each 'species'
lives.
penguins.sort_values('body_mass_g', ascending=False).groupby('species').first()
island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|
species | ||||||
Adelie | Biscoe | 43.2 | 19.0 | 197.0 | 4775.0 | Male |
Chinstrap | Dream | 52.0 | 20.7 | 210.0 | 4800.0 | Male |
Gentoo | Biscoe | 49.2 | 15.2 | 221.0 | 6300.0 | Male |
max
and min
are defined on strings, while median
and mean
are not.DataFrameGroupBy
objects support []
notation, just like DataFrame
s.penguins.groupby('species').mean()
bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | |
---|---|---|---|---|
species | ||||
Adelie | 38.823973 | 18.347260 | 190.102740 | 3706.164384 |
Chinstrap | 48.833824 | 18.420588 | 195.823529 | 3733.088235 |
Gentoo | 47.568067 | 14.996639 | 217.235294 | 5092.436975 |
# Works, but involves wasted effort since the other columns had to be aggregated for no reason.
penguins.groupby('species').mean()['bill_length_mm']
species Adelie 38.823973 Chinstrap 48.833824 Gentoo 47.568067 Name: bill_length_mm, dtype: float64
# This is a SeriesGroupBy object!
penguins.groupby('species')['bill_length_mm']
<pandas.core.groupby.generic.SeriesGroupBy object at 0x7ff2cbc18f40>
# Saves time!
penguins.groupby('species')['bill_length_mm'].mean()
species Adelie 38.823973 Chinstrap 48.833824 Gentoo 47.568067 Name: bill_length_mm, dtype: float64
To demonstrate that the former is slower than the latter, we can use %%timeit
. For reference, we'll also include our earlier for
-loop-based solution.
%%timeit
penguins.groupby('species').mean()['bill_length_mm']
452 µs ± 1.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
penguins.groupby('species')['bill_length_mm'].mean()
194 µs ± 901 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
species_map = pd.Series([], dtype=float)
for species in penguins['species'].unique():
species_only = penguins.loc[penguins['species'] == species]
species_map.loc[species] = species_only['body_mass_g'].mean()
species_map
1.3 ms ± 3.19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
It's important to understand what each piece of your code evaluates to – in the first two timed examples, the code is almost identical, but the performance is quite different.
# Slower
penguins.groupby('species').mean()['bill_length_mm']
# Faster
penguins.groupby('species')['bill_length_mm'].mean()
groupby
method is much quicker than for
-looping over the DataFrame in Python. It can often produce results using just a single, fast pass over the data, updating the sum, mean, count, min, or other aggregate for each group along the way.pandas
?aggregate
method¶DataFrameGroupBy
object has a general aggregate
method, which aggregates using one or more operations.aggregate
; refer to the documentation for a comprehensive list.agg
is an alias for aggregate
.How many penguins are there of each 'species'
, and what is the mean 'body_mass_g'
of each species?
penguins.groupby('species')['body_mass_g'].aggregate(['count', 'mean'])
count | mean | |
---|---|---|
species | ||
Adelie | 146 | 3706.164384 |
Chinstrap | 68 | 3733.088235 |
Gentoo | 119 | 5092.436975 |
Note what happens when we don't select a column before aggregating.
penguins.groupby('species').aggregate(['count', 'mean'])
bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | |||||
---|---|---|---|---|---|---|---|---|
count | mean | count | mean | count | mean | count | mean | |
species | ||||||||
Adelie | 146 | 38.823973 | 146 | 18.347260 | 146 | 190.102740 | 146 | 3706.164384 |
Chinstrap | 68 | 48.833824 | 68 | 18.420588 | 68 | 195.823529 | 68 | 3733.088235 |
Gentoo | 119 | 47.568067 | 119 | 14.996639 | 119 | 217.235294 | 119 | 5092.436975 |
What is the maximum 'bill_length_mm'
of each species, and which 'island'
s is each 'species'
found on?
penguins.groupby('species').aggregate({'bill_length_mm': 'max', 'island': 'unique'})
bill_length_mm | island | |
---|---|---|
species | ||
Adelie | 46.0 | [Torgersen, Biscoe, Dream] |
Chinstrap | 58.0 | [Dream] |
Gentoo | 59.6 | [Biscoe] |
What is the interquartile range of the 'body_mass_g'
of each 'species'
?
(
penguins.groupby('species')['body_mass_g']
.aggregate(lambda col: np.percentile(col, 75) - np.percentile(col, 25))
)
species Adelie 637.5 Chinstrap 462.5 Gentoo 800.0 Name: body_mass_g, dtype: float64
DataFrameGroupBy
methods¶When we introduced the split-apply-combine pattern, the "apply" step involved aggregation – our final DataFrame had one row for each group.
Instead of aggregating during the apply step, we could instead perform a:
'body_mass_g'
column to to z-scores (i.e. standard units), separately for each 'species'
:transform
method on a DataFrameGroupBy
object. The transform
method takes in a function, which itself takes in a Series and returns a new Series.z_score = lambda x: (x - x.mean()) / x.std(ddof=0)
penguins.head()
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |
z_scored = penguins.groupby('species').transform(z_score)
z_scored.head()
bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | |
---|---|---|---|---|
0 | 0.104025 | 0.290284 | -1.400540 | 0.095911 |
1 | 0.254772 | -0.779539 | -0.631244 | 0.205309 |
2 | 0.556265 | -0.285774 | 0.753488 | -0.998069 |
4 | -0.800453 | 0.784048 | 0.445770 | -0.560477 |
5 | 0.179399 | 1.853870 | -0.015807 | -0.122885 |
np.random.seed(1)
penguins.assign(z_mass=z_scored['body_mass_g']).sample(5)
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | z_mass | |
---|---|---|---|---|---|---|---|---|
65 | Adelie | Biscoe | 41.6 | 18.0 | 192.0 | 3950.0 | Male | 0.533502 |
276 | Gentoo | Biscoe | 43.8 | 13.9 | 208.0 | 4300.0 | Female | -1.586890 |
186 | Chinstrap | Dream | 49.7 | 18.6 | 195.0 | 3600.0 | Male | -0.348856 |
198 | Chinstrap | Dream | 50.1 | 17.9 | 190.0 | 3400.0 | Female | -0.873105 |
293 | Gentoo | Biscoe | 46.5 | 14.8 | 217.0 | 5200.0 | Female | 0.215400 |
Note that below, penguin 276 has a larger 'body_mass_g'
than penguin 65, but a lower 'z_mass'
.
'body_mass_g'
among 'Adelie'
penguins.'body_mass_g'
among 'Gentoo'
penguins. Remember from earlier that the average 'body_mass_g'
of 'Gentoo'
penguins is much higher than for other species.filter
method on a DataFrameGroupBy
object.filter
method takes in a function, which itself takes in a DataFrame/Series and return a single Boolean. The result is a new DataFrame/Series with only the groups for which the filter function returned True
.For example, suppose we want only the 'species'
whose 'bill_length_mm'
is above 39.
penguins.head()
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |
penguins.groupby('species').filter(lambda df: df['bill_length_mm'].mean() > 39)
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
152 | Chinstrap | Dream | 46.5 | 17.9 | 192.0 | 3500.0 | Female |
153 | Chinstrap | Dream | 50.0 | 19.5 | 196.0 | 3900.0 | Male |
154 | Chinstrap | Dream | 51.3 | 19.2 | 193.0 | 3650.0 | Male |
155 | Chinstrap | Dream | 45.4 | 18.7 | 188.0 | 3525.0 | Female |
156 | Chinstrap | Dream | 52.7 | 19.8 | 197.0 | 3725.0 | Male |
... | ... | ... | ... | ... | ... | ... | ... |
338 | Gentoo | Biscoe | 47.2 | 13.7 | 214.0 | 4925.0 | Female |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
187 rows × 7 columns
No more 'Adelie'
s!
Or, as another example, suppose we only want the 'species'
with at least 100 penguins:
penguins.groupby('species').filter(lambda df: df.shape[0] > 100)
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |
... | ... | ... | ... | ... | ... | ... | ... |
338 | Gentoo | Biscoe | 47.2 | 13.7 | 214.0 | 4925.0 | Female |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
265 rows × 7 columns
No more 'Chinstrap'
s!
When we group with multiple columns, one group is created for every unique combination of elements in the specified columns.
species_and_island = penguins.groupby(['species', 'island']).mean()
species_and_island
bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | ||
---|---|---|---|---|---|
species | island | ||||
Adelie | Biscoe | 38.975000 | 18.370455 | 188.795455 | 3709.659091 |
Dream | 38.520000 | 18.240000 | 189.927273 | 3701.363636 | |
Torgersen | 39.038298 | 18.451064 | 191.531915 | 3708.510638 | |
Chinstrap | Dream | 48.833824 | 18.420588 | 195.823529 | 3733.088235 |
Gentoo | Biscoe | 47.568067 | 14.996639 | 217.235294 | 5092.436975 |
groupby
method creates an index based on the specified columns.MultiIndex
.MultiIndex
, use reset_index
or set as_index=False
in groupby
.species_and_island
bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | ||
---|---|---|---|---|---|
species | island | ||||
Adelie | Biscoe | 38.975000 | 18.370455 | 188.795455 | 3709.659091 |
Dream | 38.520000 | 18.240000 | 189.927273 | 3701.363636 | |
Torgersen | 39.038298 | 18.451064 | 191.531915 | 3708.510638 | |
Chinstrap | Dream | 48.833824 | 18.420588 | 195.823529 | 3733.088235 |
Gentoo | Biscoe | 47.568067 | 14.996639 | 217.235294 | 5092.436975 |
species_and_island['body_mass_g']
species island Adelie Biscoe 3709.659091 Dream 3701.363636 Torgersen 3708.510638 Chinstrap Dream 3733.088235 Gentoo Biscoe 5092.436975 Name: body_mass_g, dtype: float64
species_and_island.loc['Adelie']
bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | |
---|---|---|---|---|
island | ||||
Biscoe | 38.975000 | 18.370455 | 188.795455 | 3709.659091 |
Dream | 38.520000 | 18.240000 | 189.927273 | 3701.363636 |
Torgersen | 39.038298 | 18.451064 | 191.531915 | 3708.510638 |
species_and_island.loc[('Adelie', 'Torgersen')]
bill_length_mm 39.038298 bill_depth_mm 18.451064 flipper_length_mm 191.531915 body_mass_g 3708.510638 Name: (Adelie, Torgersen), dtype: float64
species_and_island.reset_index()
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | |
---|---|---|---|---|---|---|
0 | Adelie | Biscoe | 38.975000 | 18.370455 | 188.795455 | 3709.659091 |
1 | Adelie | Dream | 38.520000 | 18.240000 | 189.927273 | 3701.363636 |
2 | Adelie | Torgersen | 39.038298 | 18.451064 | 191.531915 | 3708.510638 |
3 | Chinstrap | Dream | 48.833824 | 18.420588 | 195.823529 | 3733.088235 |
4 | Gentoo | Biscoe | 47.568067 | 14.996639 | 217.235294 | 5092.436975 |
penguins.groupby(['species', 'island'], as_index=False).mean()
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | |
---|---|---|---|---|---|---|
0 | Adelie | Biscoe | 38.975000 | 18.370455 | 188.795455 | 3709.659091 |
1 | Adelie | Dream | 38.520000 | 18.240000 | 189.927273 | 3701.363636 |
2 | Adelie | Torgersen | 39.038298 | 18.451064 | 191.531915 | 3708.510638 |
3 | Chinstrap | Dream | 48.833824 | 18.420588 | 195.823529 | 3733.088235 |
4 | Gentoo | Biscoe | 47.568067 | 14.996639 | 217.235294 | 5092.436975 |
groupby
method returns a DataFrameGroupBy
method, which creates one group for every unique combination of values in the column(s) being grouped on.DataFrameGroupBy
object, but we can also use transform
, filter
, or the more general apply
methods. Each one of these methods acts on each group individually.pivot
and pivot_table
. Simpson's paradox.