# You'll start seeing this cell in most lectures.
# It exists to hide all of the import statements and other setup
# code we need in lecture notebooks.
from dsc80_utils import *
%reload_ext pandas_tutor
%set_pandas_tutor_options {"maxDisplayCols": 8, "nohover": True, "projectorMode": True}
Lecture 2 – DataFrame Fundamentals¶
Agenda¶
- numpy arrays.
- From babypandas to pandas.
- Deep dive into DataFrames:
  - Accessing subsets of rows and columns in DataFrames.
  - .loc and .iloc.
  - Querying (i.e. filtering).
  - Adding and modifying columns.
- pandas and numpy.
We can't cover every single detail! The pandas documentation will be your friend.
numpy arrays¶
numpy overview¶
numpy stands for "numerical Python". It is a commonly-used Python module that enables fast computation involving arrays and matrices.

numpy's main object is the array. In numpy, arrays are:

- Homogeneous – all values are of the same type.
- (Potentially) multi-dimensional.

Computation in numpy is fast because:

- Much of it is implemented in C.
- numpy arrays are stored more efficiently in memory than, say, Python lists.

This site provides a good overview of numpy arrays.
We use numpy to work with sequences of data:
arr = np.arange(10)
arr
# The shape (10,) means that the array only has a single dimension,
# of size 10.
arr.shape
2 ** arr
Arrays come equipped with several handy methods; some examples are below, but you can read about them all here.
(2 ** arr).sum()
(2 ** arr).mean()
(2 ** arr).max()
(2 ** arr).argmax()
⚠️ The dangers of for-loops¶
- for-loops are slow when processing large datasets. You will rarely write for-loops (except for Lab 1 and Project 1), and may be penalized on assignments for using them when unnecessary!
- One of the biggest benefits of numpy is that it supports vectorized operations.
  - If a and b are two arrays of the same length, then a + b is a new array of the same length containing the element-wise sum of a and b, as illustrated below.
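For instance, here's a minimal sketch of a vectorized operation, using two small made-up arrays:

a = np.array([1, 2, 3])
b = np.array([10, 20, 30])
# The element-wise sum is computed without an explicit Python loop.
a + b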
To illustrate how much faster numpy arithmetic is than using a for-loop, let's compute the squares of the numbers between 0 and 1,000,000:

- Using a for-loop.
- Using vectorized arithmetic, through numpy.
%%timeit
squares = []
for i in range(1_000_000):
squares.append(i * i)
In vanilla Python, this takes about 0.04 seconds per loop.
%%timeit
squares = np.arange(1_000_000) ** 2
In numpy, this only takes about 0.001 seconds per loop, more than 40x faster! Note that under the hood, numpy is also using a for-loop, but it's a for-loop implemented in C, which is much faster than Python.
Multi-dimensional arrays¶
While we didn't see these very often in DSC 10, multi-dimensional lists/arrays may have since come up in DSC 20, 30, or 40A (especially in the context of linear algebra).
We'll spend a bit of time talking about 2D (and 3D) arrays here, since in some ways, they behave similarly to DataFrames.
Below, we create a 2D array from scratch.
nums = np.array([
[5, 1, 9, 7],
[9, 8, 2, 3],
[2, 5, 0, 4]
])
nums
# nums has 3 rows and 4 columns.
nums.shape
We can also create 2D arrays by reshaping other arrays.
# Here, we're asking to reshape np.arange(1, 7)
# so that it has 2 rows and 3 columns.
a = np.arange(1, 7).reshape((2, 3))
a
Operations along axes¶
In 2D arrays (and DataFrames), axis 0 refers to the rows (up and down) and axis 1 refers to the columns (left and right).

a
If we specify axis=0, a.sum will "compress" along axis 0.
a.sum(axis=0)
If we specify axis=1, a.sum will "compress" along axis 1.
a.sum(axis=1)
Selecting rows and columns from 2D arrays¶
You can use [square brackets] to slice rows and columns out of an array, using the same slicing conventions you saw in DSC 20.
a
# Accesses row 0 and all columns.
a[0, :]
# Same as the above.
a[0]
# Accesses all rows and column 1.
a[:, 1]
# Accesses row 0 and columns 1 and onwards.
a[0, 1:]
Question 🤔
Try and predict the value of grid[-1, 1:].sum() without running the code below.
s = (5, 3)
grid = np.ones(s) * 2 * np.arange(1, 16).reshape(s)
# grid[-1, 1:].sum()
Example: Image processing¶
numpy arrays are homogeneous and potentially multi-dimensional.
It turns out that images can be represented as 3D numpy arrays. The color of each pixel can be described with three numbers under the RGB model – a red value, green value, and blue value. Each of these can vary from 0 to 1.
(image source)

from PIL import Image
img_path = Path('imgs') / 'bentley.jpg'
img = np.asarray(Image.open(img_path)) / 255
img
img.shape
plt.imshow(img)
plt.axis('off');
Applying a greyscale filter¶
One way to convert an image to greyscale is to average its red, green, and blue values.
mean_2d = img.mean(axis=2)
mean_2d
But mean_2d is a 2D array, so plt.imshow treats it as a single channel of values and applies its default colormap – it doesn't display as grey!
plt.imshow(mean_2d)
plt.axis('off');
We need to repeat mean_2d three times along axis 2, to use the same values for the red, green, and blue channels. np.repeat will help us here.
# np.newaxis is an alias for None.
# It helps us introduce an additional axis.
np.arange(5)[:, np.newaxis]
np.repeat(np.arange(5)[:, np.newaxis], 3, axis=1)
mean_3d = np.repeat(mean_2d[:, :, np.newaxis], 3, axis=2)
plt.imshow(mean_3d)
plt.axis('off');
Applying a sepia filter¶
Let's sepia-fy Junior!
(Image credits)
From here, we can apply this conversion to each pixel.
$$\begin{align*} R_{\text{sepia}} &= 0.393R + 0.769G + 0.189B \\ G_{\text{sepia}} &= 0.349R + 0.686G + 0.168B \\ B_{\text{sepia}} &= 0.272R + 0.534G + 0.131B \end{align*}$$

sepia_filter = np.array([
[0.393, 0.769, 0.189],
[0.349, 0.686, 0.168],
[0.272, 0.534, 0.131]
])
# Multiplies each pixel by the sepia_filter matrix.
# Then, clips each RGB value to be between 0 and 1.
filtered = (img @ sepia_filter.T).clip(0, 1)
filtered
plt.imshow(filtered)
plt.axis('off');
Key takeaway: avoid for-loops whenever possible!¶
You can do a lot without for-loops, both in numpy and in pandas.
pandas and numpy¶

pandas is built upon numpy!¶
- A Series in pandas is a numpy array with an index.
- A DataFrame is like a dictionary of columns, each of which is a numpy array (see the sketch below).
- Many operations in pandas are fast because they use numpy's implementations.
- If you need to access the array underlying a DataFrame or Series, use the to_numpy method.
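To make the "dictionary of columns" mental model concrete, here's a small sketch with made-up data (the breeds and costs below are hypothetical, just for illustration):

# Building a DataFrame from a dictionary mapping column names to numpy arrays.
data = {
    'breed': np.array(['Labrador', 'Beagle', 'Poodle']),
    'lifetime_cost': np.array([21000, 20000, 22000]),
}
mini_dogs = pd.DataFrame(data)
# Selecting a column gives a Series; to_numpy returns its underlying numpy array.
mini_dogs['lifetime_cost'].to_numpy()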
dog_path = Path('data') / 'dogs43.csv'
dogs = pd.read_csv(dog_path)
dogs
dogs['lifetime_cost']
dogs['lifetime_cost'].to_numpy()
pandas data types¶
- Each Series (column) has a numpy data type, which refers to the type of the values stored within. Access it using the dtypes attribute.
- A column's data type determines which operations can be applied to it.
- pandas tries to guess the correct data types for a given DataFrame, and is often wrong.
  - This can lead to incorrect calculations and poor memory/time performance.
- As a result, you will often need to explicitly convert between data types.
dogs
dogs.dtypes
pandas data types¶
Notice that Python str types are object types in numpy and pandas.
| Pandas dtype | Python type | NumPy type | SQL type | Usage |
|---|---|---|---|---|
| int64 | int | int_, int8,...,int64, uint8,...,uint64 | INT, BIGINT | Integer numbers |
| float64 | float | float_, float16, float32, float64 | FLOAT | Floating point numbers |
| bool | bool | bool_ | BOOL | True/False values |
| datetime64 or Timestamp | datetime.datetime | datetime64 | DATETIME | Date and time values |
| timedelta64 or Timedelta | datetime.timedelta | timedelta64 | NA | Differences between two datetimes |
| category | NA | NA | ENUM | Finite list of text values |
| object | str | string, unicode | NA | Text |
| object | NA | object | NA | Mixed types |
This article details how pandas stores different data types under the hood.
This article explains how numpy/pandas int64 operations differ from vanilla int operations.
Type conversion¶
You can change the data type of a Series using the .astype Series method.
For example, we can change the data type of the 'lifetime_cost' column in dogs to be uint32:
dogs
# Gives the types as well as the space taken up by the DataFrame.
dogs.info()
dogs = dogs.assign(lifetime_cost=dogs['lifetime_cost'].astype('uint32'))
Now, the DataFrame takes up less space! This may be insignificant in our DataFrame, but makes a difference when working with larger datasets.
dogs.info()
💡 Pro-Tip: Setting dtypes in read_csv¶
Usually, we prefer to set the correct dtypes in read_csv, since it can help pandas load in files more quickly:
dog_path
dogs = pd.read_csv(dog_path, dtype={'lifetime_cost': 'uint32'})
dogs
dogs.dtypes
Axes¶
- The rows and columns of a DataFrame are both stored as Series.
- The axis specifies the direction of a slice of a DataFrame.

- Axis 0 refers to the index (rows).
- Axis 1 refers to the columns.
- These are the same axes definitions that 2D numpy arrays have!
DataFrame methods with axis¶
- Many Series methods work on DataFrames.
- In such cases, the DataFrame method usually applies the Series method to every row or column.
- Many of these methods accept an axis argument; the default is usually axis=0.
dogs
# Max element in each column.
dogs.max()
# Max element in each row – throws an error since there are different types in each row.
# dogs.max(axis=1)
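One way to take a row-wise max anyway – a sketch, assuming we only care about the numeric columns – is to select those columns first:

# Keep only the numeric columns, then take the max across each row (axis=1).
dogs.select_dtypes(include='number').max(axis=1)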
# The number of unique values in each column.
dogs.nunique()
# describe doesn't accept an axis argument; it works on every numeric column in the DataFrame it is called on.
dogs.describe()
Exercise
Pick a dog breed that you personally like or know the name of. Then:

- Try to find a few other dog breeds that are similar in weight to yours in all_dogs.
- Which similar breeds have the lowest and highest 'lifetime_cost'? 'intelligence_rank'?
- Are there any similar breeds that you haven't heard of before?
For fun, look up these dog breeds on the AKC website to see what they look like!
all_dogs = pd.read_csv(Path('data') / 'all_dogs.csv')
all_dogs
Data granularity and the groupby method¶
Example: Palmer Penguins¶
Artwork by @allison_horst
The dataset we'll work with for the rest of the lecture involves various measurements taken of three species of penguins in Antarctica.
IFrame('https://www.youtube-nocookie.com/embed/CCrNAHXUstU?si=-DntSyUNp5Kwitjm&start=11',
width=560, height=315)
import seaborn as sns
penguins = sns.load_dataset('penguins').dropna()
penguins
Here, each row corresponds to a single penguin, and each column corresponds to a different attribute (or feature) we have for each penguin. Data formatted in this way is called tidy data.
Granularity¶
- Granularity refers to what each observation in a dataset represents.
- Fine: small details.
- Coarse: bigger picture.
- If you can control how your dataset is created, you should opt for finer granularity, i.e. for more detail.
- You can always remove details, but it's difficult to add detail that isn't already there.
- But obtaining fine-grained data can take more time/money.
- Today, we'll focus on how to remove details from fine-grained data, in order to help us understand bigger-picture trends in our data.
Aggregating¶
Aggregating is the act of combining many values into a single value.
- What is the mean 'body_mass_g' for all penguins?
penguins['body_mass_g'].mean()
- What is the mean 'body_mass_g' for each species?
# ???
Naive approach: looping through unique values¶
species_map = pd.Series([], dtype=float)
for species in penguins['species'].unique():
species_only = penguins.loc[penguins['species'] == species]
species_map.loc[species] = species_only['body_mass_g'].mean()
species_map
- For each unique 'species', we make a pass through the entire dataset.
  - The asymptotic runtime of this procedure is $\Theta(ns)$, where $n$ is the number of rows and $s$ is the number of unique species.
- While there are other loop-based solutions that only involve a single pass over the DataFrame (one is sketched below), we'd like to avoid Python loops entirely, as they're slow.
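For reference, here's a rough sketch of such a single-pass solution, accumulating running totals and counts in dictionaries – note that it still relies on a Python loop:

# One pass over the data: update the total body mass and count for each species.
totals, counts = {}, {}
for species, mass in zip(penguins['species'], penguins['body_mass_g']):
    totals[species] = totals.get(species, 0) + mass
    counts[species] = counts.get(species, 0) + 1
{species: totals[species] / counts[species] for species in totals}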
Grouping¶
A better solution is to use the groupby method.
# Before:
penguins['body_mass_g'].mean()
# After:
penguins.groupby('species')['body_mass_g'].mean()
Somehow, the groupby method computes what we're looking for in just one line. How?
%%pt
penguins.groupby('species')['body_mass_g'].mean()
"Split-apply-combine" paradigm¶
The groupby method involves three steps: split, apply, and combine. This is the same terminology that the pandas documentation uses.

Split breaks up and "groups" the rows of a DataFrame according to the specified key. There is one "group" for every unique value of the key.
Apply uses a function (e.g. aggregation, transformation, filtration) within the individual groups.
Combine stitches the results of these operations into an output DataFrame.
The split-apply-combine pattern can be parallelized to work on multiple computers or threads, by sending computations for each group to different processors.
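To make the three steps concrete, here's a rough sketch that carries them out manually on penguins (groupby does all of this for us, far more efficiently):

# Split: build one sub-DataFrame per unique species.
groups = {species: penguins[penguins['species'] == species]
          for species in penguins['species'].unique()}

# Apply: compute the mean body mass within each group.
results = {species: group['body_mass_g'].mean() for species, group in groups.items()}

# Combine: stitch the per-group results into a single Series.
pd.Series(results)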
More examples¶
Before we dive into the internals, let's look at a few more examples.
Question 🤔
What proportion of penguins of each 'species' live on 'Dream' island?
Your output should look like:
species
Adelie 0.38
Chinstrap 1.00
Gentoo 0.00
# Fill this in, then respond on dsc80.com/q
DataFrameGroupBy objects and aggregation¶
DataFrameGroupBy objects¶
We've just evaluated a few expressions of the following form.
penguins
penguins.groupby('species')['bill_length_mm'].mean()
There are two method calls in the expression above: .groupby('species') and .mean(). What happens in the .groupby() call?
penguins.groupby('species')
Peeking under the hood¶
If df is a DataFrame, then df.groupby(key) returns a DataFrameGroupBy object.
This object represents the "split" in "split-apply-combine".
# Simplified DataFrame for demonstration:
penguins_small = penguins.iloc[[0, 150, 300, 1, 251, 151, 301], [0, 5, 6]]
penguins_small
# Creates one group for each unique value in the species column.
penguin_groups = penguins_small.groupby('species')
penguin_groups
%%pt
penguin_groups
DataFrameGroupBy objects have a groups attribute, which is a dictionary in which the keys are group names and the values are lists of row labels.
penguin_groups.groups
DataFrameGroupBy objects also have a get_group(key) method, which returns a DataFrame with only the values for the given key.
penguin_groups.get_group('Chinstrap')
# Same as the above!
penguins_small.query('species == "Chinstrap"')
We usually don't use these attributes and methods, but they're useful in understanding how groupby works under the hood.
Aggregation¶
- Once we create a DataFrameGroupBy object, we need to apply some function to each group, and combine the results.
- The most common operation we apply to each group is an aggregation.
  - Remember, aggregation is the act of combining many values into a single value.
- To perform an aggregation, use an aggregation method on the DataFrameGroupBy object, e.g. .mean(), .max(), or .median().
Let's look at some examples.
penguins_small
penguins_small.groupby('species')['body_mass_g'].mean()
# Whoa, what happened in the sex column?
penguins_small.groupby('species').sum()
penguins_small.groupby('species').last()
penguins_small.groupby('species').max()
Column independence¶
Within each group, the aggregation method is applied to each column independently.
penguins_small.groupby('species').max()
It is not telling us that there is a 'Male' 'Adelie' penguin with a 'body_mass_g' of 3800.0!
# This penguin is Female!
penguins_small.loc[(penguins_small['species'] == 'Adelie') & (penguins_small['body_mass_g'] == 3800.0)]
Question 🤔
Find the species, island, and body_mass_g of the heaviest Male and Female penguins in penguins (not penguins_small).
# Your code goes here.
Column selection and performance implications¶
- By default, the aggregator will be applied to all columns that it can be applied to.
  - max, min, and sum are defined on strings, while median and mean are not (see the example below).
- If we only care about one column, we can select that column before aggregating to save time.
  - DataFrameGroupBy objects support [] notation, just like DataFrames.
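To see the first point concretely, here's a quick sketch comparing max and mean on a string column:

# max works on a string column: it returns the lexicographically largest value.
penguins['island'].max()

# mean does not – uncommenting the line below raises a TypeError.
# penguins['island'].mean()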
# Back to the big penguins dataset!
penguins
# Works, but involves wasted effort since the other columns had to be aggregated for no reason.
penguins.groupby('species').sum()['bill_length_mm']
# This is a SeriesGroupBy object!
penguins.groupby('species')['bill_length_mm']
# Saves time!
penguins.groupby('species')['bill_length_mm'].sum()
To demonstrate that the former is slower than the latter, we can use %%timeit. For reference, we'll also include our earlier for-loop-based solution.
%%timeit
penguins.groupby('species').sum()['bill_length_mm']
%%timeit
penguins.groupby('species')['bill_length_mm'].sum()
%%timeit
species_map = pd.Series([], dtype=float)
for species in penguins['species'].unique():
species_only = penguins.loc[penguins['species'] == species]
species_map.loc[species] = species_only['body_mass_g'].mean()
species_map
Takeaways¶
It's important to understand what each piece of your code evaluates to – in the first two timed examples, the code is almost identical, but the performance is quite different.
# Slower
penguins.groupby('species').sum()['bill_length_mm']

# Faster
penguins.groupby('species')['bill_length_mm'].sum()
The groupby method is much quicker than for-looping over the DataFrame in Python. It can often produce results using just a single, fast pass over the data, updating the sum, mean, count, min, or other aggregate for each group along the way.

You should always select the columns you want after groupby, unless you really know what you're doing!
Beyond default aggregation methods¶
- There are many built-in aggregation methods.
- What if you want to apply different aggregation methods to different columns?
- What if the aggregation method you want to use doesn't already exist in pandas?
The aggregate method¶
- The DataFrameGroupBy object has a general aggregate method, which aggregates using one or more operations.
  - Remember, aggregation is the act of combining many values into a single value.
- There are many ways of using aggregate; refer to the documentation for a comprehensive list.
- Example arguments:
  - A single function.
  - A list of functions.
  - A dictionary mapping column names to functions.
- Per the documentation, agg is an alias for aggregate.
Example¶
How many penguins are there of each 'species', and what is the mean 'body_mass_g' of each 'species'?
(penguins
.groupby('species')
['body_mass_g']
.aggregate(['count', 'mean'])
)
Example¶
What is the maximum 'bill_length_mm' of each 'species', and on which 'island's is each 'species' found?
(penguins
.groupby('species')
.aggregate({'bill_length_mm': 'max', 'island': 'unique'})
)
Example¶
What is the interquartile range of the 'body_mass_g' of each 'species'?
# Here, the argument to agg is a function,
# which takes in a pd.Series and returns a scalar.
def iqr(s):
return np.percentile(s, 75) - np.percentile(s, 25)
(penguins
.groupby('species')
['body_mass_g']
.agg(iqr)
)