In [ ]:
# You'll start seeing this cell in most lectures.
# It exists to hide all of the import statements and other setup
# code we need in lecture notebooks.
from dsc80_utils import *
In [ ]:
%reload_ext pandas_tutor
%set_pandas_tutor_options {"maxDisplayCols": 8, "nohover": True, "projectorMode": True}

Lecture 2 – DataFrame Fundamentals¶

Agenda¶

  • numpy arrays.
  • From babypandas to pandas.
    • Deep dive into DataFrames.
  • Accessing subsets of rows and columns in DataFrames.
    • .loc and .iloc.
    • Querying (i.e. filtering).
  • Adding and modifying columns.
  • pandas and numpy.

We can't cover every single detail! The pandas documentation will be your friend.

numpy arrays¶

numpy overview¶

  • numpy stands for "numerical Python". It is a commonly-used Python module that enables fast computation involving arrays and matrices.
  • numpy's main object is the array. In numpy, arrays are:
    • Homogeneous – all values are of the same type (demonstrated right after this list).
    • (Potentially) multi-dimensional.
  • Computation in numpy is fast because:
    • Much of it is implemented in C.
    • numpy arrays are stored more efficiently in memory than, say, Python lists.
  • This site provides a good overview of numpy arrays.
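A quick way to see homogeneity in action (a small sketch; the values are arbitrary):

In [ ]:
# Mixing types forces every value into a single common type.
np.array([1, 2.5]).dtype
In [ ]:
# Even one string turns the whole array into a string array.
np.array([1, 'two']).dtype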

We use numpy to work with sequences of data:

In [ ]:
arr = np.arange(10)
arr
In [ ]:
# The shape (10,) means that the array only has a single dimension,
# of size 10.
arr.shape
In [ ]:
2 ** arr

Arrays come equipped with several handy methods; some examples are below, but you can read about them all here.

In [ ]:
(2 ** arr).sum()
In [ ]:
(2 ** arr).mean()
In [ ]:
(2 ** arr).max()
In [ ]:
(2 ** arr).argmax()

⚠️ The dangers of for-loops¶

  • for-loops are slow when processing large datasets. You will rarely write for-loops (except for Lab 1 and Project 1), and may be penalized on assignments for using them when unnecessary!
  • One of the biggest benefits of numpy is that it supports vectorized operations.
    • If a and b are two arrays of the same length, then a + b is a new array of the same length containing the element-wise sum of a and b (sketched below).
  • To illustrate how much faster numpy arithmetic is than using a for-loop, let's compute the squares of the numbers between 0 and 1,000,000:
    • Using a for-loop.
    • Using vectorized arithmetic, through numpy.
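First, a quick sketch of the element-wise behavior mentioned above (the values are arbitrary):

In [ ]:
a = np.array([1, 2, 3])
b = np.array([10, 20, 30])
# Values are paired up by position; no Python loop needed.
a + b

Now, the timing comparison: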
In [ ]:
%%timeit
squares = []
for i in range(1_000_000):
    squares.append(i * i)

In vanilla Python, this takes about 0.04 seconds per loop.

In [ ]:
%%timeit
squares = np.arange(1_000_000) ** 2

In numpy, this only takes about 0.001 seconds per loop, more than 40x faster! Note that under the hood, numpy is also using a for-loop, but it's a for-loop implemented in C, which is much faster than Python.

Multi-dimensional arrays¶

While we didn't see these very often in DSC 10, multi-dimensional lists/arrays may have since come up in DSC 20, 30, or 40A (especially in the context of linear algebra).

We'll spend a bit of time talking about 2D (and 3D) arrays here, since in some ways, they behave similarly to DataFrames.

Below, we create a 2D array from scratch.

In [ ]:
nums = np.array([
    [5, 1, 9, 7],
    [9, 8, 2, 3],
    [2, 5, 0, 4]
])

nums
In [ ]:
# nums has 3 rows and 4 columns.
nums.shape

We can also create 2D arrays by reshaping other arrays.

In [ ]:
# Here, we're asking to reshape np.arange(1, 7)
# so that it has 2 rows and 3 columns.
a = np.arange(1, 7).reshape((2, 3))
a

Operations along axes¶

In 2D arrays (and DataFrames), axis 0 refers to the rows (up and down) and axis 1 refers to the columns (left and right).

In [ ]:
a

If we specify axis=0, a.sum will "compress" along axis 0.

In [ ]:
a.sum(axis=0)

If we specify axis=1, a.sum will "compress" along axis 1.

In [ ]:
a.sum(axis=1)

Selecting rows and columns from 2D arrays¶

You can use [square brackets] to slice rows and columns out of an array, using the same slicing conventions you saw in DSC 20.

In [ ]:
a
In [ ]:
# Accesses row 0 and all columns.
a[0, :]
In [ ]:
# Same as the above.
a[0]
In [ ]:
# Accesses all rows and column 1.
a[:, 1]
In [ ]:
# Accesses row 0 and columns 1 and onwards.
a[0, 1:]

Question 🤔

Try and predict the value of grid[-1, 1:].sum() without running the code below.

In [ ]:
s = (5, 3)
grid = np.ones(s) * 2 * np.arange(1, 16).reshape(s)
# grid[-1, 1:].sum()

Example: Image processing¶

numpy arrays are homogeneous and potentially multi-dimensional.

It turns out that images can be represented as 3D numpy arrays. The color of each pixel can be described with three numbers under the RGB model – a red value, green value, and blue value. Each of these can vary from 0 to 1.

(Image: the RGB color model; image source)
In [ ]:
from PIL import Image
img_path = Path('imgs') / 'bentley.jpg'
img = np.asarray(Image.open(img_path)) / 255
In [ ]:
img
In [ ]:
img.shape
In [ ]:
plt.imshow(img)
plt.axis('off');

Applying a greyscale filter¶

One way to convert an image to greyscale is to average its red, green, and blue values.

In [ ]:
mean_2d = img.mean(axis=2)
mean_2d

But mean_2d is just a single channel, so plt.imshow renders it with a default colormap rather than in greyscale:

In [ ]:
plt.imshow(mean_2d)
plt.axis('off');

We need to repeat mean_2d three times along axis 2, to use the same values for the red, green, and blue channels. np.repeat will help us here.

In [ ]:
# np.newaxis is an alias for None.
# It helps us introduce an additional axis.
np.arange(5)[:, np.newaxis]
In [ ]:
np.repeat(np.arange(5)[:, np.newaxis], 3, axis=1)
In [ ]:
mean_3d = np.repeat(mean_2d[:, :, np.newaxis], 3, axis=2)
In [ ]:
plt.imshow(mean_3d)
plt.axis('off');

Applying a sepia filter¶

Let's sepia-fy Junior!

(Image credits)

To do so, we apply the following conversion to each pixel:

$$\begin{align*} R_{\text{sepia}} &= 0.393R + 0.769G + 0.189B \\ G_{\text{sepia}} &= 0.349R + 0.686G + 0.168B \\ B_{\text{sepia}} &= 0.272R + 0.534G + 0.131B\end{align*}$$
In [ ]:
sepia_filter = np.array([
    [0.393, 0.769, 0.189],
    [0.349, 0.686, 0.168],
    [0.272, 0.534, 0.131]
])
In [ ]:
# Multiplies each pixel by the sepia_filter matrix.
# Then, clips each RGB value to be between 0 and 1.
filtered = (img @ sepia_filter.T).clip(0, 1)
filtered
In [ ]:
plt.imshow(filtered)
plt.axis('off');

Key takeaway: avoid for-loops whenever possible!¶

You can do a lot without for-loops, both in numpy and in pandas.

pandas and numpy¶


pandas is built upon numpy!¶

  • A Series in pandas is a numpy array with an index.
  • A DataFrame is like a dictionary of columns, each of which is a numpy array (sketched below).
  • Many operations in pandas are fast because they use numpy's implementations.
  • If you need to access the array underlying a DataFrame or Series, use the to_numpy method.
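To make the "dictionary of columns" picture concrete, here's a tiny sketch (the column names and values are made up for illustration):

In [ ]:
# Each column is a numpy array; the DataFrame bundles them
# together and attaches an index.
pd.DataFrame({
    'breed': np.array(['labrador', 'beagle', 'pug']),
    'weight': np.array([70, 25, 15]),
})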
In [ ]:
dog_path = Path('data') / 'dogs43.csv'
dogs = pd.read_csv(dog_path)
dogs
In [ ]:
dogs['lifetime_cost']
In [ ]:
dogs['lifetime_cost'].to_numpy()

pandas data types¶

  • Each Series (column) has a numpy data type, which refers to the type of the values stored within. Access a Series' type with its dtype attribute, or all columns' types at once with the DataFrame's dtypes attribute.
  • A column's data type determines which operations can be applied to it.
  • pandas tries to guess the correct data types for a given DataFrame, and is often wrong.
    • This can lead to incorrect calculations and poor memory/time performance.
  • As a result, you will often need to explicitly convert between data types.
In [ ]:
dogs
In [ ]:
dogs.dtypes

pandas data types¶

Notice that Python str types are object types in numpy and pandas.

| Pandas dtype | Python type | NumPy type | SQL type | Usage |
|---|---|---|---|---|
| int64 | int | int_, int8, ..., int64, uint8, ..., uint64 | INT, BIGINT | Integer numbers |
| float64 | float | float_, float16, float32, float64 | FLOAT | Floating point numbers |
| bool | bool | bool_ | BOOL | True/False values |
| datetime64 or Timestamp | datetime.datetime | datetime64 | DATETIME | Date and time values |
| timedelta64 or Timedelta | datetime.timedelta | timedelta64 | NA | Differences between two datetimes |
| category | NA | NA | ENUM | Finite list of text values |
| object | str | string, unicode | NA | Text |
| object | NA | object | NA | Mixed types |

This article details how pandas stores different data types under the hood.

This article explains how numpy/pandas int64 operations differ from vanilla int operations.
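One such difference, sketched quickly (the numbers are chosen specifically to trigger an overflow):

In [ ]:
# Python ints have arbitrary precision, so this is exact.
2 ** 62 * 2
In [ ]:
# numpy int64 values are fixed at 64 bits, so the same computation
# silently wraps around to a negative number.
np.array([2 ** 62]) * 2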

Type conversion¶

You can change the data type of a Series using the .astype Series method.

For example, we can change the data type of the 'lifetime_cost' column in dogs to be uint32:

In [ ]:
dogs
In [ ]:
# Gives the types as well as the space taken up by the DataFrame.
dogs.info()
In [ ]:
dogs = dogs.assign(lifetime_cost=dogs['lifetime_cost'].astype('uint32'))

Now, the DataFrame takes up less space! The savings are negligible for a dataset this small, but they matter when working with larger datasets.

In [ ]:
dogs.info()

💡 Pro-Tip: Setting dtypes in read_csv¶

Usually, we prefer to set the correct dtypes in read_csv, since it can help pandas load in files more quickly:

In [ ]:
dog_path
In [ ]:
dogs = pd.read_csv(dog_path, dtype={'lifetime_cost': 'uint32'})
dogs
In [ ]:
dogs.dtypes

Axes¶

  • The rows and columns of a DataFrame are both stored as Series (verified below).
  • The axis specifies the direction of a slice of a DataFrame.
  • Axis 0 refers to the index (rows).
  • Axis 1 refers to the columns.
  • These are the same axes definitions that 2D numpy arrays have!
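To verify that first point (using the dogs DataFrame from earlier):

In [ ]:
# A single row is a Series...
type(dogs.iloc[0])
In [ ]:
# ...and so is a single column.
type(dogs['lifetime_cost'])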

DataFrame methods with axis¶

  • Many Series methods work on DataFrames.
  • In such cases, the DataFrame method usually applies the Series method to every row or column.
  • Many of these methods accept an axis argument; the default is usually axis=0.
In [ ]:
dogs
In [ ]:
# Max element in each column.
dogs.max()
In [ ]:
# Max element in each row – throws an error since there are different types in each row.
# dogs.max(axis=1)
In [ ]:
# The number of unique values in each column.
dogs.nunique()
In [ ]:
# describe doesn't accept an axis argument; it works on every numeric column in the DataFrame it is called on.
dogs.describe()

Exercise

Pick a dog breed that you personally like or know the name of. Then:
  • Try to find a few other dog breeds that are similar in weight to yours in all_dogs.
  • Which similar breeds have the lowest and highest 'lifetime_cost'? 'intelligence_rank'?
  • Are there any similar breeds that you haven't heard of before?

For fun, look up these dog breeds on the AKC website to see what they look like!
In [ ]:
all_dogs = pd.read_csv(Path('data') / 'all_dogs.csv')
all_dogs

Data granularity and the groupby method¶

Example: Palmer Penguins¶

Artwork by @allison_horst

The dataset we'll work with for the rest of the lecture involves various measurements taken of three species of penguins in Antarctica.

In [ ]:
IFrame('https://www.youtube-nocookie.com/embed/CCrNAHXUstU?si=-DntSyUNp5Kwitjm&start=11',
       width=560, height=315)
In [ ]:
import seaborn as sns
penguins = sns.load_dataset('penguins').dropna()
penguins

Here, each row corresponds to a single penguin, and each column corresponds to a different attribute (or feature) we have for each penguin. Data formatted in this way is called tidy data.

Granularity¶

  • Granularity refers to what each observation in a dataset represents.
    • Fine: small details.
    • Coarse: bigger picture.
  • If you can control how your dataset is created, you should opt for finer granularity, i.e. for more detail.
    • You can always remove details, but it's difficult to add detail that isn't already there.
    • But obtaining fine-grained data can take more time/money.
  • Today, we'll focus on how to remove details from fine-grained data, in order to help us understand bigger-picture trends in our data.

Aggregating¶

Aggregating is the act of combining many values into a single value.

  • What is the mean 'body_mass_g' for all penguins?
In [ ]:
penguins['body_mass_g'].mean()
  • What is the mean 'body_mass_g' for each species?
In [ ]:
# ???

Naive approach: looping through unique values¶

In [ ]:
species_map = pd.Series([], dtype=float)

for species in penguins['species'].unique():
    species_only = penguins.loc[penguins['species'] == species]
    species_map.loc[species] = species_only['body_mass_g'].mean()

species_map
  • For each unique 'species', we make a pass through the entire dataset.
    • The asymptotic runtime of this procedure is $\Theta(ns)$, where $n$ is the number of rows and $s$ is the number of unique species.
  • While there are other loop-based solutions that only involve a single pass over the DataFrame (one is sketched below), we'd like to avoid Python loops entirely, as they're slow.
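For reference, here's a rough sketch of such a single-pass approach: it keeps a running sum and count per species, but it still loops in Python, so it remains slow.

In [ ]:
# One pass over the data: accumulate a sum and a count per species.
totals, counts = {}, {}
for species, mass in zip(penguins['species'], penguins['body_mass_g']):
    totals[species] = totals.get(species, 0) + mass
    counts[species] = counts.get(species, 0) + 1

pd.Series({species: totals[species] / counts[species] for species in totals})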

Grouping¶

A better solution is to use the groupby method.

In [ ]:
# Before:
penguins['body_mass_g'].mean()
In [ ]:
# After:
penguins.groupby('species')['body_mass_g'].mean()

Somehow, the groupby method computes what we're looking for in just one line. How?

In [ ]:
%%pt

penguins.groupby('species')['body_mass_g'].mean()

"Split-apply-combine" paradigm¶

The groupby method involves three steps: split, apply, and combine. This is the same terminology that the pandas documentation uses.

  • Split breaks up and "groups" the rows of a DataFrame according to the specified key. There is one "group" for every unique value of the key.

  • Apply uses a function (e.g. aggregation, transformation, filtration) within the individual groups.

  • Combine stitches the results of these operations into an output DataFrame.

  • The split-apply-combine pattern can be parallelized to work on multiple computers or threads, by sending computations for each group to different processors.
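To build intuition for the three steps, here's a minimal sketch that mimics them by hand (just for illustration; pandas' actual implementation is far more efficient):

In [ ]:
# Split: one sub-DataFrame per unique value of the key.
groups = {species: penguins[penguins['species'] == species]
          for species in penguins['species'].unique()}

# Apply: aggregate within each group independently.
means = {species: group['body_mass_g'].mean()
         for species, group in groups.items()}

# Combine: stitch the per-group results into a single output.
pd.Series(means)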

More examples¶

Before we dive into the internals, let's look at a few more examples.

Question 🤔

What proportion of penguins of each 'species' live on 'Dream' island?

Your output should look like:

species
Adelie       0.38
Chinstrap    1.00
Gentoo       0.00
In [ ]:
# Fill this in, then respond on dsc80.com/q

DataFrameGroupBy objects and aggregation¶

DataFrameGroupBy objects¶

We've just evaluated a few expressions of the following form.

In [ ]:
penguins
In [ ]:
penguins.groupby('species')['bill_length_mm'].mean()

There are two method calls in the expression above: .groupby('species') and .mean(). What happens in the .groupby() call?

In [ ]:
penguins.groupby('species')

Peeking under the hood¶

If df is a DataFrame, then df.groupby(key) returns a DataFrameGroupBy object.

This object represents the "split" in "split-apply-combine".

In [ ]:
# Simplified DataFrame for demonstration:
penguins_small = penguins.iloc[[0, 150, 300, 1, 251, 151, 301], [0, 5, 6]]
penguins_small
In [ ]:
# Creates one group for each unique value in the species column.
penguin_groups = penguins_small.groupby('species')
penguin_groups
In [ ]:
%%pt
penguin_groups

DataFrameGroupBy objects have a groups attribute, which is a dictionary in which the keys are group names and the values are lists of row labels.

In [ ]:
penguin_groups.groups

DataFrameGroupBy objects also have a get_group(key) method, which returns a DataFrame with only the values for the given key.

In [ ]:
penguin_groups.get_group('Chinstrap')
In [ ]:
# Same as the above!
penguins_small.query('species == "Chinstrap"')

We usually don't use these attributes and methods, but they're useful in understanding how groupby works under the hood.

Aggregation¶

  • Once we create a DataFrameGroupBy object, we need to apply some function to each group, and combine the results.

  • The most common operation we apply to each group is an aggregation.

    • Remember, aggregation is the act of combining many values into a single value.
  • To perform an aggregation, use an aggregation method on the DataFrameGroupBy object, e.g. .mean(), .max(), or .median().

Let's look at some examples.

In [ ]:
penguins_small
In [ ]:
penguins_small.groupby('species')['body_mass_g'].mean()
In [ ]:
# Whoa, what happened in the sex column?
penguins_small.groupby('species').sum()
In [ ]:
penguins_small.groupby('species').last()
In [ ]:
penguins_small.groupby('species').max()

Column independence¶

Within each group, the aggregation method is applied to each column independently.

In [ ]:
penguins_small.groupby('species').max()

It is not telling us that there is a 'Male' 'Adelie' penguin with a 'body_mass_g' of 3800.0!

In [ ]:
# This penguin is Female!
penguins_small.loc[(penguins_small['species'] == 'Adelie') & (penguins_small['body_mass_g'] == 3800.0)]

Question 🤔

Find the species, island, and body_mass_g of the heaviest Male and Female penguins in penguins (not penguins_small).

In [ ]:
# Your code goes here.

Column selection and performance implications¶

  • By default, the aggregator will be applied to all columns that it can be applied to.
    • max, min, and sum are defined on strings, while median and mean are not.
  • If we only care about one column, we can select that column before aggregating to save time.
    • DataFrameGroupBy objects support [] notation, just like DataFrames.
In [ ]:
# Back to the big penguins dataset!
penguins
In [ ]:
# Works, but involves wasted effort since the other columns had to be aggregated for no reason.
penguins.groupby('species').sum()['bill_length_mm']
In [ ]:
# This is a SeriesGroupBy object!
penguins.groupby('species')['bill_length_mm']
In [ ]:
# Saves time!
penguins.groupby('species')['bill_length_mm'].sum()

To demonstrate that the former is slower than the latter, we can use %%timeit. For reference, we'll also include our earlier for-loop-based solution.

In [ ]:
%%timeit
penguins.groupby('species').sum()['bill_length_mm']
In [ ]:
%%timeit
penguins.groupby('species')['bill_length_mm'].sum()
In [ ]:
%%timeit
species_map = pd.Series([], dtype=float)

for species in penguins['species'].unique():
    species_only = penguins.loc[penguins['species'] == species]
    species_map.loc[species] = species_only['body_mass_g'].mean()

species_map

Takeaways¶

  • It's important to understand what each piece of your code evaluates to – in the first two timed examples, the code is almost identical, but the performance is quite different.

    # Slower
    penguins.groupby('species').sum()['bill_length_mm']
    
    # Faster
    penguins.groupby('species')['bill_length_mm'].sum()
    
  • The groupby method is much quicker than for-looping over the DataFrame in Python. It can often produce results using just a single, fast pass over the data, updating the sum, mean, count, min, or other aggregate for each group along the way.

  • You should always select the columns you want after groupby, unless you really know what you're doing!

Beyond default aggregation methods¶

  • There are many built-in aggregation methods.
  • What if you want to apply different aggregation methods to different columns?
  • What if the aggregation method you want to use doesn't already exist in pandas?

The aggregate method¶

  • The DataFrameGroupBy object has a general aggregate method, which aggregates using one or more operations.
    • Remember, aggregation is the act of combining many values into a single value.
  • There are many ways of using aggregate; refer to the documentation for a comprehensive list.
  • Example arguments:
    • A single function.
    • A list of functions.
    • A dictionary mapping column names to functions.
  • Per the documentation, agg is an alias for aggregate.

Example¶

How many penguins are there of each 'species', and what is the mean 'body_mass_g' of each 'species'?

In [ ]:
(penguins
 .groupby('species')
 ['body_mass_g']
 .aggregate(['count', 'mean'])
)

Example¶

What is the maximum 'bill_length_mm' of each 'species', and on which 'island's is each 'species' found?

In [ ]:
(penguins
 .groupby('species')
 .aggregate({'bill_length_mm': 'max', 'island': 'unique'})
)

Example¶

What is the interquartile range of the 'body_mass_g' of each 'species'?

In [ ]:
# Here, the argument to agg is a function,
# which takes in a pd.Series and returns a scalar.

def iqr(s):
    return np.percentile(s, 75) - np.percentile(s, 25)

(penguins
 .groupby('species')
 ['body_mass_g']
 .agg(iqr)
)
In [ ]: