import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib_inline.backend_inline import set_matplotlib_formats
set_matplotlib_formats("svg")
sns.set_context("poster")
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (10, 5)
pd.set_option("display.max_rows", 8)
pd.set_option("display.max_columns", 8)
pd.set_option("display.precision", 2)
Announcements 📣¶
- The Welcome Survey is due tonight at 11:59pm.
- Lab 1 is released, and is due next Monday, Oct 9th at 11:59PM!
- See the Tech Support page for instructions and watch this video 🎥 for tips on how to set up your environment and work on assignments.
- Please try to set up your computer ASAP, since we have OH on Friday but not over the weekend to help debug your environment.
- You may use a slip day, in which case the due date will be Oct 10th.
- Discussion tomorrow will talk about what a conda environment is, and how to debug package import issues on your own.
- Lecture recordings are available here .
Agenda¶
- Whirlwind review of
numpyandbabypandas. pandasDataFrame objects.- Subsetting dataframes
.loc,.iloc, filtering/querying
Can't cover every single detail! The pandas documentation will be your friend.
Review: numpy¶
- NumPy stands for "numerical Python". It is a commonly-used Python module that enables fast computation involving arrays and matrices.
numpy's main object is the array. Innumpy, arrays are:- Homogenous – all values are of the same type.
- (Potentially) multi-dimensional.
- Computation in
numpyis fast because:- Much of it is implemented in C.
numpyarrays are stored more efficiently in memory than, say, Python lists.
- This site provides a good overview of
numpyarrays.
We used numpy in DSC 10 to work with sequences of data:
arr = np.arange(10)
arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
2 ** arr
array([ 1, 2, 4, 8, 16, 32, 64, 128, 256, 512])
⚠️ The dangers of for-loops¶
for-loops are slow when processing large datasets. You will rarely writefor-loops in DSC 80 (except for Lab 1 and Project 1), and may be penalized on assignments for using them when unnecessary!- One of the biggest benefits of
numpyis that it supports vectorized operations.- If
aandbare two arrays of the same length, thena + bis a new array of the same length containing the element-wise sum ofaandb.
- If
- To illustrate how much faster
numpyarithmetic is than using afor-loop, let's compute the squares of the numbers between 0 and 1,000,000:- Using a
for-loop. - Using vectorized arithmetic, through
numpy.
- Using a
%%timeit
squares = []
for i in range(1_000_000):
squares.append(i * i)
65 ms ± 926 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
squares = np.arange(1_000_000) ** 2
426 µs ± 1.97 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
- Python: takes about 0.06 seconds per loop
numpy: takes about 0.0004 seconds per loop, more than 100x faster!
Introduction to pandas 🐼¶
pandas¶

pandasis the Python library for tabular data manipulation.- Before
pandaswas developed, the standard data science workflow involved using multiple languages (Python, R, Java) in a single project. - Wes McKinney, the original developer of
pandas, wanted a library which would allow everything to be done in Python.- Python is faster to develop in than Java, and is more general-purpose than R.
pandas data structures¶
There are three key data structures at the core of pandas:
- DataFrame: 2 dimensional tables.
- Series: 1 dimensional array-like object, typically representing a column or row.
- Index: sequence of column or row labels.

Importing pandas and related libraries¶
pandas is almost always imported in conjunction with numpy:
import pandas as pd
import numpy as np
But...¶
The data are no longer available! One website has a slightly different version: https://tmfilho.github.io/akcdata/
We'll use the version that Sam saved while the data were still online.
all_dogs = pd.read_csv('data/all_dogs.csv')
all_dogs
| breed | group | datadog | popularity_all | ... | megarank | size | weight | height | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Border Collie | herding | 3.64 | 45 | ... | 29.0 | medium | NaN | 20.0 |
| 1 | Border Terrier | terrier | 3.61 | 80 | ... | 1.0 | small | 13.5 | NaN |
| 2 | Brittany | sporting | 3.54 | 30 | ... | 11.0 | medium | 35.0 | 19.0 |
| 3 | Cairn Terrier | terrier | 3.53 | 59 | ... | 2.0 | small | 14.0 | 10.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 168 | Welsh Terrier | terrier | NaN | 99 | ... | NaN | small | 20.0 | 15.0 |
| 169 | Wire Fox Terrier | terrier | NaN | 100 | ... | NaN | small | 17.5 | 15.0 |
| 170 | Wirehaired Pointing Griffon | sporting | NaN | 92 | ... | NaN | medium | NaN | 22.0 |
| 171 | Xoloitzcuintli | non-sporting | NaN | 155 | ... | NaN | medium | NaN | 16.5 |
172 rows × 18 columns
all_dogs.columns
Index(['breed', 'group', 'datadog', 'popularity_all', 'popularity',
'lifetime_cost', 'intelligence_rank', 'longevity', 'ailments', 'price',
'food_cost', 'grooming', 'kids', 'megarank_kids', 'megarank', 'size',
'weight', 'height'],
dtype='object')
Discussion Question¶
Let's refresh your DSC 10 knowledge! Find the most popular and least popular dog breeds using the popularity_all column.
# Fill in this cell
A Smaller Dogs Dataframe¶
The all_dogs dataframe is a bit large, so we have a smaller version here to make it easier to show pandas functionality.
dogs = pd.read_csv('data/dogs43.csv')
dogs
| breed | kind | lifetime_cost | longevity | size | weight | height | |
|---|---|---|---|---|---|---|---|
| 0 | Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
| 1 | Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
| 2 | English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
| 3 | Cocker Spaniel | sporting | 24330.0 | 12.50 | small | 25.0 | 14.5 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 39 | Bloodhound | hound | 13824.0 | 6.75 | large | 85.0 | 25.0 |
| 40 | Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
| 41 | Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
| 42 | Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
43 rows × 7 columns
Review: head, tail, shape, index, get, sort_values¶
To extract the first or last few rows of a DataFrame, use the head or tail methods.
dogs.head(3)
| breed | kind | lifetime_cost | longevity | size | weight | height | |
|---|---|---|---|---|---|---|---|
| 0 | Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
| 1 | Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
| 2 | English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
dogs.tail(2)
| breed | kind | lifetime_cost | longevity | size | weight | height | |
|---|---|---|---|---|---|---|---|
| 41 | Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
| 42 | Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
The shape attribute returns the DataFrame's number of rows and columns.
dogs.shape
(43, 7)
# Index is 0, 1, 2, ..., 171
dogs['breed']
0 Brittany
1 Cairn Terrier
2 English Cocker Spaniel
3 Cocker Spaniel
...
39 Bloodhound
40 Bullmastiff
41 Mastiff
42 Saint Bernard
Name: breed, Length: 43, dtype: object
# The default index of a DataFrame is 0, 1, 2, 3, ...
dogs.index
RangeIndex(start=0, stop=43, step=1)
We know that we can use .get() to select out a few columns...
# This is review from DSC 10 but most people don't use .get() in practice.
# Will cover in just a few minutes...
dogs.get(['breed', 'kind', 'longevity'])
| breed | kind | longevity | |
|---|---|---|---|
| 0 | Brittany | sporting | 12.92 |
| 1 | Cairn Terrier | terrier | 13.84 |
| 2 | English Cocker Spaniel | sporting | 11.66 |
| 3 | Cocker Spaniel | sporting | 12.50 |
| ... | ... | ... | ... |
| 39 | Bloodhound | hound | 6.75 |
| 40 | Bullmastiff | working | 7.57 |
| 41 | Mastiff | working | 6.50 |
| 42 | Saint Bernard | working | 7.78 |
43 rows × 3 columns
To sort by a column, use the sort_values method. Like most DataFrame and Series methods, sort_values returns a new DataFrame, and doesn't modify the original.
dogs.sort_values('longevity', ascending=False)
| breed | kind | lifetime_cost | longevity | size | weight | height | |
|---|---|---|---|---|---|---|---|
| 8 | Chihuahua | toy | 26250.0 | 16.50 | small | 5.5 | 5.0 |
| 12 | Tibetan Spaniel | non-sporting | 25549.0 | 14.42 | small | 12.0 | 10.0 |
| 6 | Lhasa Apso | non-sporting | 22031.0 | 13.92 | small | 15.0 | 10.5 |
| 1 | Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 42 | Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
| 40 | Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
| 39 | Bloodhound | hound | 13824.0 | 6.75 | large | 85.0 | 25.0 |
| 41 | Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
43 rows × 7 columns
Setting the index¶
Think of each row's index as its unique identifier or name. Often, we like to set the index of a DataFrame to a unique identifier if we have one available. We can do so with the set_index method.
# By reassigning dogs, our changes will persist.
dogs = dogs.set_index('breed')
dogs
| kind | lifetime_cost | longevity | size | weight | height | |
|---|---|---|---|---|---|---|
| breed | ||||||
| Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
| Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
| English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
| Cocker Spaniel | sporting | 24330.0 | 12.50 | small | 25.0 | 14.5 |
| ... | ... | ... | ... | ... | ... | ... |
| Bloodhound | hound | 13824.0 | 6.75 | large | 85.0 | 25.0 |
| Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
| Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
| Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
43 rows × 6 columns
# There used to be 7 columns, but now there are only 6!
dogs.shape
(43, 6)
💡 Pro-tip: Displaying more rows/columns¶
Sometimes, you just want pandas to display a lot of rows and columns. You can use this helper function to do that:
from IPython.display import display
def display_df(df, rows=pd.options.display.max_rows, cols=pd.options.display.max_columns):
"""Displays n rows and cols from df"""
with pd.option_context("display.max_rows", rows,
"display.max_columns", cols):
display(df)
display_df(dogs, rows=43)
| kind | lifetime_cost | longevity | size | weight | height | |
|---|---|---|---|---|---|---|
| breed | ||||||
| Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.00 |
| Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.00 |
| English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.00 |
| Cocker Spaniel | sporting | 24330.0 | 12.50 | small | 25.0 | 14.50 |
| Shetland Sheepdog | herding | 21006.0 | 12.53 | small | 22.0 | 14.50 |
| Siberian Husky | working | 22049.0 | 12.58 | medium | 47.5 | 21.75 |
| Lhasa Apso | non-sporting | 22031.0 | 13.92 | small | 15.0 | 10.50 |
| Miniature Schnauzer | terrier | 20087.0 | 11.81 | small | 15.5 | 13.00 |
| Chihuahua | toy | 26250.0 | 16.50 | small | 5.5 | 5.00 |
| English Springer Spaniel | sporting | 21946.0 | 12.54 | medium | 45.0 | 19.50 |
| German Shorthaired Pointer | sporting | 25842.0 | 11.46 | large | 62.5 | 24.00 |
| Pointer | sporting | 24445.0 | 12.42 | large | 59.5 | 25.50 |
| Tibetan Spaniel | non-sporting | 25549.0 | 14.42 | small | 12.0 | 10.00 |
| Labrador Retriever | sporting | 21299.0 | 12.04 | medium | 67.5 | 23.00 |
| Maltese | toy | 19084.0 | 12.25 | small | 5.0 | 9.00 |
| Shih Tzu | toy | 21152.0 | 13.20 | small | 12.5 | 9.75 |
| Irish Setter | sporting | 20323.0 | 11.63 | large | 65.0 | 26.00 |
| Golden Retriever | sporting | 21447.0 | 12.04 | medium | 60.0 | 22.75 |
| Chesapeake Bay Retriever | sporting | 16697.0 | 9.48 | large | 67.5 | 23.50 |
| Tibetan Terrier | non-sporting | 20336.0 | 12.31 | small | 24.0 | 15.50 |
| Gordon Setter | sporting | 19605.0 | 11.10 | large | 62.5 | 25.00 |
| Pug | toy | 18527.0 | 11.00 | medium | 16.0 | 16.00 |
| Norfolk Terrier | terrier | 24308.0 | 13.07 | small | 12.0 | 9.50 |
| English Toy Spaniel | toy | 17521.0 | 10.10 | small | 11.0 | 10.00 |
| Cavalier King Charles Spaniel | toy | 18639.0 | 11.29 | small | 15.5 | 12.50 |
| Basenji | hound | 22096.0 | 13.58 | medium | 23.0 | 16.50 |
| Staffordshire Bull Terrier | terrier | 21650.0 | 12.05 | medium | 31.0 | 15.00 |
| Pembroke Welsh Corgi | herding | 23978.0 | 12.25 | small | 26.0 | 11.00 |
| Clumber Spaniel | sporting | 18084.0 | 10.00 | medium | 70.0 | 18.50 |
| Dandie Dinmont Terrier | terrier | 21633.0 | 12.17 | small | 21.0 | 9.00 |
| Giant Schnauzer | working | 26686.0 | 10.00 | large | 77.5 | 25.50 |
| Scottish Terrier | terrier | 17525.0 | 10.69 | small | 20.0 | 10.00 |
| Kerry Blue Terrier | terrier | 17240.0 | 9.40 | medium | 36.5 | 18.50 |
| Afghan Hound | hound | 24077.0 | 11.92 | large | 55.0 | 26.00 |
| Newfoundland | working | 19351.0 | 9.32 | large | 125.0 | 27.00 |
| Rhodesian Ridgeback | hound | 16530.0 | 9.10 | large | 77.5 | 25.50 |
| Borzoi | hound | 16176.0 | 9.08 | large | 82.5 | 28.00 |
| Bull Terrier | terrier | 18490.0 | 10.21 | medium | 60.0 | 21.50 |
| Alaskan Malamute | working | 21986.0 | 10.67 | large | 80.0 | 24.00 |
| Bloodhound | hound | 13824.0 | 6.75 | large | 85.0 | 25.00 |
| Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.50 |
| Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.00 |
| Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.50 |
Selecting columns¶
Selecting columns in babypandas 👶🐼¶
- In
babypandas, you selected columns using the.getmethod. .getalso works inpandas, but it is not idiomatic – people don't usually use it.
dogs
| kind | lifetime_cost | longevity | size | weight | height | |
|---|---|---|---|---|---|---|
| breed | ||||||
| Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
| Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
| English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
| Cocker Spaniel | sporting | 24330.0 | 12.50 | small | 25.0 | 14.5 |
| ... | ... | ... | ... | ... | ... | ... |
| Bloodhound | hound | 13824.0 | 6.75 | large | 85.0 | 25.0 |
| Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
| Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
| Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
43 rows × 6 columns
dogs.get('size')
breed
Brittany medium
Cairn Terrier small
English Cocker Spaniel medium
Cocker Spaniel small
...
Bloodhound large
Bullmastiff large
Mastiff large
Saint Bernard large
Name: size, Length: 43, dtype: object
# This doesn't error, but sometimes we'd like it to.
dogs.get('size oops!')
Selecting columns with []¶
- The standard way to select a column in
pandasis by using the[]operator. - Specifying a column name returns the column as a Series.
- Specifying a list of column names returns a DataFrame.
dogs
| kind | lifetime_cost | longevity | size | weight | height | |
|---|---|---|---|---|---|---|
| breed | ||||||
| Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
| Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
| English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
| Cocker Spaniel | sporting | 24330.0 | 12.50 | small | 25.0 | 14.5 |
| ... | ... | ... | ... | ... | ... | ... |
| Bloodhound | hound | 13824.0 | 6.75 | large | 85.0 | 25.0 |
| Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
| Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
| Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
43 rows × 6 columns
# Returns a Series.
dogs['kind']
breed
Brittany sporting
Cairn Terrier terrier
English Cocker Spaniel sporting
Cocker Spaniel sporting
...
Bloodhound hound
Bullmastiff working
Mastiff working
Saint Bernard working
Name: kind, Length: 43, dtype: object
# Returns a DataFrame.
dogs[['kind', 'size']]
| kind | size | |
|---|---|---|
| breed | ||
| Brittany | sporting | medium |
| Cairn Terrier | terrier | small |
| English Cocker Spaniel | sporting | medium |
| Cocker Spaniel | sporting | small |
| ... | ... | ... |
| Bloodhound | hound | large |
| Bullmastiff | working | large |
| Mastiff | working | large |
| Saint Bernard | working | large |
43 rows × 2 columns
# 🤔
dogs[['kind']]
| kind | |
|---|---|
| breed | |
| Brittany | sporting |
| Cairn Terrier | terrier |
| English Cocker Spaniel | sporting |
| Cocker Spaniel | sporting |
| ... | ... |
| Bloodhound | hound |
| Bullmastiff | working |
| Mastiff | working |
| Saint Bernard | working |
43 rows × 1 columns
# Breeds are stored in the index, which is not a column!
dogs['breed']
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) File ~/mambaforge/envs/dsc80/lib/python3.8/site-packages/pandas/core/indexes/base.py:3361, in Index.get_loc(self, key, method, tolerance) 3360 try: -> 3361 return self._engine.get_loc(casted_key) 3362 except KeyError as err: File ~/mambaforge/envs/dsc80/lib/python3.8/site-packages/pandas/_libs/index.pyx:76, in pandas._libs.index.IndexEngine.get_loc() File ~/mambaforge/envs/dsc80/lib/python3.8/site-packages/pandas/_libs/index.pyx:108, in pandas._libs.index.IndexEngine.get_loc() File pandas/_libs/hashtable_class_helper.pxi:5198, in pandas._libs.hashtable.PyObjectHashTable.get_item() File pandas/_libs/hashtable_class_helper.pxi:5206, in pandas._libs.hashtable.PyObjectHashTable.get_item() KeyError: 'breed' The above exception was the direct cause of the following exception: KeyError Traceback (most recent call last) Cell In[29], line 2 1 # Breeds are stored in the index, which is not a column! ----> 2 dogs['breed'] File ~/mambaforge/envs/dsc80/lib/python3.8/site-packages/pandas/core/frame.py:3458, in DataFrame.__getitem__(self, key) 3456 if self.columns.nlevels > 1: 3457 return self._getitem_multilevel(key) -> 3458 indexer = self.columns.get_loc(key) 3459 if is_integer(indexer): 3460 indexer = [indexer] File ~/mambaforge/envs/dsc80/lib/python3.8/site-packages/pandas/core/indexes/base.py:3363, in Index.get_loc(self, key, method, tolerance) 3361 return self._engine.get_loc(casted_key) 3362 except KeyError as err: -> 3363 raise KeyError(key) from err 3365 if is_scalar(key) and isna(key) and not self.hasnans: 3366 raise KeyError(key) KeyError: 'breed'
dogs.index
Index(['Brittany', 'Cairn Terrier', 'English Cocker Spaniel', 'Cocker Spaniel',
'Shetland Sheepdog', 'Siberian Husky', 'Lhasa Apso',
'Miniature Schnauzer', 'Chihuahua', 'English Springer Spaniel',
'German Shorthaired Pointer', 'Pointer', 'Tibetan Spaniel',
'Labrador Retriever', 'Maltese', 'Shih Tzu', 'Irish Setter',
'Golden Retriever', 'Chesapeake Bay Retriever', 'Tibetan Terrier',
'Gordon Setter', 'Pug', 'Norfolk Terrier', 'English Toy Spaniel',
'Cavalier King Charles Spaniel', 'Basenji',
'Staffordshire Bull Terrier', 'Pembroke Welsh Corgi', 'Clumber Spaniel',
'Dandie Dinmont Terrier', 'Giant Schnauzer', 'Scottish Terrier',
'Kerry Blue Terrier', 'Afghan Hound', 'Newfoundland',
'Rhodesian Ridgeback', 'Borzoi', 'Bull Terrier', 'Alaskan Malamute',
'Bloodhound', 'Bullmastiff', 'Mastiff', 'Saint Bernard'],
dtype='object', name='breed')
dogs
| kind | lifetime_cost | longevity | size | weight | height | |
|---|---|---|---|---|---|---|
| breed | ||||||
| Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
| Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
| English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
| Cocker Spaniel | sporting | 24330.0 | 12.50 | small | 25.0 | 14.5 |
| ... | ... | ... | ... | ... | ... | ... |
| Bloodhound | hound | 13824.0 | 6.75 | large | 85.0 | 25.0 |
| Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
| Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
| Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
43 rows × 6 columns
# What are the unique dog kinds?
dogs['kind'].unique()
array(['sporting', 'terrier', 'herding', 'working', 'non-sporting', 'toy',
'hound'], dtype=object)
# What's the distribution of kinds?
dogs['kind'].value_counts()
sporting 12 terrier 8 working 7 toy 6 hound 5 non-sporting 3 herding 2 Name: kind, dtype: int64
# What's the mean of the 'longevity' column?
dogs['longevity'].mean()
11.340697674418605
# Tell me more about the 'weight' column.
dogs['weight'].describe()
count 43.00 mean 49.35 std 39.42 min 5.00 25% 18.00 50% 36.50 75% 67.50 max 175.00 Name: weight, dtype: float64
# Sort the 'lifetime_cost' column. Note that here we're using sort_values on a Series, not a DataFrame!
dogs['lifetime_cost'].sort_values()
breed
Mastiff 13581.0
Bloodhound 13824.0
Bullmastiff 13936.0
Borzoi 16176.0
...
Tibetan Spaniel 25549.0
German Shorthaired Pointer 25842.0
Chihuahua 26250.0
Giant Schnauzer 26686.0
Name: lifetime_cost, Length: 43, dtype: float64
Subsetting rows (and columns)¶
Using loc to slice rows and columns using labels¶
# The first argument is the row label
# ↓
dogs.loc['Pug', 'longevity']
# ↑
# The second argument is the column label
11.0
💡 Pro-Tip: Using Pandas Tutor¶
If you want, you can install pandas_tutor from pip (in your terminal):
pip install pandas_tutor
Then, you can load the extension by adding:
%reload_ext pandas_tutor
At the top of your notebook. After that, you can render visualizations with the %%pandas_tutor or %%pt cell magics:
# Pandas Tutor setup. You'll need to run `pip install pandas_tutor` in your terminal
# for this cell to work, but you can also ignore the error and continue onward.
%reload_ext pandas_tutor
%set_pandas_tutor_options {"maxDisplayCols": 8, "nohover": True, "projectorMode": True}
%%pt
dogs.loc['Pug', 'longevity']
.loc is flexible¶
.loc will expand dimensions whenever an argument is a sequence:
dogs.loc[['Pug', 'Labrador Retriever'], ['kind', 'size']]
| kind | size | |
|---|---|---|
| breed | ||
| Pug | toy | medium |
| Labrador Retriever | sporting | medium |
dogs.loc[['Pug', 'Labrador Retriever'], :]
| kind | lifetime_cost | longevity | size | weight | height | |
|---|---|---|---|---|---|---|
| breed | ||||||
| Pug | toy | 18527.0 | 11.00 | medium | 16.0 | 16.0 |
| Labrador Retriever | sporting | 21299.0 | 12.04 | medium | 67.5 | 23.0 |
# Shortcut for line above
dogs.loc[['Pug', 'Labrador Retriever']]
| kind | lifetime_cost | longevity | size | weight | height | |
|---|---|---|---|---|---|---|
| breed | ||||||
| Pug | toy | 18527.0 | 11.00 | medium | 16.0 | 16.0 |
| Labrador Retriever | sporting | 21299.0 | 12.04 | medium | 67.5 | 23.0 |
Review: Filtering (aka Querying)¶
- Filtering is the act of selecting rows in a DataFrame that satisfy certain condition(s).
- Comparisons with arrays (Series) result in Boolean arrays (Series).
- We can use comparisons along with the
locoperator to filter a DataFrame.
dogs
| kind | lifetime_cost | longevity | size | weight | height | |
|---|---|---|---|---|---|---|
| breed | ||||||
| Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
| Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
| English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
| Cocker Spaniel | sporting | 24330.0 | 12.50 | small | 25.0 | 14.5 |
| ... | ... | ... | ... | ... | ... | ... |
| Bloodhound | hound | 13824.0 | 6.75 | large | 85.0 | 25.0 |
| Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
| Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
| Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
43 rows × 6 columns
dogs.loc[dogs['weight'] < 10]
| kind | lifetime_cost | longevity | size | weight | height | |
|---|---|---|---|---|---|---|
| breed | ||||||
| Chihuahua | toy | 26250.0 | 16.50 | small | 5.5 | 5.0 |
| Maltese | toy | 19084.0 | 12.25 | small | 5.0 | 9.0 |
dogs.loc[dogs.index.str.contains('Spaniel')]
| kind | lifetime_cost | longevity | size | weight | height | |
|---|---|---|---|---|---|---|
| breed | ||||||
| English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
| Cocker Spaniel | sporting | 24330.0 | 12.50 | small | 25.0 | 14.5 |
| English Springer Spaniel | sporting | 21946.0 | 12.54 | medium | 45.0 | 19.5 |
| Tibetan Spaniel | non-sporting | 25549.0 | 14.42 | small | 12.0 | 10.0 |
| English Toy Spaniel | toy | 17521.0 | 10.10 | small | 11.0 | 10.0 |
| Cavalier King Charles Spaniel | toy | 18639.0 | 11.29 | small | 15.5 | 12.5 |
| Clumber Spaniel | sporting | 18084.0 | 10.00 | medium | 70.0 | 18.5 |
# Because filtering is so common, there's a shortcut:
dogs[dogs.index.str.contains('Spaniel')]
| kind | lifetime_cost | longevity | size | weight | height | |
|---|---|---|---|---|---|---|
| breed | ||||||
| English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
| Cocker Spaniel | sporting | 24330.0 | 12.50 | small | 25.0 | 14.5 |
| English Springer Spaniel | sporting | 21946.0 | 12.54 | medium | 45.0 | 19.5 |
| Tibetan Spaniel | non-sporting | 25549.0 | 14.42 | small | 12.0 | 10.0 |
| English Toy Spaniel | toy | 17521.0 | 10.10 | small | 11.0 | 10.0 |
| Cavalier King Charles Spaniel | toy | 18639.0 | 11.29 | small | 15.5 | 12.5 |
| Clumber Spaniel | sporting | 18084.0 | 10.00 | medium | 70.0 | 18.5 |
Note that because we set the index to 'breed' earlier, we can select rows based on dog breeds without having to query.
dogs
| kind | lifetime_cost | longevity | size | weight | height | |
|---|---|---|---|---|---|---|
| breed | ||||||
| Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
| Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
| English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
| Cocker Spaniel | sporting | 24330.0 | 12.50 | small | 25.0 | 14.5 |
| ... | ... | ... | ... | ... | ... | ... |
| Bloodhound | hound | 13824.0 | 6.75 | large | 85.0 | 25.0 |
| Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
| Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
| Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
43 rows × 6 columns
# Series!
dogs.loc['Maltese']
kind toy lifetime_cost 19084.0 longevity 12.25 size small weight 5.0 height 9.0 Name: Maltese, dtype: object
If 'breed' was instead a column, then we'd need to query to access information about a particular school.
dogs_reset = dogs.reset_index()
dogs_reset
| breed | kind | lifetime_cost | longevity | size | weight | height | |
|---|---|---|---|---|---|---|---|
| 0 | Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
| 1 | Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
| 2 | English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
| 3 | Cocker Spaniel | sporting | 24330.0 | 12.50 | small | 25.0 | 14.5 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 39 | Bloodhound | hound | 13824.0 | 6.75 | large | 85.0 | 25.0 |
| 40 | Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
| 41 | Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
| 42 | Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
43 rows × 7 columns
# DataFrame!
dogs_reset[dogs_reset['breed'] == 'Maltese']
| breed | kind | lifetime_cost | longevity | size | weight | height | |
|---|---|---|---|---|---|---|---|
| 14 | Maltese | toy | 19084.0 | 12.25 | small | 5.0 | 9.0 |
Filtering with Multiple Conditions¶
Remember, you need parentheses around each condition. Also, you must use & and | instead of the and and or keywords. pandas makes weird decisions sometimes!
dogs[(dogs['weight'] < 20) & (dogs['kind'] == 'terrier')]
| kind | lifetime_cost | longevity | size | weight | height | |
|---|---|---|---|---|---|---|
| breed | ||||||
| Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
| Miniature Schnauzer | terrier | 20087.0 | 11.81 | small | 15.5 | 13.0 |
| Norfolk Terrier | terrier | 24308.0 | 13.07 | small | 12.0 | 9.5 |
💡 Pro-Tip: Using .query (optional)¶
.query is a convenient way to filter, since you don't need parentheses and you can use the and and or keywords. We'll use it during lecture, but you won't need to use it yourself unless you'd like to. (It won't be used in our exams.)
dogs.query('weight < 20 and kind == "terrier"')
| kind | lifetime_cost | longevity | size | weight | height | |
|---|---|---|---|---|---|---|
| breed | ||||||
| Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
| Miniature Schnauzer | terrier | 20087.0 | 11.81 | small | 15.5 | 13.0 |
| Norfolk Terrier | terrier | 24308.0 | 13.07 | small | 12.0 | 9.5 |
dogs.query('kind in ["sporting", "terrier"] and lifetime_cost < 20000')
| kind | lifetime_cost | longevity | size | weight | height | |
|---|---|---|---|---|---|---|
| breed | ||||||
| English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
| Chesapeake Bay Retriever | sporting | 16697.0 | 9.48 | large | 67.5 | 23.5 |
| Gordon Setter | sporting | 19605.0 | 11.10 | large | 62.5 | 25.0 |
| Clumber Spaniel | sporting | 18084.0 | 10.00 | medium | 70.0 | 18.5 |
| Scottish Terrier | terrier | 17525.0 | 10.69 | small | 20.0 | 10.0 |
| Kerry Blue Terrier | terrier | 17240.0 | 9.40 | medium | 36.5 | 18.5 |
| Bull Terrier | terrier | 18490.0 | 10.21 | medium | 60.0 | 21.5 |
Don't forget iloc!¶
ilocstands for "integer location".ilocis likeloc, but it selects rows and columns based off of integer positions only.
dogs
| kind | lifetime_cost | longevity | size | weight | height | |
|---|---|---|---|---|---|---|
| breed | ||||||
| Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
| Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
| English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
| Cocker Spaniel | sporting | 24330.0 | 12.50 | small | 25.0 | 14.5 |
| ... | ... | ... | ... | ... | ... | ... |
| Bloodhound | hound | 13824.0 | 6.75 | large | 85.0 | 25.0 |
| Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
| Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
| Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
43 rows × 6 columns
dogs.iloc[3:7, :-1]
| kind | lifetime_cost | longevity | size | weight | |
|---|---|---|---|---|---|
| breed | |||||
| Cocker Spaniel | sporting | 24330.0 | 12.50 | small | 25.0 |
| Shetland Sheepdog | herding | 21006.0 | 12.53 | small | 22.0 |
| Siberian Husky | working | 22049.0 | 12.58 | medium | 47.5 |
| Lhasa Apso | non-sporting | 22031.0 | 13.92 | small | 15.0 |
iloc is often most useful when we sort first. For instance, to find the weight of the longest-living dog breed in the dataset:
dogs.sort_values('longevity', ascending=False)['weight'].iloc[0]
5.5
# Finding the breed involves sorting, but not iloc.
dogs.sort_values('longevity', ascending=False).index[0]
'Chihuahua'
More Practice¶
Consider the DataFrame below.
jack = pd.DataFrame({1: ['fee', 'fi'],
'1': ['fo', 'fum']})
jack
| 1 | 1 | |
|---|---|---|
| 0 | fee | fo |
| 1 | fi | fum |
For each of the following pieces of code, predict what the output will be. Then, uncomment the line of code and see for yourself. We may not be able to cover these all in class; if so, make sure to try them on your own. Here's a Pandas Tutor link to visualize these!
# jack[1]
# jack[[1]]
# jack['1']
# jack[[1, 1]]
# jack.loc[1]
# jack.loc[jack[1] == 'fo']
# jack[1, ['1', 1]]
# jack.loc[1,1]
Adding and modifying columns¶
Adding and modifying columns, using a copy¶
- To add a new column to a DataFrame, use the
assignmethod.- To change the values in a column, add a new column with the same name as the existing column.
- Like most
pandasmethods,assignreturns a new DataFrame.- Pro ✅: This doesn't inadvertently change any existing variables.
- Con ❌: It is not very space efficient, as it creates a new copy each time it is called.
dogs.assign(cost_per_year=dogs['lifetime_cost'] / dogs['longevity'])
| kind | lifetime_cost | longevity | size | weight | height | cost_per_year | |
|---|---|---|---|---|---|---|---|
| breed | |||||||
| Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 | 1748.37 |
| Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 | 1589.02 |
| English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 | 1628.90 |
| Cocker Spaniel | sporting | 24330.0 | 12.50 | small | 25.0 | 14.5 | 1946.40 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| Bloodhound | hound | 13824.0 | 6.75 | large | 85.0 | 25.0 | 2048.00 |
| Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 | 1840.95 |
| Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 | 2089.38 |
| Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 | 2573.52 |
43 rows × 7 columns
dogs
| kind | lifetime_cost | longevity | size | weight | height | |
|---|---|---|---|---|---|---|
| breed | ||||||
| Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
| Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
| English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
| Cocker Spaniel | sporting | 24330.0 | 12.50 | small | 25.0 | 14.5 |
| ... | ... | ... | ... | ... | ... | ... |
| Bloodhound | hound | 13824.0 | 6.75 | large | 85.0 | 25.0 |
| Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
| Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
| Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
43 rows × 6 columns
💡 Pro-Tip: Method chaining¶
I recommend chaining methods together instead of writing one long line:
(dogs
.assign(cost_per_year=dogs['lifetime_cost'] / dogs['longevity'])
.sort_values('cost_per_year')
.iloc[:5]
)
| kind | lifetime_cost | longevity | size | weight | height | cost_per_year | |
|---|---|---|---|---|---|---|---|
| breed | |||||||
| Maltese | toy | 19084.0 | 12.25 | small | 5.0 | 9.00 | 1557.88 |
| Lhasa Apso | non-sporting | 22031.0 | 13.92 | small | 15.0 | 10.50 | 1582.69 |
| Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.00 | 1589.02 |
| Chihuahua | toy | 26250.0 | 16.50 | small | 5.5 | 5.00 | 1590.91 |
| Shih Tzu | toy | 21152.0 | 13.20 | small | 12.5 | 9.75 | 1602.42 |
💡 Pro-Tip: assign for column names with special characters¶
You can also use assign when the desired column name has spaces (and other special characters) by unpacking a dict:
(dogs
.assign(**{'Cost per year 💵': dogs['lifetime_cost'] / dogs['longevity']})
)
| kind | lifetime_cost | longevity | size | weight | height | Cost per year 💵 | |
|---|---|---|---|---|---|---|---|
| breed | |||||||
| Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 | 1748.37 |
| Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 | 1589.02 |
| English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 | 1628.90 |
| Cocker Spaniel | sporting | 24330.0 | 12.50 | small | 25.0 | 14.5 | 1946.40 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| Bloodhound | hound | 13824.0 | 6.75 | large | 85.0 | 25.0 | 2048.00 |
| Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 | 1840.95 |
| Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 | 2089.38 |
| Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 | 2573.52 |
43 rows × 7 columns
Adding and modifying columns, in-place¶
- You can assign a new column to a DataFrame in-place using
[].- This works like dictionary assignment.
- This modifies the underlying DataFrame, unlike
assign, which returns a new DataFrame.
- This is the more "common" way of adding/modifying columns.
- ⚠️ Warning: Exercise caution when using this approach, since this approach changes the values of existing variables.
# By default, .copy() returns a deep copy of the object it is called on,
# meaning that if you change the copy the original remains unmodified.
dogs_copy = dogs.copy()
dogs_copy.head(2)
| kind | lifetime_cost | longevity | size | weight | height | |
|---|---|---|---|---|---|---|
| breed | ||||||
| Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
| Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
dogs_copy['cost_per_year'] = dogs_copy['lifetime_cost'] / dogs_copy['longevity']
dogs_copy
| kind | lifetime_cost | longevity | size | weight | height | cost_per_year | |
|---|---|---|---|---|---|---|---|
| breed | |||||||
| Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 | 1748.37 |
| Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 | 1589.02 |
| English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 | 1628.90 |
| Cocker Spaniel | sporting | 24330.0 | 12.50 | small | 25.0 | 14.5 | 1946.40 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| Bloodhound | hound | 13824.0 | 6.75 | large | 85.0 | 25.0 | 2048.00 |
| Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 | 1840.95 |
| Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 | 2089.38 |
| Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 | 2573.52 |
43 rows × 7 columns
Note that we never reassigned dogs in the cell above – that is, we never wrote dogs_copy = ... – though it was still modified.
Mutability¶
DataFrames, like lists, arrays, and dictionaries, are mutable. As you learned in DSC 20, this means that they can be modified after being created.
Not only does this explain the behavior on the previous slide, but it also explains the following:
dogs_copy.head(2)
| kind | lifetime_cost | longevity | size | weight | height | cost_per_year | |
|---|---|---|---|---|---|---|---|
| breed | |||||||
| Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 | 1748.37 |
| Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 | 1589.02 |
def cost_in_thousands():
dogs_copy['lifetime_cost'] = dogs_copy['lifetime_cost'] / 1000
# What happens when we run this twice?
cost_in_thousands()
dogs_copy
| kind | lifetime_cost | longevity | size | weight | height | cost_per_year | |
|---|---|---|---|---|---|---|---|
| breed | |||||||
| Brittany | sporting | 22.59 | 12.92 | medium | 35.0 | 19.0 | 1748.37 |
| Cairn Terrier | terrier | 21.99 | 13.84 | small | 14.0 | 10.0 | 1589.02 |
| English Cocker Spaniel | sporting | 18.99 | 11.66 | medium | 30.0 | 16.0 | 1628.90 |
| Cocker Spaniel | sporting | 24.33 | 12.50 | small | 25.0 | 14.5 | 1946.40 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| Bloodhound | hound | 13.82 | 6.75 | large | 85.0 | 25.0 | 2048.00 |
| Bullmastiff | working | 13.94 | 7.57 | large | 115.0 | 25.5 | 1840.95 |
| Mastiff | working | 13.58 | 6.50 | large | 175.0 | 30.0 | 2089.38 |
| Saint Bernard | working | 20.02 | 7.78 | large | 155.0 | 26.5 | 2573.52 |
43 rows × 7 columns
⚠️ Avoid mutation when possible¶
Note that dogs was modified, even though we didn't reassign it! These unintended consequences can influence the behavior of test cases on labs and projects, among other things!
To avoid this, it's a good idea to avoid mutation when possible. If you must use mutation, include df = df.copy() as the first line in functions that take DataFrames as input.
Also, some methods let you use the inplace=True argument to mutate the original. Don't use this argument, since future pandas releases plan to remove it.
Replacing values¶
Instead of mutation, we recommend using replace, which returns a copy of the original dataframe:
dogs.replace({'kind': {'sporting': "Sam's favorites"}})
| kind | lifetime_cost | longevity | size | weight | height | |
|---|---|---|---|---|---|---|
| breed | ||||||
| Brittany | Sam's favorites | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
| Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
| English Cocker Spaniel | Sam's favorites | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
| Cocker Spaniel | Sam's favorites | 24330.0 | 12.50 | small | 25.0 | 14.5 |
| ... | ... | ... | ... | ... | ... | ... |
| Bloodhound | hound | 13824.0 | 6.75 | large | 85.0 | 25.0 |
| Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
| Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
| Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
43 rows × 6 columns
Axes¶
Axes¶
- The rows and columns of a DataFrame are both stored as Series.
- The axis specifies the direction of a "slice" of a DataFrame.

- Axis 0 refers to the index (rows).
- Axis 1 refers to the columns.
DataFrame methods with axis¶
Consider the DataFrame A defined below using a dictionary.
A = pd.DataFrame({
'A': [1, 4],
'B': [2, 5],
'C': [3, 6],
})
A
| A | B | C | |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 |
If we specify axis=0, A.sum will "compress" along axis 0, and keep the column labels intact.
A.sum(axis=1)
0 6 1 15 dtype: int64
A.sum(0)
A 5 B 7 C 9 dtype: int64
If we specify axis=1, A.sum will "compress" along axis 1, and keep the row labels (index) intact.
A.sum(axis=1)
0 6 1 15 dtype: int64

What's the default axis?
A
| A | B | C | |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 |
A.sum()
A 5 B 7 C 9 dtype: int64
DataFrame methods with axis¶
- In addition to
sum, many other Series methods work on DataFrames. - In such cases, the DataFrame method usually applies the Series method to every row or column.
- Many of these methods accept an
axisargument; the default is usuallyaxis=0.
# Max element in each column
dogs.max()
kind working lifetime_cost 26686.0 longevity 16.5 size small weight 175.0 height 30.0 dtype: object
# The number of unique values in each column.
dogs.nunique()
kind 7 lifetime_cost 43 longevity 40 size 3 weight 37 height 30 dtype: int64
# describe doesn't accept an axis argument; it works on every numeric column in the DataFrame it is called on.
dogs.describe()
| lifetime_cost | longevity | weight | height | |
|---|---|---|---|---|
| count | 43.00 | 43.00 | 43.00 | 43.00 |
| mean | 20532.84 | 11.34 | 49.35 | 18.34 |
| std | 3290.78 | 2.05 | 39.42 | 6.83 |
| min | 13581.00 | 6.50 | 5.00 | 5.00 |
| 25% | 18508.50 | 10.05 | 18.00 | 11.75 |
| 50% | 21006.00 | 11.81 | 36.50 | 18.50 |
| 75% | 22072.50 | 12.52 | 67.50 | 25.00 |
| max | 26686.00 | 16.50 | 175.00 | 30.00 |
pandas and numpy¶

pandas is built upon numpy¶
- A Series in
pandasis anumpyarray with an index. - A DataFrame is like a dictionary of columns, each of which is a
numpyarray. - Many operations in
pandasare fast because they usenumpy's implementations. - If you need access the array underlying a DataFrame or Series, use the
to_numpymethod.
dogs['lifetime_cost']
breed
Brittany 22589.0
Cairn Terrier 21992.0
English Cocker Spaniel 18993.0
Cocker Spaniel 24330.0
...
Bloodhound 13824.0
Bullmastiff 13936.0
Mastiff 13581.0
Saint Bernard 20022.0
Name: lifetime_cost, Length: 43, dtype: float64
pandas data types¶
- Each Series (column) has a
numpydata type, which refers to the type of the values stored within. Access it using thedtypesattribute. - A column's data type determines which operations can be applied to it.
pandastries to guess the correct data types for a given DataFrame, and is often wrong.- This can lead to incorrect calculations and poor memory/time performance.
- As a result, you will often need to explicitly convert between data types.
dogs.dtypes
kind object lifetime_cost float64 longevity float64 size object weight float64 height float64 dtype: object
pandas data types¶
Notice that Python str types are object types in numpy and pandas.
| Pandas dtype | Python type | NumPy type | SQL type | Usage |
|---|---|---|---|---|
| int64 | int | int_, int8,...,int64, uint8,...,uint64 | INT, BIGINT | Integer numbers |
| float64 | float | float_, float16, float32, float64 | FLOAT | Floating point numbers |
| bool | bool | bool_ | BOOL | True/False values |
| datetime64 or Timestamp | datetime.datetime | datetime64 | DATETIME | Date and time values |
| timedelta64 or Timedelta | datetime.timedelta | timedelta64 | NA | Differences between two datetimes |
| category | NA | NA | ENUM | Finite list of text values |
| object | str | string, unicode | NA | Text |
| object | NA | object | NA | Mixed types |
This article details how pandas stores different data types under the hood.
This article explains how numpy/pandas int64 operations differ from vanilla int operations.
Type conversion¶
You can change the data type of a Series using the .astype Series method.
For instance, we can change the data type of the 'lifetime_cost' column in dogs to be int64:
dogs.head()
| kind | lifetime_cost | longevity | size | weight | height | |
|---|---|---|---|---|---|---|
| breed | ||||||
| Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
| Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
| English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
| Cocker Spaniel | sporting | 24330.0 | 12.50 | small | 25.0 | 14.5 |
| Shetland Sheepdog | herding | 21006.0 | 12.53 | small | 22.0 | 14.5 |
dogs.dtypes
kind object lifetime_cost float64 longevity float64 size object weight float64 height float64 dtype: object
dogs['lifetime_cost'].astype(np.int64)
breed
Brittany 22589
Cairn Terrier 21992
English Cocker Spaniel 18993
Cocker Spaniel 24330
...
Bloodhound 13824
Bullmastiff 13936
Mastiff 13581
Saint Bernard 20022
Name: lifetime_cost, Length: 43, dtype: int64
💡 Pro-Tip: Setting dtypes in read_csv¶
Usually, we prefer to set the correct dtypes in read_csv, since it can help pandas load in files more quickly:
dogs = pd.read_csv('data/dogs43.csv', dtype={'lifetime_cost': int})
dogs
| breed | kind | lifetime_cost | longevity | size | weight | height | |
|---|---|---|---|---|---|---|---|
| 0 | Brittany | sporting | 22589 | 12.92 | medium | 35.0 | 19.0 |
| 1 | Cairn Terrier | terrier | 21992 | 13.84 | small | 14.0 | 10.0 |
| 2 | English Cocker Spaniel | sporting | 18993 | 11.66 | medium | 30.0 | 16.0 |
| 3 | Cocker Spaniel | sporting | 24330 | 12.50 | small | 25.0 | 14.5 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 39 | Bloodhound | hound | 13824 | 6.75 | large | 85.0 | 25.0 |
| 40 | Bullmastiff | working | 13936 | 7.57 | large | 115.0 | 25.5 |
| 41 | Mastiff | working | 13581 | 6.50 | large | 175.0 | 30.0 |
| 42 | Saint Bernard | working | 20022 | 7.78 | large | 155.0 | 26.5 |
43 rows × 7 columns
dogs.dtypes
breed object kind object lifetime_cost int64 longevity float64 size object weight float64 height float64 dtype: object
Putting it all together¶
Talk to your neighbor about a dog breed that you personally like or know the name of. Then, try to find a few other dog breeds that are similar in weight to yours in all_dogs. Which similar breeds have the lowest and highest lifetime_cost? intelligence_rank? Are there any similar breeds that you haven't heard of before?
As a bonus, look up these dog breeds on the AKC website to see how they look!
all_dogs
| breed | group | datadog | popularity_all | ... | megarank | size | weight | height | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Border Collie | herding | 3.64 | 45 | ... | 29.0 | medium | NaN | 20.0 |
| 1 | Border Terrier | terrier | 3.61 | 80 | ... | 1.0 | small | 13.5 | NaN |
| 2 | Brittany | sporting | 3.54 | 30 | ... | 11.0 | medium | 35.0 | 19.0 |
| 3 | Cairn Terrier | terrier | 3.53 | 59 | ... | 2.0 | small | 14.0 | 10.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 168 | Welsh Terrier | terrier | NaN | 99 | ... | NaN | small | 20.0 | 15.0 |
| 169 | Wire Fox Terrier | terrier | NaN | 100 | ... | NaN | small | 17.5 | 15.0 |
| 170 | Wirehaired Pointing Griffon | sporting | NaN | 92 | ... | NaN | medium | NaN | 22.0 |
| 171 | Xoloitzcuintli | non-sporting | NaN | 155 | ... | NaN | medium | NaN | 16.5 |
172 rows × 18 columns
Summary, next time¶
Summary¶
pandasis the library for tabular data manipulation in Python.- There are three key data structures in
pandas: DataFrame, Series, and Index. - Refer to the lecture notebook and the
pandasdocumentation for tips. pandasrelies heavily onnumpy. An understanding of how data types work in both will allow you to write more efficient and bug-free code.- Series and DataFrames share many methods (refer to the
pandasdocumentation for more details). - Most
pandasmethods return copies of Series/DataFrames. Be careful when using techniques that modify values in-place. - Next time:
groupbyand data granularity.


