In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib_inline.backend_inline import set_matplotlib_formats

set_matplotlib_formats("svg")
sns.set_context("poster")
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (10, 5)
pd.set_option("display.max_rows", 8)
pd.set_option("display.max_columns", 8)
pd.set_option("display.precision", 2)

Lecture 2 – DataFrame Fundamentals¶

DSC 80, Fall 2023¶

Pull the repo from GitHub and open lec02.ipynb, we will be coding today!

Announcements 📣¶

  • The Welcome Survey is due tonight at 11:59pm.
  • Lab 1 is released, and is due next Monday, Oct 9th at 11:59PM!
    • See the Tech Support page for instructions and watch this video 🎥 for tips on how to set up your environment and work on assignments.
    • Please try to set up your computer ASAP, since we have OH on Friday but not over the weekend to help debug your environment.
  • You may use a slip day, in which case the due date will be Oct 10th.
  • Discussion tomorrow will talk about what a conda environment is, and how to debug package import issues on your own.
  • Lecture recordings are available here .

Agenda¶

  • Whirlwind review of numpy and babypandas.
  • pandas DataFrame objects.
  • Subsetting dataframes
    • .loc, .iloc, filtering/querying

Can't cover every single detail! The pandas documentation will be your friend.

Review: numpy¶

  • NumPy stands for "numerical Python". It is a commonly-used Python module that enables fast computation involving arrays and matrices.
  • numpy's main object is the array. In numpy, arrays are:
    • Homogenous – all values are of the same type.
    • (Potentially) multi-dimensional.
  • Computation in numpy is fast because:
    • Much of it is implemented in C.
    • numpy arrays are stored more efficiently in memory than, say, Python lists.
  • This site provides a good overview of numpy arrays.

We used numpy in DSC 10 to work with sequences of data:

In [2]:
arr = np.arange(10)
arr
Out[2]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [3]:
2 ** arr
Out[3]:
array([  1,   2,   4,   8,  16,  32,  64, 128, 256, 512])

⚠️ The dangers of for-loops¶

  • for-loops are slow when processing large datasets. You will rarely write for-loops in DSC 80 (except for Lab 1 and Project 1), and may be penalized on assignments for using them when unnecessary!
  • One of the biggest benefits of numpy is that it supports vectorized operations.
    • If a and b are two arrays of the same length, then a + b is a new array of the same length containing the element-wise sum of a and b.
  • To illustrate how much faster numpy arithmetic is than using a for-loop, let's compute the squares of the numbers between 0 and 1,000,000:
    • Using a for-loop.
    • Using vectorized arithmetic, through numpy.
In [4]:
%%timeit
squares = []
for i in range(1_000_000):
    squares.append(i * i)
65 ms ± 926 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [5]:
%%timeit
squares = np.arange(1_000_000) ** 2
426 µs ± 1.97 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
  • Python: takes about 0.06 seconds per loop
  • numpy: takes about 0.0004 seconds per loop, more than 100x faster!

Introduction to pandas 🐼¶

Baby pandas¶

  • a subset of pandas that is beginner friendly.
No description has been provided for this image

pandas¶

  • everything that you learned in babypandas will carry over.
No description has been provided for this image

pandas¶

No description has been provided for this image
  • pandas is the Python library for tabular data manipulation.
  • Before pandas was developed, the standard data science workflow involved using multiple languages (Python, R, Java) in a single project.
  • Wes McKinney, the original developer of pandas, wanted a library which would allow everything to be done in Python.
    • Python is faster to develop in than Java, and is more general-purpose than R.

pandas data structures¶

There are three key data structures at the core of pandas:

  • DataFrame: 2 dimensional tables.
  • Series: 1 dimensional array-like object, typically representing a column or row.
  • Index: sequence of column or row labels.
No description has been provided for this image

Importing pandas and related libraries¶

pandas is almost always imported in conjunction with numpy:

In [6]:
import pandas as pd
import numpy as np

Example: Dog Breeds (woof!) 🐶¶

Data originally from the American Kennel Club, which was made into a neat plot:

But...¶

The data are no longer available! One website has a slightly different version: https://tmfilho.github.io/akcdata/

We'll use the version that Sam saved while the data were still online.

In [7]:
all_dogs = pd.read_csv('data/all_dogs.csv')
all_dogs
Out[7]:
breed group datadog popularity_all ... megarank size weight height
0 Border Collie herding 3.64 45 ... 29.0 medium NaN 20.0
1 Border Terrier terrier 3.61 80 ... 1.0 small 13.5 NaN
2 Brittany sporting 3.54 30 ... 11.0 medium 35.0 19.0
3 Cairn Terrier terrier 3.53 59 ... 2.0 small 14.0 10.0
... ... ... ... ... ... ... ... ... ...
168 Welsh Terrier terrier NaN 99 ... NaN small 20.0 15.0
169 Wire Fox Terrier terrier NaN 100 ... NaN small 17.5 15.0
170 Wirehaired Pointing Griffon sporting NaN 92 ... NaN medium NaN 22.0
171 Xoloitzcuintli non-sporting NaN 155 ... NaN medium NaN 16.5

172 rows × 18 columns

In [8]:
all_dogs.columns
Out[8]:
Index(['breed', 'group', 'datadog', 'popularity_all', 'popularity',
       'lifetime_cost', 'intelligence_rank', 'longevity', 'ailments', 'price',
       'food_cost', 'grooming', 'kids', 'megarank_kids', 'megarank', 'size',
       'weight', 'height'],
      dtype='object')

Discussion Question¶

Let's refresh your DSC 10 knowledge! Find the most popular and least popular dog breeds using the popularity_all column.

In [9]:
# Fill in this cell

A Smaller Dogs Dataframe¶

The all_dogs dataframe is a bit large, so we have a smaller version here to make it easier to show pandas functionality.

In [10]:
dogs = pd.read_csv('data/dogs43.csv')
dogs
Out[10]:
breed kind lifetime_cost longevity size weight height
0 Brittany sporting 22589.0 12.92 medium 35.0 19.0
1 Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
2 English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
3 Cocker Spaniel sporting 24330.0 12.50 small 25.0 14.5
... ... ... ... ... ... ... ...
39 Bloodhound hound 13824.0 6.75 large 85.0 25.0
40 Bullmastiff working 13936.0 7.57 large 115.0 25.5
41 Mastiff working 13581.0 6.50 large 175.0 30.0
42 Saint Bernard working 20022.0 7.78 large 155.0 26.5

43 rows × 7 columns

Review: head, tail, shape, index, get, sort_values¶

To extract the first or last few rows of a DataFrame, use the head or tail methods.

In [11]:
dogs.head(3)
Out[11]:
breed kind lifetime_cost longevity size weight height
0 Brittany sporting 22589.0 12.92 medium 35.0 19.0
1 Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
2 English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
In [12]:
dogs.tail(2)
Out[12]:
breed kind lifetime_cost longevity size weight height
41 Mastiff working 13581.0 6.50 large 175.0 30.0
42 Saint Bernard working 20022.0 7.78 large 155.0 26.5

The shape attribute returns the DataFrame's number of rows and columns.

In [13]:
dogs.shape
Out[13]:
(43, 7)
In [14]:
# Index is 0, 1, 2, ..., 171
dogs['breed']
Out[14]:
0                   Brittany
1              Cairn Terrier
2     English Cocker Spaniel
3             Cocker Spaniel
               ...          
39                Bloodhound
40               Bullmastiff
41                   Mastiff
42             Saint Bernard
Name: breed, Length: 43, dtype: object
In [15]:
# The default index of a DataFrame is 0, 1, 2, 3, ...
dogs.index
Out[15]:
RangeIndex(start=0, stop=43, step=1)

We know that we can use .get() to select out a few columns...

In [16]:
# This is review from DSC 10 but most people don't use .get() in practice.
# Will cover in just a few minutes...
dogs.get(['breed', 'kind', 'longevity'])
Out[16]:
breed kind longevity
0 Brittany sporting 12.92
1 Cairn Terrier terrier 13.84
2 English Cocker Spaniel sporting 11.66
3 Cocker Spaniel sporting 12.50
... ... ... ...
39 Bloodhound hound 6.75
40 Bullmastiff working 7.57
41 Mastiff working 6.50
42 Saint Bernard working 7.78

43 rows × 3 columns

To sort by a column, use the sort_values method. Like most DataFrame and Series methods, sort_values returns a new DataFrame, and doesn't modify the original.

In [17]:
dogs.sort_values('longevity', ascending=False)
Out[17]:
breed kind lifetime_cost longevity size weight height
8 Chihuahua toy 26250.0 16.50 small 5.5 5.0
12 Tibetan Spaniel non-sporting 25549.0 14.42 small 12.0 10.0
6 Lhasa Apso non-sporting 22031.0 13.92 small 15.0 10.5
1 Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
... ... ... ... ... ... ... ...
42 Saint Bernard working 20022.0 7.78 large 155.0 26.5
40 Bullmastiff working 13936.0 7.57 large 115.0 25.5
39 Bloodhound hound 13824.0 6.75 large 85.0 25.0
41 Mastiff working 13581.0 6.50 large 175.0 30.0

43 rows × 7 columns

Setting the index¶

Think of each row's index as its unique identifier or name. Often, we like to set the index of a DataFrame to a unique identifier if we have one available. We can do so with the set_index method.

In [18]:
# By reassigning dogs, our changes will persist.
dogs = dogs.set_index('breed')
dogs
Out[18]:
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
Cocker Spaniel sporting 24330.0 12.50 small 25.0 14.5
... ... ... ... ... ... ...
Bloodhound hound 13824.0 6.75 large 85.0 25.0
Bullmastiff working 13936.0 7.57 large 115.0 25.5
Mastiff working 13581.0 6.50 large 175.0 30.0
Saint Bernard working 20022.0 7.78 large 155.0 26.5

43 rows × 6 columns

In [19]:
# There used to be 7 columns, but now there are only 6!
dogs.shape
Out[19]:
(43, 6)

💡 Pro-tip: Displaying more rows/columns¶

Sometimes, you just want pandas to display a lot of rows and columns. You can use this helper function to do that:

In [20]:
from IPython.display import display
def display_df(df, rows=pd.options.display.max_rows, cols=pd.options.display.max_columns):
    """Displays n rows and cols from df"""
    with pd.option_context("display.max_rows", rows,
                           "display.max_columns", cols):
        display(df)
In [21]:
display_df(dogs, rows=43)
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.00
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.00
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.00
Cocker Spaniel sporting 24330.0 12.50 small 25.0 14.50
Shetland Sheepdog herding 21006.0 12.53 small 22.0 14.50
Siberian Husky working 22049.0 12.58 medium 47.5 21.75
Lhasa Apso non-sporting 22031.0 13.92 small 15.0 10.50
Miniature Schnauzer terrier 20087.0 11.81 small 15.5 13.00
Chihuahua toy 26250.0 16.50 small 5.5 5.00
English Springer Spaniel sporting 21946.0 12.54 medium 45.0 19.50
German Shorthaired Pointer sporting 25842.0 11.46 large 62.5 24.00
Pointer sporting 24445.0 12.42 large 59.5 25.50
Tibetan Spaniel non-sporting 25549.0 14.42 small 12.0 10.00
Labrador Retriever sporting 21299.0 12.04 medium 67.5 23.00
Maltese toy 19084.0 12.25 small 5.0 9.00
Shih Tzu toy 21152.0 13.20 small 12.5 9.75
Irish Setter sporting 20323.0 11.63 large 65.0 26.00
Golden Retriever sporting 21447.0 12.04 medium 60.0 22.75
Chesapeake Bay Retriever sporting 16697.0 9.48 large 67.5 23.50
Tibetan Terrier non-sporting 20336.0 12.31 small 24.0 15.50
Gordon Setter sporting 19605.0 11.10 large 62.5 25.00
Pug toy 18527.0 11.00 medium 16.0 16.00
Norfolk Terrier terrier 24308.0 13.07 small 12.0 9.50
English Toy Spaniel toy 17521.0 10.10 small 11.0 10.00
Cavalier King Charles Spaniel toy 18639.0 11.29 small 15.5 12.50
Basenji hound 22096.0 13.58 medium 23.0 16.50
Staffordshire Bull Terrier terrier 21650.0 12.05 medium 31.0 15.00
Pembroke Welsh Corgi herding 23978.0 12.25 small 26.0 11.00
Clumber Spaniel sporting 18084.0 10.00 medium 70.0 18.50
Dandie Dinmont Terrier terrier 21633.0 12.17 small 21.0 9.00
Giant Schnauzer working 26686.0 10.00 large 77.5 25.50
Scottish Terrier terrier 17525.0 10.69 small 20.0 10.00
Kerry Blue Terrier terrier 17240.0 9.40 medium 36.5 18.50
Afghan Hound hound 24077.0 11.92 large 55.0 26.00
Newfoundland working 19351.0 9.32 large 125.0 27.00
Rhodesian Ridgeback hound 16530.0 9.10 large 77.5 25.50
Borzoi hound 16176.0 9.08 large 82.5 28.00
Bull Terrier terrier 18490.0 10.21 medium 60.0 21.50
Alaskan Malamute working 21986.0 10.67 large 80.0 24.00
Bloodhound hound 13824.0 6.75 large 85.0 25.00
Bullmastiff working 13936.0 7.57 large 115.0 25.50
Mastiff working 13581.0 6.50 large 175.0 30.00
Saint Bernard working 20022.0 7.78 large 155.0 26.50

Selecting columns¶

Selecting columns in babypandas 👶🐼¶

  • In babypandas, you selected columns using the .get method.
  • .get also works in pandas, but it is not idiomatic – people don't usually use it.
In [22]:
dogs
Out[22]:
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
Cocker Spaniel sporting 24330.0 12.50 small 25.0 14.5
... ... ... ... ... ... ...
Bloodhound hound 13824.0 6.75 large 85.0 25.0
Bullmastiff working 13936.0 7.57 large 115.0 25.5
Mastiff working 13581.0 6.50 large 175.0 30.0
Saint Bernard working 20022.0 7.78 large 155.0 26.5

43 rows × 6 columns

In [23]:
dogs.get('size')
Out[23]:
breed
Brittany                  medium
Cairn Terrier              small
English Cocker Spaniel    medium
Cocker Spaniel             small
                           ...  
Bloodhound                 large
Bullmastiff                large
Mastiff                    large
Saint Bernard              large
Name: size, Length: 43, dtype: object
In [24]:
# This doesn't error, but sometimes we'd like it to.
dogs.get('size oops!')

Selecting columns with []¶

  • The standard way to select a column in pandas is by using the [] operator.
  • Specifying a column name returns the column as a Series.
  • Specifying a list of column names returns a DataFrame.
In [25]:
dogs
Out[25]:
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
Cocker Spaniel sporting 24330.0 12.50 small 25.0 14.5
... ... ... ... ... ... ...
Bloodhound hound 13824.0 6.75 large 85.0 25.0
Bullmastiff working 13936.0 7.57 large 115.0 25.5
Mastiff working 13581.0 6.50 large 175.0 30.0
Saint Bernard working 20022.0 7.78 large 155.0 26.5

43 rows × 6 columns

In [26]:
# Returns a Series.
dogs['kind']
Out[26]:
breed
Brittany                  sporting
Cairn Terrier              terrier
English Cocker Spaniel    sporting
Cocker Spaniel            sporting
                            ...   
Bloodhound                   hound
Bullmastiff                working
Mastiff                    working
Saint Bernard              working
Name: kind, Length: 43, dtype: object
In [27]:
# Returns a DataFrame.
dogs[['kind', 'size']]
Out[27]:
kind size
breed
Brittany sporting medium
Cairn Terrier terrier small
English Cocker Spaniel sporting medium
Cocker Spaniel sporting small
... ... ...
Bloodhound hound large
Bullmastiff working large
Mastiff working large
Saint Bernard working large

43 rows × 2 columns

In [28]:
# 🤔
dogs[['kind']]
Out[28]:
kind
breed
Brittany sporting
Cairn Terrier terrier
English Cocker Spaniel sporting
Cocker Spaniel sporting
... ...
Bloodhound hound
Bullmastiff working
Mastiff working
Saint Bernard working

43 rows × 1 columns

In [29]:
# Breeds are stored in the index, which is not a column!
dogs['breed']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/mambaforge/envs/dsc80/lib/python3.8/site-packages/pandas/core/indexes/base.py:3361, in Index.get_loc(self, key, method, tolerance)
   3360 try:
-> 3361     return self._engine.get_loc(casted_key)
   3362 except KeyError as err:

File ~/mambaforge/envs/dsc80/lib/python3.8/site-packages/pandas/_libs/index.pyx:76, in pandas._libs.index.IndexEngine.get_loc()

File ~/mambaforge/envs/dsc80/lib/python3.8/site-packages/pandas/_libs/index.pyx:108, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:5198, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:5206, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'breed'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[29], line 2
      1 # Breeds are stored in the index, which is not a column!
----> 2 dogs['breed']

File ~/mambaforge/envs/dsc80/lib/python3.8/site-packages/pandas/core/frame.py:3458, in DataFrame.__getitem__(self, key)
   3456 if self.columns.nlevels > 1:
   3457     return self._getitem_multilevel(key)
-> 3458 indexer = self.columns.get_loc(key)
   3459 if is_integer(indexer):
   3460     indexer = [indexer]

File ~/mambaforge/envs/dsc80/lib/python3.8/site-packages/pandas/core/indexes/base.py:3363, in Index.get_loc(self, key, method, tolerance)
   3361         return self._engine.get_loc(casted_key)
   3362     except KeyError as err:
-> 3363         raise KeyError(key) from err
   3365 if is_scalar(key) and isna(key) and not self.hasnans:
   3366     raise KeyError(key)

KeyError: 'breed'
In [30]:
dogs.index
Out[30]:
Index(['Brittany', 'Cairn Terrier', 'English Cocker Spaniel', 'Cocker Spaniel',
       'Shetland Sheepdog', 'Siberian Husky', 'Lhasa Apso',
       'Miniature Schnauzer', 'Chihuahua', 'English Springer Spaniel',
       'German Shorthaired Pointer', 'Pointer', 'Tibetan Spaniel',
       'Labrador Retriever', 'Maltese', 'Shih Tzu', 'Irish Setter',
       'Golden Retriever', 'Chesapeake Bay Retriever', 'Tibetan Terrier',
       'Gordon Setter', 'Pug', 'Norfolk Terrier', 'English Toy Spaniel',
       'Cavalier King Charles Spaniel', 'Basenji',
       'Staffordshire Bull Terrier', 'Pembroke Welsh Corgi', 'Clumber Spaniel',
       'Dandie Dinmont Terrier', 'Giant Schnauzer', 'Scottish Terrier',
       'Kerry Blue Terrier', 'Afghan Hound', 'Newfoundland',
       'Rhodesian Ridgeback', 'Borzoi', 'Bull Terrier', 'Alaskan Malamute',
       'Bloodhound', 'Bullmastiff', 'Mastiff', 'Saint Bernard'],
      dtype='object', name='breed')

Useful Series methods¶

There are a variety of useful methods that work on Series. You can see the entire list here. Many methods that work on a Series will also work on DataFrames, as we'll soon see.

In [31]:
dogs
Out[31]:
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
Cocker Spaniel sporting 24330.0 12.50 small 25.0 14.5
... ... ... ... ... ... ...
Bloodhound hound 13824.0 6.75 large 85.0 25.0
Bullmastiff working 13936.0 7.57 large 115.0 25.5
Mastiff working 13581.0 6.50 large 175.0 30.0
Saint Bernard working 20022.0 7.78 large 155.0 26.5

43 rows × 6 columns

In [32]:
# What are the unique dog kinds?
dogs['kind'].unique()
Out[32]:
array(['sporting', 'terrier', 'herding', 'working', 'non-sporting', 'toy',
       'hound'], dtype=object)
In [33]:
# What's the distribution of kinds?
dogs['kind'].value_counts()
Out[33]:
sporting        12
terrier          8
working          7
toy              6
hound            5
non-sporting     3
herding          2
Name: kind, dtype: int64
In [34]:
# What's the mean of the 'longevity' column?
dogs['longevity'].mean()
Out[34]:
11.340697674418605
In [35]:
# Tell me more about the 'weight' column.
dogs['weight'].describe()
Out[35]:
count     43.00
mean      49.35
std       39.42
min        5.00
25%       18.00
50%       36.50
75%       67.50
max      175.00
Name: weight, dtype: float64
In [36]:
# Sort the 'lifetime_cost' column. Note that here we're using sort_values on a Series, not a DataFrame!
dogs['lifetime_cost'].sort_values()
Out[36]:
breed
Mastiff                       13581.0
Bloodhound                    13824.0
Bullmastiff                   13936.0
Borzoi                        16176.0
                               ...   
Tibetan Spaniel               25549.0
German Shorthaired Pointer    25842.0
Chihuahua                     26250.0
Giant Schnauzer               26686.0
Name: lifetime_cost, Length: 43, dtype: float64

Subsetting rows (and columns)¶

Using loc to slice rows and columns using labels¶

In [37]:
# The first argument is the row label
#        ↓
dogs.loc['Pug', 'longevity']
#                  ↑
# The second argument is the column label
Out[37]:
11.0

💡 Pro-Tip: Using Pandas Tutor¶

If you want, you can install pandas_tutor from pip (in your terminal):

pip install pandas_tutor

Then, you can load the extension by adding:

%reload_ext pandas_tutor

At the top of your notebook. After that, you can render visualizations with the %%pandas_tutor or %%pt cell magics:

In [38]:
# Pandas Tutor setup. You'll need to run `pip install pandas_tutor` in your terminal
# for this cell to work, but you can also ignore the error and continue onward.
%reload_ext pandas_tutor
%set_pandas_tutor_options {"maxDisplayCols": 8, "nohover": True, "projectorMode": True}
In [39]:
%%pt
dogs.loc['Pug', 'longevity']

.loc is flexible¶

.loc will expand dimensions whenever an argument is a sequence:

In [40]:
dogs.loc[['Pug', 'Labrador Retriever'], ['kind', 'size']]
Out[40]:
kind size
breed
Pug toy medium
Labrador Retriever sporting medium
In [41]:
dogs.loc[['Pug', 'Labrador Retriever'], :]
Out[41]:
kind lifetime_cost longevity size weight height
breed
Pug toy 18527.0 11.00 medium 16.0 16.0
Labrador Retriever sporting 21299.0 12.04 medium 67.5 23.0
In [42]:
# Shortcut for line above
dogs.loc[['Pug', 'Labrador Retriever']]
Out[42]:
kind lifetime_cost longevity size weight height
breed
Pug toy 18527.0 11.00 medium 16.0 16.0
Labrador Retriever sporting 21299.0 12.04 medium 67.5 23.0

Review: Filtering (aka Querying)¶

  • Filtering is the act of selecting rows in a DataFrame that satisfy certain condition(s).
  • Comparisons with arrays (Series) result in Boolean arrays (Series).
  • We can use comparisons along with the loc operator to filter a DataFrame.
In [43]:
dogs
Out[43]:
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
Cocker Spaniel sporting 24330.0 12.50 small 25.0 14.5
... ... ... ... ... ... ...
Bloodhound hound 13824.0 6.75 large 85.0 25.0
Bullmastiff working 13936.0 7.57 large 115.0 25.5
Mastiff working 13581.0 6.50 large 175.0 30.0
Saint Bernard working 20022.0 7.78 large 155.0 26.5

43 rows × 6 columns

In [44]:
dogs.loc[dogs['weight'] < 10]
Out[44]:
kind lifetime_cost longevity size weight height
breed
Chihuahua toy 26250.0 16.50 small 5.5 5.0
Maltese toy 19084.0 12.25 small 5.0 9.0
In [45]:
dogs.loc[dogs.index.str.contains('Spaniel')]
Out[45]:
kind lifetime_cost longevity size weight height
breed
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
Cocker Spaniel sporting 24330.0 12.50 small 25.0 14.5
English Springer Spaniel sporting 21946.0 12.54 medium 45.0 19.5
Tibetan Spaniel non-sporting 25549.0 14.42 small 12.0 10.0
English Toy Spaniel toy 17521.0 10.10 small 11.0 10.0
Cavalier King Charles Spaniel toy 18639.0 11.29 small 15.5 12.5
Clumber Spaniel sporting 18084.0 10.00 medium 70.0 18.5
In [46]:
# Because filtering is so common, there's a shortcut:
dogs[dogs.index.str.contains('Spaniel')]
Out[46]:
kind lifetime_cost longevity size weight height
breed
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
Cocker Spaniel sporting 24330.0 12.50 small 25.0 14.5
English Springer Spaniel sporting 21946.0 12.54 medium 45.0 19.5
Tibetan Spaniel non-sporting 25549.0 14.42 small 12.0 10.0
English Toy Spaniel toy 17521.0 10.10 small 11.0 10.0
Cavalier King Charles Spaniel toy 18639.0 11.29 small 15.5 12.5
Clumber Spaniel sporting 18084.0 10.00 medium 70.0 18.5

Note that because we set the index to 'breed' earlier, we can select rows based on dog breeds without having to query.

In [47]:
dogs
Out[47]:
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
Cocker Spaniel sporting 24330.0 12.50 small 25.0 14.5
... ... ... ... ... ... ...
Bloodhound hound 13824.0 6.75 large 85.0 25.0
Bullmastiff working 13936.0 7.57 large 115.0 25.5
Mastiff working 13581.0 6.50 large 175.0 30.0
Saint Bernard working 20022.0 7.78 large 155.0 26.5

43 rows × 6 columns

In [48]:
# Series!
dogs.loc['Maltese']
Out[48]:
kind                 toy
lifetime_cost    19084.0
longevity          12.25
size               small
weight               5.0
height               9.0
Name: Maltese, dtype: object

If 'breed' was instead a column, then we'd need to query to access information about a particular school.

In [49]:
dogs_reset = dogs.reset_index()
dogs_reset
Out[49]:
breed kind lifetime_cost longevity size weight height
0 Brittany sporting 22589.0 12.92 medium 35.0 19.0
1 Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
2 English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
3 Cocker Spaniel sporting 24330.0 12.50 small 25.0 14.5
... ... ... ... ... ... ... ...
39 Bloodhound hound 13824.0 6.75 large 85.0 25.0
40 Bullmastiff working 13936.0 7.57 large 115.0 25.5
41 Mastiff working 13581.0 6.50 large 175.0 30.0
42 Saint Bernard working 20022.0 7.78 large 155.0 26.5

43 rows × 7 columns

In [50]:
# DataFrame!
dogs_reset[dogs_reset['breed'] == 'Maltese']
Out[50]:
breed kind lifetime_cost longevity size weight height
14 Maltese toy 19084.0 12.25 small 5.0 9.0

Filtering with Multiple Conditions¶

Remember, you need parentheses around each condition. Also, you must use & and | instead of the and and or keywords. pandas makes weird decisions sometimes!

In [51]:
dogs[(dogs['weight'] < 20) & (dogs['kind'] == 'terrier')]
Out[51]:
kind lifetime_cost longevity size weight height
breed
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
Miniature Schnauzer terrier 20087.0 11.81 small 15.5 13.0
Norfolk Terrier terrier 24308.0 13.07 small 12.0 9.5

💡 Pro-Tip: Using .query (optional)¶

.query is a convenient way to filter, since you don't need parentheses and you can use the and and or keywords. We'll use it during lecture, but you won't need to use it yourself unless you'd like to. (It won't be used in our exams.)

In [52]:
dogs.query('weight < 20 and kind == "terrier"')
Out[52]:
kind lifetime_cost longevity size weight height
breed
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
Miniature Schnauzer terrier 20087.0 11.81 small 15.5 13.0
Norfolk Terrier terrier 24308.0 13.07 small 12.0 9.5
In [53]:
dogs.query('kind in ["sporting", "terrier"] and lifetime_cost < 20000')
Out[53]:
kind lifetime_cost longevity size weight height
breed
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
Chesapeake Bay Retriever sporting 16697.0 9.48 large 67.5 23.5
Gordon Setter sporting 19605.0 11.10 large 62.5 25.0
Clumber Spaniel sporting 18084.0 10.00 medium 70.0 18.5
Scottish Terrier terrier 17525.0 10.69 small 20.0 10.0
Kerry Blue Terrier terrier 17240.0 9.40 medium 36.5 18.5
Bull Terrier terrier 18490.0 10.21 medium 60.0 21.5

Don't forget iloc!¶

  • iloc stands for "integer location".
  • iloc is like loc, but it selects rows and columns based off of integer positions only.
In [54]:
dogs
Out[54]:
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
Cocker Spaniel sporting 24330.0 12.50 small 25.0 14.5
... ... ... ... ... ... ...
Bloodhound hound 13824.0 6.75 large 85.0 25.0
Bullmastiff working 13936.0 7.57 large 115.0 25.5
Mastiff working 13581.0 6.50 large 175.0 30.0
Saint Bernard working 20022.0 7.78 large 155.0 26.5

43 rows × 6 columns

In [55]:
dogs.iloc[3:7, :-1]
Out[55]:
kind lifetime_cost longevity size weight
breed
Cocker Spaniel sporting 24330.0 12.50 small 25.0
Shetland Sheepdog herding 21006.0 12.53 small 22.0
Siberian Husky working 22049.0 12.58 medium 47.5
Lhasa Apso non-sporting 22031.0 13.92 small 15.0

iloc is often most useful when we sort first. For instance, to find the weight of the longest-living dog breed in the dataset:

In [56]:
dogs.sort_values('longevity', ascending=False)['weight'].iloc[0]
Out[56]:
5.5
In [57]:
# Finding the breed involves sorting, but not iloc.
dogs.sort_values('longevity', ascending=False).index[0]
Out[57]:
'Chihuahua'

More Practice¶

Consider the DataFrame below.

In [58]:
jack = pd.DataFrame({1: ['fee', 'fi'], 
                     '1': ['fo', 'fum']})
jack
Out[58]:
1 1
0 fee fo
1 fi fum

For each of the following pieces of code, predict what the output will be. Then, uncomment the line of code and see for yourself. We may not be able to cover these all in class; if so, make sure to try them on your own. Here's a Pandas Tutor link to visualize these!

In [59]:
# jack[1]
In [60]:
# jack[[1]]
In [61]:
# jack['1']
In [62]:
# jack[[1, 1]]
In [63]:
# jack.loc[1]
In [64]:
# jack.loc[jack[1] == 'fo']
In [65]:
# jack[1, ['1', 1]]
In [66]:
# jack.loc[1,1]

Adding and modifying columns¶

Adding and modifying columns, using a copy¶

  • To add a new column to a DataFrame, use the assign method.
    • To change the values in a column, add a new column with the same name as the existing column.
  • Like most pandas methods, assign returns a new DataFrame.
    • Pro ✅: This doesn't inadvertently change any existing variables.
    • Con ❌: It is not very space efficient, as it creates a new copy each time it is called.
In [67]:
dogs.assign(cost_per_year=dogs['lifetime_cost'] / dogs['longevity'])
Out[67]:
kind lifetime_cost longevity size weight height cost_per_year
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0 1748.37
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0 1589.02
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0 1628.90
Cocker Spaniel sporting 24330.0 12.50 small 25.0 14.5 1946.40
... ... ... ... ... ... ... ...
Bloodhound hound 13824.0 6.75 large 85.0 25.0 2048.00
Bullmastiff working 13936.0 7.57 large 115.0 25.5 1840.95
Mastiff working 13581.0 6.50 large 175.0 30.0 2089.38
Saint Bernard working 20022.0 7.78 large 155.0 26.5 2573.52

43 rows × 7 columns

In [68]:
dogs
Out[68]:
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
Cocker Spaniel sporting 24330.0 12.50 small 25.0 14.5
... ... ... ... ... ... ...
Bloodhound hound 13824.0 6.75 large 85.0 25.0
Bullmastiff working 13936.0 7.57 large 115.0 25.5
Mastiff working 13581.0 6.50 large 175.0 30.0
Saint Bernard working 20022.0 7.78 large 155.0 26.5

43 rows × 6 columns

💡 Pro-Tip: Method chaining¶

I recommend chaining methods together instead of writing one long line:

In [69]:
(dogs
 .assign(cost_per_year=dogs['lifetime_cost'] / dogs['longevity'])
 .sort_values('cost_per_year')
 .iloc[:5]
)
Out[69]:
kind lifetime_cost longevity size weight height cost_per_year
breed
Maltese toy 19084.0 12.25 small 5.0 9.00 1557.88
Lhasa Apso non-sporting 22031.0 13.92 small 15.0 10.50 1582.69
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.00 1589.02
Chihuahua toy 26250.0 16.50 small 5.5 5.00 1590.91
Shih Tzu toy 21152.0 13.20 small 12.5 9.75 1602.42

💡 Pro-Tip: assign for column names with special characters¶

You can also use assign when the desired column name has spaces (and other special characters) by unpacking a dict:

In [70]:
(dogs
 .assign(**{'Cost per year 💵': dogs['lifetime_cost'] / dogs['longevity']})
)
Out[70]:
kind lifetime_cost longevity size weight height Cost per year 💵
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0 1748.37
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0 1589.02
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0 1628.90
Cocker Spaniel sporting 24330.0 12.50 small 25.0 14.5 1946.40
... ... ... ... ... ... ... ...
Bloodhound hound 13824.0 6.75 large 85.0 25.0 2048.00
Bullmastiff working 13936.0 7.57 large 115.0 25.5 1840.95
Mastiff working 13581.0 6.50 large 175.0 30.0 2089.38
Saint Bernard working 20022.0 7.78 large 155.0 26.5 2573.52

43 rows × 7 columns

Adding and modifying columns, in-place¶

  • You can assign a new column to a DataFrame in-place using [].
    • This works like dictionary assignment.
    • This modifies the underlying DataFrame, unlike assign, which returns a new DataFrame.
  • This is the more "common" way of adding/modifying columns.
    • ⚠️ Warning: Exercise caution when using this approach, since this approach changes the values of existing variables.
In [71]:
# By default, .copy() returns a deep copy of the object it is called on,
# meaning that if you change the copy the original remains unmodified.
dogs_copy = dogs.copy()
dogs_copy.head(2)
Out[71]:
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
In [72]:
dogs_copy['cost_per_year'] = dogs_copy['lifetime_cost'] / dogs_copy['longevity']
dogs_copy
Out[72]:
kind lifetime_cost longevity size weight height cost_per_year
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0 1748.37
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0 1589.02
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0 1628.90
Cocker Spaniel sporting 24330.0 12.50 small 25.0 14.5 1946.40
... ... ... ... ... ... ... ...
Bloodhound hound 13824.0 6.75 large 85.0 25.0 2048.00
Bullmastiff working 13936.0 7.57 large 115.0 25.5 1840.95
Mastiff working 13581.0 6.50 large 175.0 30.0 2089.38
Saint Bernard working 20022.0 7.78 large 155.0 26.5 2573.52

43 rows × 7 columns

Note that we never reassigned dogs in the cell above – that is, we never wrote dogs_copy = ... – though it was still modified.

Mutability¶

DataFrames, like lists, arrays, and dictionaries, are mutable. As you learned in DSC 20, this means that they can be modified after being created.

Not only does this explain the behavior on the previous slide, but it also explains the following:

In [73]:
dogs_copy.head(2)
Out[73]:
kind lifetime_cost longevity size weight height cost_per_year
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0 1748.37
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0 1589.02
In [74]:
def cost_in_thousands():
    dogs_copy['lifetime_cost'] = dogs_copy['lifetime_cost'] / 1000
In [75]:
# What happens when we run this twice?
cost_in_thousands()
In [76]:
dogs_copy
Out[76]:
kind lifetime_cost longevity size weight height cost_per_year
breed
Brittany sporting 22.59 12.92 medium 35.0 19.0 1748.37
Cairn Terrier terrier 21.99 13.84 small 14.0 10.0 1589.02
English Cocker Spaniel sporting 18.99 11.66 medium 30.0 16.0 1628.90
Cocker Spaniel sporting 24.33 12.50 small 25.0 14.5 1946.40
... ... ... ... ... ... ... ...
Bloodhound hound 13.82 6.75 large 85.0 25.0 2048.00
Bullmastiff working 13.94 7.57 large 115.0 25.5 1840.95
Mastiff working 13.58 6.50 large 175.0 30.0 2089.38
Saint Bernard working 20.02 7.78 large 155.0 26.5 2573.52

43 rows × 7 columns

⚠️ Avoid mutation when possible¶

Note that dogs was modified, even though we didn't reassign it! These unintended consequences can influence the behavior of test cases on labs and projects, among other things!

To avoid this, it's a good idea to avoid mutation when possible. If you must use mutation, include df = df.copy() as the first line in functions that take DataFrames as input.

Also, some methods let you use the inplace=True argument to mutate the original. Don't use this argument, since future pandas releases plan to remove it.

Replacing values¶

Instead of mutation, we recommend using replace, which returns a copy of the original dataframe:

In [77]:
dogs.replace({'kind': {'sporting': "Sam's favorites"}})
Out[77]:
kind lifetime_cost longevity size weight height
breed
Brittany Sam's favorites 22589.0 12.92 medium 35.0 19.0
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
English Cocker Spaniel Sam's favorites 18993.0 11.66 medium 30.0 16.0
Cocker Spaniel Sam's favorites 24330.0 12.50 small 25.0 14.5
... ... ... ... ... ... ...
Bloodhound hound 13824.0 6.75 large 85.0 25.0
Bullmastiff working 13936.0 7.57 large 115.0 25.5
Mastiff working 13581.0 6.50 large 175.0 30.0
Saint Bernard working 20022.0 7.78 large 155.0 26.5

43 rows × 6 columns

Axes¶

Axes¶

  • The rows and columns of a DataFrame are both stored as Series.
  • The axis specifies the direction of a "slice" of a DataFrame.
No description has been provided for this image
  • Axis 0 refers to the index (rows).
  • Axis 1 refers to the columns.

DataFrame methods with axis¶

Consider the DataFrame A defined below using a dictionary.

In [78]:
A = pd.DataFrame({
    'A': [1, 4],
    'B': [2, 5],
    'C': [3, 6],
})
A
Out[78]:
A B C
0 1 2 3
1 4 5 6

If we specify axis=0, A.sum will "compress" along axis 0, and keep the column labels intact.

In [79]:
A.sum(axis=1)
Out[79]:
0     6
1    15
dtype: int64
In [80]:
A.sum(0)
Out[80]:
A    5
B    7
C    9
dtype: int64

If we specify axis=1, A.sum will "compress" along axis 1, and keep the row labels (index) intact.

In [81]:
A.sum(axis=1)
Out[81]:
0     6
1    15
dtype: int64
No description has been provided for this image

What's the default axis?

In [82]:
A
Out[82]:
A B C
0 1 2 3
1 4 5 6
In [83]:
A.sum()
Out[83]:
A    5
B    7
C    9
dtype: int64

DataFrame methods with axis¶

  • In addition to sum, many other Series methods work on DataFrames.
  • In such cases, the DataFrame method usually applies the Series method to every row or column.
  • Many of these methods accept an axis argument; the default is usually axis=0.
In [84]:
# Max element in each column
dogs.max()
Out[84]:
kind             working
lifetime_cost    26686.0
longevity           16.5
size               small
weight             175.0
height              30.0
dtype: object
In [85]:
# The number of unique values in each column.
dogs.nunique()
Out[85]:
kind              7
lifetime_cost    43
longevity        40
size              3
weight           37
height           30
dtype: int64
In [86]:
# describe doesn't accept an axis argument; it works on every numeric column in the DataFrame it is called on.
dogs.describe()
Out[86]:
lifetime_cost longevity weight height
count 43.00 43.00 43.00 43.00
mean 20532.84 11.34 49.35 18.34
std 3290.78 2.05 39.42 6.83
min 13581.00 6.50 5.00 5.00
25% 18508.50 10.05 18.00 11.75
50% 21006.00 11.81 36.50 18.50
75% 22072.50 12.52 67.50 25.00
max 26686.00 16.50 175.00 30.00

pandas and numpy¶

No description has been provided for this image

pandas is built upon numpy¶

  • A Series in pandas is a numpy array with an index.
  • A DataFrame is like a dictionary of columns, each of which is a numpy array.
  • Many operations in pandas are fast because they use numpy's implementations.
  • If you need access the array underlying a DataFrame or Series, use the to_numpy method.
In [87]:
dogs['lifetime_cost']
Out[87]:
breed
Brittany                  22589.0
Cairn Terrier             21992.0
English Cocker Spaniel    18993.0
Cocker Spaniel            24330.0
                           ...   
Bloodhound                13824.0
Bullmastiff               13936.0
Mastiff                   13581.0
Saint Bernard             20022.0
Name: lifetime_cost, Length: 43, dtype: float64

pandas data types¶

  • Each Series (column) has a numpy data type, which refers to the type of the values stored within. Access it using the dtypes attribute.
  • A column's data type determines which operations can be applied to it.
  • pandas tries to guess the correct data types for a given DataFrame, and is often wrong.
    • This can lead to incorrect calculations and poor memory/time performance.
  • As a result, you will often need to explicitly convert between data types.
In [88]:
dogs.dtypes
Out[88]:
kind              object
lifetime_cost    float64
longevity        float64
size              object
weight           float64
height           float64
dtype: object

pandas data types¶

Notice that Python str types are object types in numpy and pandas.

Pandas dtype Python type NumPy type SQL type Usage
int64 int int_, int8,...,int64, uint8,...,uint64 INT, BIGINT Integer numbers
float64 float float_, float16, float32, float64 FLOAT Floating point numbers
bool bool bool_ BOOL True/False values
datetime64 or Timestamp datetime.datetime datetime64 DATETIME Date and time values
timedelta64 or Timedelta datetime.timedelta timedelta64 NA Differences between two datetimes
category NA NA ENUM Finite list of text values
object str string, unicode NA Text
object NA object NA Mixed types

This article details how pandas stores different data types under the hood.

This article explains how numpy/pandas int64 operations differ from vanilla int operations.

Type conversion¶

You can change the data type of a Series using the .astype Series method.

For instance, we can change the data type of the 'lifetime_cost' column in dogs to be int64:

In [89]:
dogs.head()
Out[89]:
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
Cocker Spaniel sporting 24330.0 12.50 small 25.0 14.5
Shetland Sheepdog herding 21006.0 12.53 small 22.0 14.5
In [90]:
dogs.dtypes
Out[90]:
kind              object
lifetime_cost    float64
longevity        float64
size              object
weight           float64
height           float64
dtype: object
In [91]:
dogs['lifetime_cost'].astype(np.int64)
Out[91]:
breed
Brittany                  22589
Cairn Terrier             21992
English Cocker Spaniel    18993
Cocker Spaniel            24330
                          ...  
Bloodhound                13824
Bullmastiff               13936
Mastiff                   13581
Saint Bernard             20022
Name: lifetime_cost, Length: 43, dtype: int64

💡 Pro-Tip: Setting dtypes in read_csv¶

Usually, we prefer to set the correct dtypes in read_csv, since it can help pandas load in files more quickly:

In [92]:
dogs = pd.read_csv('data/dogs43.csv', dtype={'lifetime_cost': int})
dogs
Out[92]:
breed kind lifetime_cost longevity size weight height
0 Brittany sporting 22589 12.92 medium 35.0 19.0
1 Cairn Terrier terrier 21992 13.84 small 14.0 10.0
2 English Cocker Spaniel sporting 18993 11.66 medium 30.0 16.0
3 Cocker Spaniel sporting 24330 12.50 small 25.0 14.5
... ... ... ... ... ... ... ...
39 Bloodhound hound 13824 6.75 large 85.0 25.0
40 Bullmastiff working 13936 7.57 large 115.0 25.5
41 Mastiff working 13581 6.50 large 175.0 30.0
42 Saint Bernard working 20022 7.78 large 155.0 26.5

43 rows × 7 columns

In [93]:
dogs.dtypes
Out[93]:
breed             object
kind              object
lifetime_cost      int64
longevity        float64
size              object
weight           float64
height           float64
dtype: object

Putting it all together¶

Talk to your neighbor about a dog breed that you personally like or know the name of. Then, try to find a few other dog breeds that are similar in weight to yours in all_dogs. Which similar breeds have the lowest and highest lifetime_cost? intelligence_rank? Are there any similar breeds that you haven't heard of before?

As a bonus, look up these dog breeds on the AKC website to see how they look!

In [97]:
all_dogs
Out[97]:
breed group datadog popularity_all ... megarank size weight height
0 Border Collie herding 3.64 45 ... 29.0 medium NaN 20.0
1 Border Terrier terrier 3.61 80 ... 1.0 small 13.5 NaN
2 Brittany sporting 3.54 30 ... 11.0 medium 35.0 19.0
3 Cairn Terrier terrier 3.53 59 ... 2.0 small 14.0 10.0
... ... ... ... ... ... ... ... ... ...
168 Welsh Terrier terrier NaN 99 ... NaN small 20.0 15.0
169 Wire Fox Terrier terrier NaN 100 ... NaN small 17.5 15.0
170 Wirehaired Pointing Griffon sporting NaN 92 ... NaN medium NaN 22.0
171 Xoloitzcuintli non-sporting NaN 155 ... NaN medium NaN 16.5

172 rows × 18 columns

Summary, next time¶

Summary¶

  • pandas is the library for tabular data manipulation in Python.
  • There are three key data structures in pandas: DataFrame, Series, and Index.
  • Refer to the lecture notebook and the pandas documentation for tips.
  • pandas relies heavily on numpy. An understanding of how data types work in both will allow you to write more efficient and bug-free code.
  • Series and DataFrames share many methods (refer to the pandas documentation for more details).
  • Most pandas methods return copies of Series/DataFrames. Be careful when using techniques that modify values in-place.
  • Next time: groupby and data granularity.