In [1]:
from dsc80_utils import *

Lecture 1 – Introduction, Data Science Lifecycle¶

Welcome to DSC 259R! 🎉

Agenda¶

  • Who are we?
  • What does a data scientist do?
  • What is this course about, and how will it run?
  • The data science lifecycle.
  • Example: What's in a name?

Instructor: Samuel Lau (call me Sam)¶

Prof. Sam Lau¶

No description has been provided for this image
  • Assistant Teaching Professor, HDSI, UCSD
  • Website: https://lau.ucsd.edu/

I design curriculum and invent tools for teaching programming and data science.

Bio: Ph.D. UCSD (2023), M.S. UC Berkeley (2018), B.S. UC Berkeley (2017).

  • My second year as a professor at UCSD 🎉
  • Pandas Tutor visualizes pandas code 📊: https://pandastutor.com/
  • Learning Data Science, a free textbook on data science 📚: https://learningds.org
  • Outside the classroom 👨‍🏫: cooking, eating out, woodworking

Course staff¶

In addition to the instructor, we have several staff members who are here to help you in discussion, office hours, and on Ed:

  • 1 graduate student TA: Mizuho Fukada.
  • 7 undergraduate tutors: Andrew Yang, Anish Kasam, Gabriel Cha, Luran Zhang, Qirui Zheng, Sunan Xu, and Ylesia Wu.

What is data science? 🤔¶

What is data science?¶


No description has been provided for this image

Everyone seems to have their own definition of what data science is!

One common definition¶

One view is that data science is about drawing useful conclusions from data using computation. For example, by:

  • Using Python to explore and visualize data.
  • Using simulation to make inferences about a population, given just a sample.
  • Making predictions about the future given data from the past.

Let's look at a few more definitions of data science.

What is data science?¶

No description has been provided for this image

In 2010, Drew Conway published his famous Data Science Venn Diagram.

What is data science?¶

There isn't agreement on which "Venn Diagram" is correct!

No description has been provided for this image
  • Why not? The field is new and rapidly developing.
  • Make sure you're solid on the fundamentals, then find a niche that you enjoy.
  • Read Taylor, Battle of the Data Science Venn Diagrams.

What does a data scientist do?¶

The chart below is taken from the 2016 Data Science Salary Survey, administered by O'Reilly. They asked respondents what they spend their time doing on a daily basis. What do you notice?

No description has been provided for this image

The chart below is taken from the followup 2021 Data/AI Salary Survey, also administered by O'Reilly. They asked respondents:

What technologies will have the biggest effect on compensation in the coming year?

No description has been provided for this image

What does a data scientist do?¶

Our take: we are training you to ask and answer questions using data.

As you take more courses, we're training you to answer questions whose answers are ambiguous – this uncertainly is what makes data science challenging!

Let's look at some examples of data science in practice.

Do people care about climate change?¶

From How Americans Think About Climate Change, in Six Maps.

No description has been provided for this image
No description has been provided for this image

Do people care about climate change?¶

No description has been provided for this image
No description has been provided for this image

An excerpt from the article:

Global warming is precisely the kind of threat humans are awful at dealing with: a problem with enormous consequences over the long term, but little that is sharply visible on a personal level in the short term. Humans are hard-wired for quick fight-or-flight reactions in the face of an imminent threat, but not highly motivated to act against slow-moving and somewhat abstract problems, even if the challenges that they pose are ultimately dire.

Data science involves people 🧍¶

The decisions that we make as data scientists have the potential to impact the livelihoods of other people.

  • Flu case forecasting.
  • Admissions and hiring.
  • Hyper-personalized ad recommendations.

What is this course really about, then?¶

  • Good data analysis is not:
    • A simple application of a statistics formula.
    • A simple application of computer programs.
  • There are many tools out there for data science, but they are merely tools. They don’t do any of the important thinking – that's where you come in!

Course content¶

Course goals¶

We will teach you to think like a data scientist.

In this course, you will...

  • Get a taste of the "life of a data scientist."
  • Practice translating potentially vague questions into quantitative questions about measurable observations.
  • Learn to reason about "black-box" processes (e.g. complicated models).
  • Understand computational and statistical implications of working with data.
  • Learn to use real data tools (and rely on documentation).

Course outcomes¶

After this course, you will...

  • Be prepared for internships and data science "take home" interviews!
  • Be ready to create your own portfolio of personal projects.

Topics¶

  • Week 1: Introduction to pandas.
  • Week 2: DataFrames.
  • Week 3: Working with messy data, hypothesis and permutation testing.
  • Week 4: Missing values.
  • Week 5: HTML, Midterm Exam.
  • Week 6: Web and text data.
  • Week 7: Text data, modeling.
  • Week 8: Feature engineering and generalization.
  • Week 9: Modeling in sklearn.
  • Week 10: Classifier evaluation, fairness, conclusion.
  • Week 11: Final Exam

The data science lifecycle 🚴¶

The scientific method¶

You learned about the scientific method in elementary school.

No description has been provided for this image

However, it hides a lot of complexity.

  • Where did the hypothesis come from?
  • What data are you modeling? Is the data sufficient?
  • Under which conditions are the conclusions valid?

The data science lifecycle¶

No description has been provided for this image

All steps lead to more questions! We'll refer back to the data science lifecycle repeatedly throughout the quarter.

DataFrame Fundamentals¶

Let's do a review of pandas fundamentals that I assume you've seen before (but may need a refresher!).

Example: Dog Breeds (woof!) 🐶¶

Let's take a look at some data about dogs that comes from the American Kennel Club. Here's a cool plot made using our dataset.

No description has been provided for this image
In [2]:
# You'll see the Path(...) / subpath syntax a lot.
# It creates the correct path to your file,
# whether you're using Windows, macOS, or Linux.
dog_path = Path('data') / 'dogs43.csv'
dogs = pd.read_csv(dog_path)
dogs
Out[2]:
breed kind lifetime_cost longevity size weight height
0 Brittany sporting 22589.0 12.92 medium 35.0 19.0
1 Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
2 English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
... ... ... ... ... ... ... ...
40 Bullmastiff working 13936.0 7.57 large 115.0 25.5
41 Mastiff working 13581.0 6.50 large 175.0 30.0
42 Saint Bernard working 20022.0 7.78 large 155.0 26.5

43 rows × 7 columns

Review: head, tail, shape, index, and sort_values¶

To extract the first or last few rows of a DataFrame, use the head or tail methods.

In [3]:
dogs.head(3)
Out[3]:
breed kind lifetime_cost longevity size weight height
0 Brittany sporting 22589.0 12.92 medium 35.0 19.0
1 Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
2 English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
In [4]:
dogs.tail(2)
Out[4]:
breed kind lifetime_cost longevity size weight height
41 Mastiff working 13581.0 6.50 large 175.0 30.0
42 Saint Bernard working 20022.0 7.78 large 155.0 26.5

The shape attribute returns the DataFrame's number of rows and columns.

In [5]:
dogs.shape
Out[5]:
(43, 7)
In [6]:
# The default index of a DataFrame is 0, 1, 2, 3, ...
dogs.index
Out[6]:
RangeIndex(start=0, stop=43, step=1)

And lastly, remember that to sort by a column, use the sort_values method. Like most DataFrame and Series methods, sort_values returns a new DataFrame, and doesn't modify the original.

In [7]:
# Note that the index is no longer 0, 1, 2, ...!
dogs.sort_values('height', ascending=False)
Out[7]:
breed kind lifetime_cost longevity size weight height
41 Mastiff working 13581.0 6.50 large 175.0 30.0
36 Borzoi hound 16176.0 9.08 large 82.5 28.0
34 Newfoundland working 19351.0 9.32 large 125.0 27.0
... ... ... ... ... ... ... ...
29 Dandie Dinmont Terrier terrier 21633.0 12.17 small 21.0 9.0
14 Maltese toy 19084.0 12.25 small 5.0 9.0
8 Chihuahua toy 26250.0 16.50 small 5.5 5.0

43 rows × 7 columns

In [8]:
# This sorts by 'height',
# then breaks ties by 'longevity'.
# Note the difference in the last three rows between
# this DataFrame and the one above.
dogs.sort_values(['height', 'longevity'],
                 ascending=False)
Out[8]:
breed kind lifetime_cost longevity size weight height
41 Mastiff working 13581.0 6.50 large 175.0 30.0
36 Borzoi hound 16176.0 9.08 large 82.5 28.0
34 Newfoundland working 19351.0 9.32 large 125.0 27.0
... ... ... ... ... ... ... ...
14 Maltese toy 19084.0 12.25 small 5.0 9.0
29 Dandie Dinmont Terrier terrier 21633.0 12.17 small 21.0 9.0
8 Chihuahua toy 26250.0 16.50 small 5.5 5.0

43 rows × 7 columns

Note that dogs is not the DataFrame above. To save our changes, we'd need to say something like dogs = dogs.sort_values....

In [9]:
dogs
Out[9]:
breed kind lifetime_cost longevity size weight height
0 Brittany sporting 22589.0 12.92 medium 35.0 19.0
1 Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
2 English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
... ... ... ... ... ... ... ...
40 Bullmastiff working 13936.0 7.57 large 115.0 25.5
41 Mastiff working 13581.0 6.50 large 175.0 30.0
42 Saint Bernard working 20022.0 7.78 large 155.0 26.5

43 rows × 7 columns

Setting the index¶

Think of each row's index as its unique identifier or name. Often, we like to set the index of a DataFrame to a unique identifier if we have one available. We can do so with the set_index method.

In [10]:
dogs.set_index('breed')
Out[10]:
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
... ... ... ... ... ... ...
Bullmastiff working 13936.0 7.57 large 115.0 25.5
Mastiff working 13581.0 6.50 large 175.0 30.0
Saint Bernard working 20022.0 7.78 large 155.0 26.5

43 rows × 6 columns

In [11]:
# The above cell didn't involve an assignment statement,
# so dogs was unchanged.
dogs
Out[11]:
breed kind lifetime_cost longevity size weight height
0 Brittany sporting 22589.0 12.92 medium 35.0 19.0
1 Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
2 English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
... ... ... ... ... ... ... ...
40 Bullmastiff working 13936.0 7.57 large 115.0 25.5
41 Mastiff working 13581.0 6.50 large 175.0 30.0
42 Saint Bernard working 20022.0 7.78 large 155.0 26.5

43 rows × 7 columns

In [12]:
# By reassigning dogs, our changes will persist.
dogs = dogs.set_index('breed')
dogs
Out[12]:
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
... ... ... ... ... ... ...
Bullmastiff working 13936.0 7.57 large 115.0 25.5
Mastiff working 13581.0 6.50 large 175.0 30.0
Saint Bernard working 20022.0 7.78 large 155.0 26.5

43 rows × 6 columns

In [13]:
# There used to be 7 columns, but now there are only 6!
dogs.shape
Out[13]:
(43, 6)

💡 Pro-Tip: Displaying more rows/columns¶

Sometimes, you just want pandas to display a lot of rows and columns. You can use this helper function to do that. It's not important to understand how it works, feel free to just use it!

In [14]:
from IPython.display import display
def display_df(df, rows=pd.options.display.max_rows, cols=pd.options.display.max_columns):
    """Displays n rows and cols from df."""
    with pd.option_context("display.max_rows", rows,
                           "display.max_columns", cols):
        display(df)
In [15]:
display_df(dogs.sort_values('weight', ascending=False),
           rows=43)
kind lifetime_cost longevity size weight height
breed
Mastiff working 13581.0 6.50 large 175.0 30.00
Saint Bernard working 20022.0 7.78 large 155.0 26.50
Newfoundland working 19351.0 9.32 large 125.0 27.00
Bullmastiff working 13936.0 7.57 large 115.0 25.50
Bloodhound hound 13824.0 6.75 large 85.0 25.00
Borzoi hound 16176.0 9.08 large 82.5 28.00
Alaskan Malamute working 21986.0 10.67 large 80.0 24.00
Rhodesian Ridgeback hound 16530.0 9.10 large 77.5 25.50
Giant Schnauzer working 26686.0 10.00 large 77.5 25.50
Clumber Spaniel sporting 18084.0 10.00 medium 70.0 18.50
Labrador Retriever sporting 21299.0 12.04 medium 67.5 23.00
Chesapeake Bay Retriever sporting 16697.0 9.48 large 67.5 23.50
Irish Setter sporting 20323.0 11.63 large 65.0 26.00
German Shorthaired Pointer sporting 25842.0 11.46 large 62.5 24.00
Gordon Setter sporting 19605.0 11.10 large 62.5 25.00
Bull Terrier terrier 18490.0 10.21 medium 60.0 21.50
Golden Retriever sporting 21447.0 12.04 medium 60.0 22.75
Pointer sporting 24445.0 12.42 large 59.5 25.50
Afghan Hound hound 24077.0 11.92 large 55.0 26.00
Siberian Husky working 22049.0 12.58 medium 47.5 21.75
English Springer Spaniel sporting 21946.0 12.54 medium 45.0 19.50
Kerry Blue Terrier terrier 17240.0 9.40 medium 36.5 18.50
Brittany sporting 22589.0 12.92 medium 35.0 19.00
Staffordshire Bull Terrier terrier 21650.0 12.05 medium 31.0 15.00
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.00
Pembroke Welsh Corgi herding 23978.0 12.25 small 26.0 11.00
Cocker Spaniel sporting 24330.0 12.50 small 25.0 14.50
Tibetan Terrier non-sporting 20336.0 12.31 small 24.0 15.50
Basenji hound 22096.0 13.58 medium 23.0 16.50
Shetland Sheepdog herding 21006.0 12.53 small 22.0 14.50
Dandie Dinmont Terrier terrier 21633.0 12.17 small 21.0 9.00
Scottish Terrier terrier 17525.0 10.69 small 20.0 10.00
Pug toy 18527.0 11.00 medium 16.0 16.00
Miniature Schnauzer terrier 20087.0 11.81 small 15.5 13.00
Cavalier King Charles Spaniel toy 18639.0 11.29 small 15.5 12.50
Lhasa Apso non-sporting 22031.0 13.92 small 15.0 10.50
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.00
Shih Tzu toy 21152.0 13.20 small 12.5 9.75
Tibetan Spaniel non-sporting 25549.0 14.42 small 12.0 10.00
Norfolk Terrier terrier 24308.0 13.07 small 12.0 9.50
English Toy Spaniel toy 17521.0 10.10 small 11.0 10.00
Chihuahua toy 26250.0 16.50 small 5.5 5.00
Maltese toy 19084.0 12.25 small 5.0 9.00

Subsetting¶

We use subsetting (also called slicing) to get contiguous rows and columns out of our DataFrame.

  • The standard way to select a column in pandas is by using the [] operator.
  • Specifying a column name returns the column as a Series.
  • Specifying a list of column names returns a DataFrame.
In [16]:
# Returns a Series.
dogs['kind']
Out[16]:
breed
Brittany                  sporting
Cairn Terrier              terrier
English Cocker Spaniel    sporting
                            ...   
Bullmastiff                working
Mastiff                    working
Saint Bernard              working
Name: kind, Length: 43, dtype: object
In [17]:
# Returns a DataFrame.
dogs[['kind', 'size']]
Out[17]:
kind size
breed
Brittany sporting medium
Cairn Terrier terrier small
English Cocker Spaniel sporting medium
... ... ...
Bullmastiff working large
Mastiff working large
Saint Bernard working large

43 rows × 2 columns

In [18]:
# 🤔
dogs[['kind']]
Out[18]:
kind
breed
Brittany sporting
Cairn Terrier terrier
English Cocker Spaniel sporting
... ...
Bullmastiff working
Mastiff working
Saint Bernard working

43 rows × 1 columns

In [19]:
# Breeds are stored in the index, which is not a column!
dogs['breed']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/repos/dsc80/private/.venv/lib/python3.13/site-packages/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key)
   3804 try:
-> 3805     return self._engine.get_loc(casted_key)
   3806 except KeyError as err:

File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'breed'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[19], line 2
      1 # Breeds are stored in the index, which is not a column!
----> 2 dogs['breed']

File ~/repos/dsc80/private/.venv/lib/python3.13/site-packages/pandas/core/frame.py:4102, in DataFrame.__getitem__(self, key)
   4100 if self.columns.nlevels > 1:
   4101     return self._getitem_multilevel(key)
-> 4102 indexer = self.columns.get_loc(key)
   4103 if is_integer(indexer):
   4104     indexer = [indexer]

File ~/repos/dsc80/private/.venv/lib/python3.13/site-packages/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
   3807     if isinstance(casted_key, slice) or (
   3808         isinstance(casted_key, abc.Iterable)
   3809         and any(isinstance(x, slice) for x in casted_key)
   3810     ):
   3811         raise InvalidIndexError(key)
-> 3812     raise KeyError(key) from err
   3813 except TypeError:
   3814     # If we have a listlike key, _check_indexing_error will raise
   3815     #  InvalidIndexError. Otherwise we fall through and re-raise
   3816     #  the TypeError.
   3817     self._check_indexing_error(key)

KeyError: 'breed'
In [20]:
dogs.index
Out[20]:
Index(['Brittany', 'Cairn Terrier', 'English Cocker Spaniel', 'Cocker Spaniel',
       'Shetland Sheepdog', 'Siberian Husky', 'Lhasa Apso',
       'Miniature Schnauzer', 'Chihuahua', 'English Springer Spaniel',
       'German Shorthaired Pointer', 'Pointer', 'Tibetan Spaniel',
       'Labrador Retriever', 'Maltese', 'Shih Tzu', 'Irish Setter',
       'Golden Retriever', 'Chesapeake Bay Retriever', 'Tibetan Terrier',
       'Gordon Setter', 'Pug', 'Norfolk Terrier', 'English Toy Spaniel',
       'Cavalier King Charles Spaniel', 'Basenji',
       'Staffordshire Bull Terrier', 'Pembroke Welsh Corgi', 'Clumber Spaniel',
       'Dandie Dinmont Terrier', 'Giant Schnauzer', 'Scottish Terrier',
       'Kerry Blue Terrier', 'Afghan Hound', 'Newfoundland',
       'Rhodesian Ridgeback', 'Borzoi', 'Bull Terrier', 'Alaskan Malamute',
       'Bloodhound', 'Bullmastiff', 'Mastiff', 'Saint Bernard'],
      dtype='object', name='breed')

Useful Series methods¶

There are a variety of useful methods that work on Series. You can see the entire list here. Many methods that work on a Series will also work on DataFrames, as we'll soon see.

In [21]:
dogs
Out[21]:
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
... ... ... ... ... ... ...
Bullmastiff working 13936.0 7.57 large 115.0 25.5
Mastiff working 13581.0 6.50 large 175.0 30.0
Saint Bernard working 20022.0 7.78 large 155.0 26.5

43 rows × 6 columns

In [22]:
# What are the unique kinds of dogs?
dogs['kind'].unique()
Out[22]:
array(['sporting', 'terrier', 'herding', 'working', 'non-sporting', 'toy',
       'hound'], dtype=object)
In [23]:
# How many unique kinds of dogs are there?
dogs['kind'].nunique()
Out[23]:
7
In [24]:
# What's the distribution of kinds?
dogs['kind'].value_counts()
Out[24]:
kind
sporting        12
terrier          8
working          7
toy              6
hound            5
non-sporting     3
herding          2
Name: count, dtype: int64
In [25]:
# What's the mean of the 'longevity' column?
dogs['longevity'].mean()
Out[25]:
np.float64(11.340697674418605)
In [26]:
# Tell me more about the 'weight' column.
dogs['weight'].describe()
Out[26]:
count     43.00
mean      49.35
std       39.42
          ...  
50%       36.50
75%       67.50
max      175.00
Name: weight, Length: 8, dtype: float64
In [27]:
# Sort the 'lifetime_cost' column. Note that here we're using sort_values on a Series, not a DataFrame!
dogs['lifetime_cost'].sort_values()
Out[27]:
breed
Mastiff                       13581.0
Bloodhound                    13824.0
Bullmastiff                   13936.0
                               ...   
German Shorthaired Pointer    25842.0
Chihuahua                     26250.0
Giant Schnauzer               26686.0
Name: lifetime_cost, Length: 43, dtype: float64
In [28]:
# Gives us the index of the largest value, not the largest value itself.
dogs['lifetime_cost'].idxmax()
Out[28]:
'Giant Schnauzer'

Use loc to slice rows and columns using labels¶

loc uses row labels and column labels.

In [29]:
dogs
Out[29]:
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
... ... ... ... ... ... ...
Bullmastiff working 13936.0 7.57 large 115.0 25.5
Mastiff working 13581.0 6.50 large 175.0 30.0
Saint Bernard working 20022.0 7.78 large 155.0 26.5

43 rows × 6 columns

In [30]:
# The first argument is the row label.
#        ↓
dogs.loc['Pug', 'longevity']
#                  ↑
# The second argument is the column label.
Out[30]:
np.float64(11.0)

As an aside, loc is not a method – it's an indexer.

In [31]:
type(dogs.loc)
Out[31]:
pandas.core.indexing._LocIndexer
In [32]:
type(dogs.sort_values)
Out[32]:
method

💡 Pro-Tip: Using Pandas Tutor¶

Pandas Tutor (developed by Sam) is a tool that you can use to visualize pandas code. It comes with your DSC 80 environment.

You can load the extension by adding:

%reload_ext pandas_tutor

At the top of your notebook. After that, you can render visualizations with the %%pt cell magic 🪄:

In [33]:
%reload_ext pandas_tutor
%set_pandas_tutor_options {"maxDisplayCols": 8, "nohover": True, "projectorMode": True}
In [34]:
%%pt
dogs.loc['Pug', 'longevity']

.loc is flexible 🧘¶

You can provide a sequence (list, array, Series) as either argument to .loc.

In [35]:
dogs
Out[35]:
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
... ... ... ... ... ... ...
Bullmastiff working 13936.0 7.57 large 115.0 25.5
Mastiff working 13581.0 6.50 large 175.0 30.0
Saint Bernard working 20022.0 7.78 large 155.0 26.5

43 rows × 6 columns

In [36]:
dogs.loc[['Cocker Spaniel', 'Labrador Retriever'], 'size']
Out[36]:
breed
Cocker Spaniel         small
Labrador Retriever    medium
Name: size, dtype: object
In [37]:
dogs.loc[['Cocker Spaniel', 'Labrador Retriever'], ['kind', 'size', 'height']]
Out[37]:
kind size height
breed
Cocker Spaniel sporting small 14.5
Labrador Retriever sporting medium 23.0
In [38]:
# Note that the 'weight' column is included!
dogs.loc[['Cocker Spaniel', 'Labrador Retriever'], 'lifetime_cost': 'weight']
Out[38]:
lifetime_cost longevity size weight
breed
Cocker Spaniel 24330.0 12.50 small 25.0
Labrador Retriever 21299.0 12.04 medium 67.5
In [39]:
dogs.loc[['Cocker Spaniel', 'Labrador Retriever'], :]
Out[39]:
kind lifetime_cost longevity size weight height
breed
Cocker Spaniel sporting 24330.0 12.50 small 25.0 14.5
Labrador Retriever sporting 21299.0 12.04 medium 67.5 23.0
In [40]:
# Shortcut for the line above.
dogs.loc[['Cocker Spaniel', 'Labrador Retriever']]
Out[40]:
kind lifetime_cost longevity size weight height
breed
Cocker Spaniel sporting 24330.0 12.50 small 25.0 14.5
Labrador Retriever sporting 21299.0 12.04 medium 67.5 23.0

Filtering (or Querying)¶

  • Filtering (aka querying) is the act of selecting rows in a DataFrame that satisfy certain condition(s).
  • Comparisons with arrays (or Series) result in Boolean arrays (or Series).
  • We can use comparisons along with the loc operator to filter a DataFrame.
In [41]:
dogs
Out[41]:
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
... ... ... ... ... ... ...
Bullmastiff working 13936.0 7.57 large 115.0 25.5
Mastiff working 13581.0 6.50 large 175.0 30.0
Saint Bernard working 20022.0 7.78 large 155.0 26.5

43 rows × 6 columns

In [42]:
dogs.loc[dogs['weight'] < 10]
Out[42]:
kind lifetime_cost longevity size weight height
breed
Chihuahua toy 26250.0 16.50 small 5.5 5.0
Maltese toy 19084.0 12.25 small 5.0 9.0
In [43]:
dogs.loc[dogs.index.str.contains('Retriever')]
Out[43]:
kind lifetime_cost longevity size weight height
breed
Labrador Retriever sporting 21299.0 12.04 medium 67.5 23.00
Golden Retriever sporting 21447.0 12.04 medium 60.0 22.75
Chesapeake Bay Retriever sporting 16697.0 9.48 large 67.5 23.50
In [44]:
# Because querying is so common, there's a shortcut:
dogs[dogs.index.str.contains('Retriever')]
Out[44]:
kind lifetime_cost longevity size weight height
breed
Labrador Retriever sporting 21299.0 12.04 medium 67.5 23.00
Golden Retriever sporting 21447.0 12.04 medium 60.0 22.75
Chesapeake Bay Retriever sporting 16697.0 9.48 large 67.5 23.50
In [45]:
# Empty DataFrame – not an error!
dogs.loc[dogs['kind'] == 'beaver']
Out[45]:
kind lifetime_cost longevity size weight height
breed

Note that because we set the index to 'breed' earlier, we can select rows based on dog breeds without having to query.

In [46]:
dogs
Out[46]:
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
... ... ... ... ... ... ...
Bullmastiff working 13936.0 7.57 large 115.0 25.5
Mastiff working 13581.0 6.50 large 175.0 30.0
Saint Bernard working 20022.0 7.78 large 155.0 26.5

43 rows × 6 columns

In [47]:
# Series!
dogs.loc['Maltese']
Out[47]:
kind                 toy
lifetime_cost    19084.0
longevity          12.25
size               small
weight               5.0
height               9.0
Name: Maltese, dtype: object

If 'breed' was instead a column, then we'd need to query to access information about a particular breed.

In [48]:
dogs_reset = dogs.reset_index()
dogs_reset
Out[48]:
breed kind lifetime_cost longevity size weight height
0 Brittany sporting 22589.0 12.92 medium 35.0 19.0
1 Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
2 English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
... ... ... ... ... ... ... ...
40 Bullmastiff working 13936.0 7.57 large 115.0 25.5
41 Mastiff working 13581.0 6.50 large 175.0 30.0
42 Saint Bernard working 20022.0 7.78 large 155.0 26.5

43 rows × 7 columns

In [49]:
# DataFrame!
dogs_reset[dogs_reset['breed'] == 'Maltese']
Out[49]:
breed kind lifetime_cost longevity size weight height
14 Maltese toy 19084.0 12.25 small 5.0 9.0

Filtering with multiple conditions¶

Remember, you need parentheses around each condition. Also, you must use the bitwise operators & and | instead of the standard and and or keywords. pandas makes weird decisions sometimes!

In [50]:
dogs
Out[50]:
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
... ... ... ... ... ... ...
Bullmastiff working 13936.0 7.57 large 115.0 25.5
Mastiff working 13581.0 6.50 large 175.0 30.0
Saint Bernard working 20022.0 7.78 large 155.0 26.5

43 rows × 6 columns

In [51]:
dogs[(dogs['weight'] < 20) & (dogs['kind'] == 'terrier')]
Out[51]:
kind lifetime_cost longevity size weight height
breed
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
Miniature Schnauzer terrier 20087.0 11.81 small 15.5 13.0
Norfolk Terrier terrier 24308.0 13.07 small 12.0 9.5

💡 Pro-Tip: Using .query¶

.query is a convenient way to query, since you don't need parentheses and you can use the and and or keywords.

In [52]:
dogs
Out[52]:
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
... ... ... ... ... ... ...
Bullmastiff working 13936.0 7.57 large 115.0 25.5
Mastiff working 13581.0 6.50 large 175.0 30.0
Saint Bernard working 20022.0 7.78 large 155.0 26.5

43 rows × 6 columns

In [53]:
dogs.query('weight < 20 and kind == "terrier"')
Out[53]:
kind lifetime_cost longevity size weight height
breed
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
Miniature Schnauzer terrier 20087.0 11.81 small 15.5 13.0
Norfolk Terrier terrier 24308.0 13.07 small 12.0 9.5
In [54]:
dogs.query('kind in ["sporting", "terrier"] and lifetime_cost < 20000')
Out[54]:
kind lifetime_cost longevity size weight height
breed
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
Chesapeake Bay Retriever sporting 16697.0 9.48 large 67.5 23.5
Gordon Setter sporting 19605.0 11.10 large 62.5 25.0
Clumber Spaniel sporting 18084.0 10.00 medium 70.0 18.5
Scottish Terrier terrier 17525.0 10.69 small 20.0 10.0
Kerry Blue Terrier terrier 17240.0 9.40 medium 36.5 18.5
Bull Terrier terrier 18490.0 10.21 medium 60.0 21.5

Don't forget iloc!¶

  • iloc stands for "integer location."
  • iloc is like loc, but it selects rows and columns based off of integer positions only, just like with 2D arrays.
In [55]:
dogs
Out[55]:
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
... ... ... ... ... ... ...
Bullmastiff working 13936.0 7.57 large 115.0 25.5
Mastiff working 13581.0 6.50 large 175.0 30.0
Saint Bernard working 20022.0 7.78 large 155.0 26.5

43 rows × 6 columns

In [56]:
dogs.iloc[1:15, :-2]
Out[56]:
kind lifetime_cost longevity size
breed
Cairn Terrier terrier 21992.0 13.84 small
English Cocker Spaniel sporting 18993.0 11.66 medium
Cocker Spaniel sporting 24330.0 12.50 small
... ... ... ... ...
Tibetan Spaniel non-sporting 25549.0 14.42 small
Labrador Retriever sporting 21299.0 12.04 medium
Maltese toy 19084.0 12.25 small

14 rows × 4 columns

iloc is often most useful when we sort first. For instance, to find the weight of the longest-living dog breed in the dataset:

In [57]:
dogs.sort_values('longevity', ascending=False)['weight'].iloc[0]
Out[57]:
np.float64(5.5)
In [58]:
# Finding the breed itself involves sorting, but not iloc.
dogs.sort_values('longevity', ascending=False).index[0]
Out[58]:
'Chihuahua'

Practice¶

Consider the DataFrame below.

In [59]:
jack = pd.DataFrame({1: ['fee', 'fi'],
                     '1': ['fo', 'fum']})
jack
Out[59]:
1 1
0 fee fo
1 fi fum

For each of the following pieces of code, predict what the output will be. Then, uncomment the line of code and see for yourself. We may not be able to cover these all in class; if so, make sure to try them on your own. Here's a Pandas Tutor link to visualize these!

In [60]:
# jack[1]
In [61]:
# jack[[1]]
In [62]:
# jack['1']
In [63]:
# jack[[1, 1]]
In [64]:
# jack.loc[1]
In [65]:
# jack.loc[jack[1] == 'fo']
In [66]:
# jack[1, ['1', 1]]
In [67]:
# jack.loc[1,1]

Adding and modifying columns¶

Adding and modifying columns, using a copy¶

  • To add a new column to a DataFrame, use the assign method.
    • To change the values in a column, add a new column with the same name as the existing column.
  • Like most pandas methods, assign returns a new DataFrame.
    • Pro ✅: This doesn't inadvertently change any existing variables.
    • Con ❌: It is not very memory efficient, as it creates a new copy each time it is called.
In [68]:
dogs.assign(cost_per_year=dogs['lifetime_cost'] / dogs['longevity'])
Out[68]:
kind lifetime_cost longevity size weight height cost_per_year
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0 1748.37
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0 1589.02
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0 1628.90
... ... ... ... ... ... ... ...
Bullmastiff working 13936.0 7.57 large 115.0 25.5 1840.95
Mastiff working 13581.0 6.50 large 175.0 30.0 2089.38
Saint Bernard working 20022.0 7.78 large 155.0 26.5 2573.52

43 rows × 7 columns

In [69]:
dogs
Out[69]:
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
... ... ... ... ... ... ...
Bullmastiff working 13936.0 7.57 large 115.0 25.5
Mastiff working 13581.0 6.50 large 175.0 30.0
Saint Bernard working 20022.0 7.78 large 155.0 26.5

43 rows × 6 columns

💡 Pro-Tip: Method chaining¶

Chain methods together instead of writing long, hard-to-read lines.

In [70]:
# Finds the rows corresponding to the five cheapest to own breeds on a per-year basis.
(dogs
 .assign(cost_per_year=dogs['lifetime_cost'] / dogs['longevity'])
 .sort_values('cost_per_year')
 .iloc[:5]
)
Out[70]:
kind lifetime_cost longevity size weight height cost_per_year
breed
Maltese toy 19084.0 12.25 small 5.0 9.00 1557.88
Lhasa Apso non-sporting 22031.0 13.92 small 15.0 10.50 1582.69
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.00 1589.02
Chihuahua toy 26250.0 16.50 small 5.5 5.00 1590.91
Shih Tzu toy 21152.0 13.20 small 12.5 9.75 1602.42

Adding and modifying columns, in-place¶

  • You can assign a new column to a DataFrame in-place using [].
    • This works like dictionary assignment.
    • This modifies the underlying DataFrame, unlike assign, which returns a new DataFrame.
  • This is the more "common" way of adding/modifying columns.
    • ⚠️ Warning: Exercise caution when using this approach, since this approach changes the values of existing variables.
In [71]:
# By default, .copy() returns a deep copy of the object it is called on,
# meaning that if you change the copy the original remains unmodified.
dogs_copy = dogs.copy()
dogs_copy.head(2)
Out[71]:
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
In [72]:
dogs_copy['cost_per_year'] = dogs_copy['lifetime_cost'] / dogs_copy['longevity']
dogs_copy
Out[72]:
kind lifetime_cost longevity size weight height cost_per_year
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0 1748.37
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0 1589.02
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0 1628.90
... ... ... ... ... ... ... ...
Bullmastiff working 13936.0 7.57 large 115.0 25.5 1840.95
Mastiff working 13581.0 6.50 large 175.0 30.0 2089.38
Saint Bernard working 20022.0 7.78 large 155.0 26.5 2573.52

43 rows × 7 columns

Note that we never reassigned dogs_copy in the cell above – that is, we never wrote dogs_copy = ... – though it was still modified.

Mutability¶

DataFrames, like lists, arrays, and dictionaries, are mutable. This means that they can be modified after being created. (For instance, the list .append method mutates in-place.)

Not only does this explain the behavior on the previous slide, but it also explains the following:

In [73]:
dogs_copy
Out[73]:
kind lifetime_cost longevity size weight height cost_per_year
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0 1748.37
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0 1589.02
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0 1628.90
... ... ... ... ... ... ... ...
Bullmastiff working 13936.0 7.57 large 115.0 25.5 1840.95
Mastiff working 13581.0 6.50 large 175.0 30.0 2089.38
Saint Bernard working 20022.0 7.78 large 155.0 26.5 2573.52

43 rows × 7 columns

In [74]:
def cost_in_thousands():
    dogs_copy['lifetime_cost'] = dogs_copy['lifetime_cost'] / 1000
In [75]:
# What happens when we run this twice?
cost_in_thousands()
In [76]:
dogs_copy
Out[76]:
kind lifetime_cost longevity size weight height cost_per_year
breed
Brittany sporting 22.59 12.92 medium 35.0 19.0 1748.37
Cairn Terrier terrier 21.99 13.84 small 14.0 10.0 1589.02
English Cocker Spaniel sporting 18.99 11.66 medium 30.0 16.0 1628.90
... ... ... ... ... ... ... ...
Bullmastiff working 13.94 7.57 large 115.0 25.5 1840.95
Mastiff working 13.58 6.50 large 175.0 30.0 2089.38
Saint Bernard working 20.02 7.78 large 155.0 26.5 2573.52

43 rows × 7 columns

⚠️ Avoid mutation when possible¶

Note that dogs_copy was modified, even though we didn't reassign it! These unintended consequences can influence the behavior of test cases on labs and projects, among other things!

To avoid this, it's a good idea to avoid mutation when possible. If you must use mutation, include df = df.copy() as the first line in functions that take DataFrames as input.

Also, some methods let you use the inplace=True argument to mutate the original. Don't use this argument, since future pandas releases plan to remove it.

Example: What's in a name?¶

Lilith, Lilibet … Lucifer? How Baby Names Went to 'L'¶

This New York Times article claims that baby names beginning with "L" have become more popular over time.

Let's see if these claims are true, based on the data!

The data¶

What we're seeing below is a pandas DataFrame. The DataFrame contains one row for every combination of 'Name', 'Sex', and 'Year'.

In [77]:
baby = pd.read_csv('data/baby.csv')
baby
Out[77]:
Name Sex Count Year
0 Liam M 20456 2022
1 Noah M 18621 2022
2 Olivia F 16573 2022
... ... ... ... ...
2085155 Wright M 5 1880
2085156 York M 5 1880
2085157 Zachariah M 5 1880

2085158 rows × 4 columns

To get columns from a dataframe, using indexing syntax (like accessing a value from a Python dictionary).

In [78]:
baby['Count'].sum()
Out[78]:
np.int64(365296191)

How many unique names were there per year?¶

In [79]:
baby.groupby('Year').size()
Out[79]:
Year
1880     2000
1881     1934
1882     2127
        ...  
2020    31517
2021    31685
2022    31915
Length: 143, dtype: int64

A shortcut to the above is as follows:

In [80]:
baby['Year'].value_counts()
Out[80]:
Year
2008    35094
2007    34966
2009    34724
        ...  
1883     2084
1880     2000
1881     1934
Name: count, Length: 143, dtype: int64

Why doesn't the above Series actually contain the number of unique names per year?

In [81]:
baby[baby['Year'] == 1880]
Out[81]:
Name Sex Count Year
2083158 John M 9655 1880
2083159 William M 9532 1880
2083160 Mary F 7065 1880
... ... ... ... ...
2085155 Wright M 5 1880
2085156 York M 5 1880
2085157 Zachariah M 5 1880

2000 rows × 4 columns

In [82]:
baby[baby['Year'] == 1880].value_counts('Name')
Out[82]:
Name
Grace      2
Emma       2
Clair      2
          ..
Evaline    1
Evalena    1
Zula       1
Name: count, Length: 1889, dtype: int64

How many babies were recorded per year?¶

In [83]:
baby.groupby('Year')['Count'].sum()
Out[83]:
Year
1880     201484
1881     192690
1882     221533
         ...   
2020    3333981
2021    3379713
2022    3361896
Name: Count, Length: 143, dtype: int64
In [84]:
baby.groupby('Year')['Count'].sum().plot()

Don't worry about the code for now, we'll explain next time.

"'L' has to be like the consonant of the decade."¶

In [85]:
(baby
 .assign(first_letter=baby['Name'].str[0])
 .query('first_letter == "L"')
 .groupby('Year')
 ['Count']
 .sum()
 .plot(title='Number of Babies Born with an "L" Name Per Year')
)

What about individual names?¶

In [86]:
(baby
 .query('Name == "Siri"')
 .groupby('Year')
 ['Count']
 .sum()
 .plot(title='Number of Babies Born Named "Siri" Per Year')
)
In [87]:
def name_graph(name):
    return (baby
     .query(f'Name == "{name}"')
     .groupby('Year')
     ['Count']
     .sum()
     .plot(title=f'Number of Babies Born Named "{name}" Per Year')
    )
In [88]:
name_graph('Samuel')

What about other names?¶

In [89]:
name_graph(...)