from dsc80_utils import *
Lecture 1 – Introduction, Data Science Lifecycle¶
Welcome to DSC 259R! 🎉
Agenda¶
- Who are we?
- What does a data scientist do?
- What is this course about, and how will it run?
- The data science lifecycle.
- Example: What's in a name?
Instructor: Samuel Lau (call me Sam)¶
Prof. Sam Lau¶
- Assistant Teaching Professor, HDSI, UCSD
- Website: https://lau.ucsd.edu/
I design curriculum and invent tools for teaching programming and data science.
Bio: Ph.D. UCSD (2023), M.S. UC Berkeley (2018), B.S. UC Berkeley (2017).
- My second year as a professor at UCSD 🎉
- Pandas Tutor visualizes
pandas
code 📊: https://pandastutor.com/ - Learning Data Science, a free textbook on data science 📚: https://learningds.org
- Outside the classroom 👨🏫: cooking, eating out, woodworking
Course staff¶
In addition to the instructor, we have several staff members who are here to help you in discussion, office hours, and on Ed:
- 1 graduate student TA: Mizuho Fukada.
- 7 undergraduate tutors: Andrew Yang, Anish Kasam, Gabriel Cha, Luran Zhang, Qirui Zheng, Sunan Xu, and Ylesia Wu.
What is data science? 🤔¶
One common definition¶
One view is that data science is about drawing useful conclusions from data using computation. For example, by:
- Using Python to explore and visualize data.
- Using simulation to make inferences about a population, given just a sample.
- Making predictions about the future given data from the past.
Let's look at a few more definitions of data science.
What is data science?¶
There isn't agreement on which "Venn Diagram" is correct!

- Why not? The field is new and rapidly developing.
- Make sure you're solid on the fundamentals, then find a niche that you enjoy.
- Read Taylor, Battle of the Data Science Venn Diagrams.
What does a data scientist do?¶
The chart below is taken from the 2016 Data Science Salary Survey, administered by O'Reilly. They asked respondents what they spend their time doing on a daily basis. What do you notice?

The chart below is taken from the followup 2021 Data/AI Salary Survey, also administered by O'Reilly. They asked respondents:
What technologies will have the biggest effect on compensation in the coming year?

What does a data scientist do?¶
Our take: we are training you to ask and answer questions using data.
As you take more courses, we're training you to answer questions whose answers are ambiguous – this uncertainly is what makes data science challenging!
Let's look at some examples of data science in practice.
An excerpt from the article:
Global warming is precisely the kind of threat humans are awful at dealing with: a problem with enormous consequences over the long term, but little that is sharply visible on a personal level in the short term. Humans are hard-wired for quick fight-or-flight reactions in the face of an imminent threat, but not highly motivated to act against slow-moving and somewhat abstract problems, even if the challenges that they pose are ultimately dire.
Data science involves people 🧍¶
The decisions that we make as data scientists have the potential to impact the livelihoods of other people.
- Flu case forecasting.
- Admissions and hiring.
- Hyper-personalized ad recommendations.
What is this course really about, then?¶
- Good data analysis is not:
- A simple application of a statistics formula.
- A simple application of computer programs.
- There are many tools out there for data science, but they are merely tools. They don’t do any of the important thinking – that's where you come in!
Course content¶
Course goals¶
We will teach you to think like a data scientist.
In this course, you will...
- Get a taste of the "life of a data scientist."
- Practice translating potentially vague questions into quantitative questions about measurable observations.
- Learn to reason about "black-box" processes (e.g. complicated models).
- Understand computational and statistical implications of working with data.
- Learn to use real data tools (and rely on documentation).
Course outcomes¶
After this course, you will...
- Be prepared for internships and data science "take home" interviews!
- Be ready to create your own portfolio of personal projects.
Topics¶
- Week 1: Introduction to
pandas
. - Week 2: DataFrames.
- Week 3: Working with messy data, hypothesis and permutation testing.
- Week 4: Missing values.
- Week 5: HTML, Midterm Exam.
- Week 6: Web and text data.
- Week 7: Text data, modeling.
- Week 8: Feature engineering and generalization.
- Week 9: Modeling in
sklearn
. - Week 10: Classifier evaluation, fairness, conclusion.
- Week 11: Final Exam
The data science lifecycle 🚴¶
The scientific method¶
You learned about the scientific method in elementary school.

However, it hides a lot of complexity.
- Where did the hypothesis come from?
- What data are you modeling? Is the data sufficient?
- Under which conditions are the conclusions valid?
The data science lifecycle¶
All steps lead to more questions! We'll refer back to the data science lifecycle repeatedly throughout the quarter.
DataFrame Fundamentals¶
Let's do a review of pandas
fundamentals that I assume you've seen before (but may need a refresher!).
# You'll see the Path(...) / subpath syntax a lot.
# It creates the correct path to your file,
# whether you're using Windows, macOS, or Linux.
dog_path = Path('data') / 'dogs43.csv'
dogs = pd.read_csv(dog_path)
dogs
breed | kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|---|
0 | Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
1 | Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
2 | English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
... | ... | ... | ... | ... | ... | ... | ... |
40 | Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
41 | Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
42 | Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
43 rows × 7 columns
Review: head
, tail
, shape
, index
, and sort_values
¶
To extract the first or last few rows of a DataFrame, use the head
or tail
methods.
dogs.head(3)
breed | kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|---|
0 | Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
1 | Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
2 | English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
dogs.tail(2)
breed | kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|---|
41 | Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
42 | Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
The shape
attribute returns the DataFrame's number of rows and columns.
dogs.shape
(43, 7)
# The default index of a DataFrame is 0, 1, 2, 3, ...
dogs.index
RangeIndex(start=0, stop=43, step=1)
And lastly, remember that to sort by a column, use the sort_values
method. Like most DataFrame and Series methods, sort_values
returns a new DataFrame, and doesn't modify the original.
# Note that the index is no longer 0, 1, 2, ...!
dogs.sort_values('height', ascending=False)
breed | kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|---|
41 | Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
36 | Borzoi | hound | 16176.0 | 9.08 | large | 82.5 | 28.0 |
34 | Newfoundland | working | 19351.0 | 9.32 | large | 125.0 | 27.0 |
... | ... | ... | ... | ... | ... | ... | ... |
29 | Dandie Dinmont Terrier | terrier | 21633.0 | 12.17 | small | 21.0 | 9.0 |
14 | Maltese | toy | 19084.0 | 12.25 | small | 5.0 | 9.0 |
8 | Chihuahua | toy | 26250.0 | 16.50 | small | 5.5 | 5.0 |
43 rows × 7 columns
# This sorts by 'height',
# then breaks ties by 'longevity'.
# Note the difference in the last three rows between
# this DataFrame and the one above.
dogs.sort_values(['height', 'longevity'],
ascending=False)
breed | kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|---|
41 | Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
36 | Borzoi | hound | 16176.0 | 9.08 | large | 82.5 | 28.0 |
34 | Newfoundland | working | 19351.0 | 9.32 | large | 125.0 | 27.0 |
... | ... | ... | ... | ... | ... | ... | ... |
14 | Maltese | toy | 19084.0 | 12.25 | small | 5.0 | 9.0 |
29 | Dandie Dinmont Terrier | terrier | 21633.0 | 12.17 | small | 21.0 | 9.0 |
8 | Chihuahua | toy | 26250.0 | 16.50 | small | 5.5 | 5.0 |
43 rows × 7 columns
Note that dogs
is not the DataFrame above. To save our changes, we'd need to say something like dogs = dogs.sort_values...
.
dogs
breed | kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|---|
0 | Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
1 | Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
2 | English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
... | ... | ... | ... | ... | ... | ... | ... |
40 | Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
41 | Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
42 | Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
43 rows × 7 columns
Setting the index¶
Think of each row's index as its unique identifier or name. Often, we like to set the index of a DataFrame to a unique identifier if we have one available. We can do so with the set_index
method.
dogs.set_index('breed')
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
43 rows × 6 columns
# The above cell didn't involve an assignment statement,
# so dogs was unchanged.
dogs
breed | kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|---|
0 | Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
1 | Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
2 | English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
... | ... | ... | ... | ... | ... | ... | ... |
40 | Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
41 | Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
42 | Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
43 rows × 7 columns
# By reassigning dogs, our changes will persist.
dogs = dogs.set_index('breed')
dogs
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
43 rows × 6 columns
# There used to be 7 columns, but now there are only 6!
dogs.shape
(43, 6)
💡 Pro-Tip: Displaying more rows/columns¶
Sometimes, you just want pandas
to display a lot of rows and columns. You can use this helper function to do that. It's not important to understand how it works, feel free to just use it!
from IPython.display import display
def display_df(df, rows=pd.options.display.max_rows, cols=pd.options.display.max_columns):
"""Displays n rows and cols from df."""
with pd.option_context("display.max_rows", rows,
"display.max_columns", cols):
display(df)
display_df(dogs.sort_values('weight', ascending=False),
rows=43)
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.00 |
Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.50 |
Newfoundland | working | 19351.0 | 9.32 | large | 125.0 | 27.00 |
Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.50 |
Bloodhound | hound | 13824.0 | 6.75 | large | 85.0 | 25.00 |
Borzoi | hound | 16176.0 | 9.08 | large | 82.5 | 28.00 |
Alaskan Malamute | working | 21986.0 | 10.67 | large | 80.0 | 24.00 |
Rhodesian Ridgeback | hound | 16530.0 | 9.10 | large | 77.5 | 25.50 |
Giant Schnauzer | working | 26686.0 | 10.00 | large | 77.5 | 25.50 |
Clumber Spaniel | sporting | 18084.0 | 10.00 | medium | 70.0 | 18.50 |
Labrador Retriever | sporting | 21299.0 | 12.04 | medium | 67.5 | 23.00 |
Chesapeake Bay Retriever | sporting | 16697.0 | 9.48 | large | 67.5 | 23.50 |
Irish Setter | sporting | 20323.0 | 11.63 | large | 65.0 | 26.00 |
German Shorthaired Pointer | sporting | 25842.0 | 11.46 | large | 62.5 | 24.00 |
Gordon Setter | sporting | 19605.0 | 11.10 | large | 62.5 | 25.00 |
Bull Terrier | terrier | 18490.0 | 10.21 | medium | 60.0 | 21.50 |
Golden Retriever | sporting | 21447.0 | 12.04 | medium | 60.0 | 22.75 |
Pointer | sporting | 24445.0 | 12.42 | large | 59.5 | 25.50 |
Afghan Hound | hound | 24077.0 | 11.92 | large | 55.0 | 26.00 |
Siberian Husky | working | 22049.0 | 12.58 | medium | 47.5 | 21.75 |
English Springer Spaniel | sporting | 21946.0 | 12.54 | medium | 45.0 | 19.50 |
Kerry Blue Terrier | terrier | 17240.0 | 9.40 | medium | 36.5 | 18.50 |
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.00 |
Staffordshire Bull Terrier | terrier | 21650.0 | 12.05 | medium | 31.0 | 15.00 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.00 |
Pembroke Welsh Corgi | herding | 23978.0 | 12.25 | small | 26.0 | 11.00 |
Cocker Spaniel | sporting | 24330.0 | 12.50 | small | 25.0 | 14.50 |
Tibetan Terrier | non-sporting | 20336.0 | 12.31 | small | 24.0 | 15.50 |
Basenji | hound | 22096.0 | 13.58 | medium | 23.0 | 16.50 |
Shetland Sheepdog | herding | 21006.0 | 12.53 | small | 22.0 | 14.50 |
Dandie Dinmont Terrier | terrier | 21633.0 | 12.17 | small | 21.0 | 9.00 |
Scottish Terrier | terrier | 17525.0 | 10.69 | small | 20.0 | 10.00 |
Pug | toy | 18527.0 | 11.00 | medium | 16.0 | 16.00 |
Miniature Schnauzer | terrier | 20087.0 | 11.81 | small | 15.5 | 13.00 |
Cavalier King Charles Spaniel | toy | 18639.0 | 11.29 | small | 15.5 | 12.50 |
Lhasa Apso | non-sporting | 22031.0 | 13.92 | small | 15.0 | 10.50 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.00 |
Shih Tzu | toy | 21152.0 | 13.20 | small | 12.5 | 9.75 |
Tibetan Spaniel | non-sporting | 25549.0 | 14.42 | small | 12.0 | 10.00 |
Norfolk Terrier | terrier | 24308.0 | 13.07 | small | 12.0 | 9.50 |
English Toy Spaniel | toy | 17521.0 | 10.10 | small | 11.0 | 10.00 |
Chihuahua | toy | 26250.0 | 16.50 | small | 5.5 | 5.00 |
Maltese | toy | 19084.0 | 12.25 | small | 5.0 | 9.00 |
Subsetting¶
We use subsetting (also called slicing) to get contiguous rows and columns out of our DataFrame.
- The standard way to select a column in
pandas
is by using the[]
operator. - Specifying a column name returns the column as a Series.
- Specifying a list of column names returns a DataFrame.
# Returns a Series.
dogs['kind']
breed Brittany sporting Cairn Terrier terrier English Cocker Spaniel sporting ... Bullmastiff working Mastiff working Saint Bernard working Name: kind, Length: 43, dtype: object
# Returns a DataFrame.
dogs[['kind', 'size']]
kind | size | |
---|---|---|
breed | ||
Brittany | sporting | medium |
Cairn Terrier | terrier | small |
English Cocker Spaniel | sporting | medium |
... | ... | ... |
Bullmastiff | working | large |
Mastiff | working | large |
Saint Bernard | working | large |
43 rows × 2 columns
# 🤔
dogs[['kind']]
kind | |
---|---|
breed | |
Brittany | sporting |
Cairn Terrier | terrier |
English Cocker Spaniel | sporting |
... | ... |
Bullmastiff | working |
Mastiff | working |
Saint Bernard | working |
43 rows × 1 columns
# Breeds are stored in the index, which is not a column!
dogs['breed']
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) File ~/repos/dsc80/private/.venv/lib/python3.13/site-packages/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key) 3804 try: -> 3805 return self._engine.get_loc(casted_key) 3806 except KeyError as err: File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc() File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc() File pandas/_libs/hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item() File pandas/_libs/hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item() KeyError: 'breed' The above exception was the direct cause of the following exception: KeyError Traceback (most recent call last) Cell In[19], line 2 1 # Breeds are stored in the index, which is not a column! ----> 2 dogs['breed'] File ~/repos/dsc80/private/.venv/lib/python3.13/site-packages/pandas/core/frame.py:4102, in DataFrame.__getitem__(self, key) 4100 if self.columns.nlevels > 1: 4101 return self._getitem_multilevel(key) -> 4102 indexer = self.columns.get_loc(key) 4103 if is_integer(indexer): 4104 indexer = [indexer] File ~/repos/dsc80/private/.venv/lib/python3.13/site-packages/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key) 3807 if isinstance(casted_key, slice) or ( 3808 isinstance(casted_key, abc.Iterable) 3809 and any(isinstance(x, slice) for x in casted_key) 3810 ): 3811 raise InvalidIndexError(key) -> 3812 raise KeyError(key) from err 3813 except TypeError: 3814 # If we have a listlike key, _check_indexing_error will raise 3815 # InvalidIndexError. Otherwise we fall through and re-raise 3816 # the TypeError. 3817 self._check_indexing_error(key) KeyError: 'breed'
dogs.index
Index(['Brittany', 'Cairn Terrier', 'English Cocker Spaniel', 'Cocker Spaniel', 'Shetland Sheepdog', 'Siberian Husky', 'Lhasa Apso', 'Miniature Schnauzer', 'Chihuahua', 'English Springer Spaniel', 'German Shorthaired Pointer', 'Pointer', 'Tibetan Spaniel', 'Labrador Retriever', 'Maltese', 'Shih Tzu', 'Irish Setter', 'Golden Retriever', 'Chesapeake Bay Retriever', 'Tibetan Terrier', 'Gordon Setter', 'Pug', 'Norfolk Terrier', 'English Toy Spaniel', 'Cavalier King Charles Spaniel', 'Basenji', 'Staffordshire Bull Terrier', 'Pembroke Welsh Corgi', 'Clumber Spaniel', 'Dandie Dinmont Terrier', 'Giant Schnauzer', 'Scottish Terrier', 'Kerry Blue Terrier', 'Afghan Hound', 'Newfoundland', 'Rhodesian Ridgeback', 'Borzoi', 'Bull Terrier', 'Alaskan Malamute', 'Bloodhound', 'Bullmastiff', 'Mastiff', 'Saint Bernard'], dtype='object', name='breed')
dogs
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
43 rows × 6 columns
# What are the unique kinds of dogs?
dogs['kind'].unique()
array(['sporting', 'terrier', 'herding', 'working', 'non-sporting', 'toy', 'hound'], dtype=object)
# How many unique kinds of dogs are there?
dogs['kind'].nunique()
7
# What's the distribution of kinds?
dogs['kind'].value_counts()
kind sporting 12 terrier 8 working 7 toy 6 hound 5 non-sporting 3 herding 2 Name: count, dtype: int64
# What's the mean of the 'longevity' column?
dogs['longevity'].mean()
np.float64(11.340697674418605)
# Tell me more about the 'weight' column.
dogs['weight'].describe()
count 43.00 mean 49.35 std 39.42 ... 50% 36.50 75% 67.50 max 175.00 Name: weight, Length: 8, dtype: float64
# Sort the 'lifetime_cost' column. Note that here we're using sort_values on a Series, not a DataFrame!
dogs['lifetime_cost'].sort_values()
breed Mastiff 13581.0 Bloodhound 13824.0 Bullmastiff 13936.0 ... German Shorthaired Pointer 25842.0 Chihuahua 26250.0 Giant Schnauzer 26686.0 Name: lifetime_cost, Length: 43, dtype: float64
# Gives us the index of the largest value, not the largest value itself.
dogs['lifetime_cost'].idxmax()
'Giant Schnauzer'
Use loc
to slice rows and columns using labels¶
loc
uses row labels and column labels.
dogs
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
43 rows × 6 columns
# The first argument is the row label.
# ↓
dogs.loc['Pug', 'longevity']
# ↑
# The second argument is the column label.
np.float64(11.0)
As an aside, loc
is not a method – it's an indexer.
type(dogs.loc)
pandas.core.indexing._LocIndexer
type(dogs.sort_values)
method
💡 Pro-Tip: Using Pandas Tutor¶
Pandas Tutor (developed by Sam) is a tool that you can use to visualize pandas
code. It comes with your DSC 80 environment.
You can load the extension by adding:
%reload_ext pandas_tutor
At the top of your notebook. After that, you can render visualizations with the %%pt
cell magic 🪄:
%reload_ext pandas_tutor
%set_pandas_tutor_options {"maxDisplayCols": 8, "nohover": True, "projectorMode": True}
%%pt
dogs.loc['Pug', 'longevity']
.loc
is flexible 🧘¶
You can provide a sequence (list, array, Series) as either argument to .loc
.
dogs
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
43 rows × 6 columns
dogs.loc[['Cocker Spaniel', 'Labrador Retriever'], 'size']
breed Cocker Spaniel small Labrador Retriever medium Name: size, dtype: object
dogs.loc[['Cocker Spaniel', 'Labrador Retriever'], ['kind', 'size', 'height']]
kind | size | height | |
---|---|---|---|
breed | |||
Cocker Spaniel | sporting | small | 14.5 |
Labrador Retriever | sporting | medium | 23.0 |
# Note that the 'weight' column is included!
dogs.loc[['Cocker Spaniel', 'Labrador Retriever'], 'lifetime_cost': 'weight']
lifetime_cost | longevity | size | weight | |
---|---|---|---|---|
breed | ||||
Cocker Spaniel | 24330.0 | 12.50 | small | 25.0 |
Labrador Retriever | 21299.0 | 12.04 | medium | 67.5 |
dogs.loc[['Cocker Spaniel', 'Labrador Retriever'], :]
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Cocker Spaniel | sporting | 24330.0 | 12.50 | small | 25.0 | 14.5 |
Labrador Retriever | sporting | 21299.0 | 12.04 | medium | 67.5 | 23.0 |
# Shortcut for the line above.
dogs.loc[['Cocker Spaniel', 'Labrador Retriever']]
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Cocker Spaniel | sporting | 24330.0 | 12.50 | small | 25.0 | 14.5 |
Labrador Retriever | sporting | 21299.0 | 12.04 | medium | 67.5 | 23.0 |
Filtering (or Querying)¶
- Filtering (aka querying) is the act of selecting rows in a DataFrame that satisfy certain condition(s).
- Comparisons with arrays (or Series) result in Boolean arrays (or Series).
- We can use comparisons along with the
loc
operator to filter a DataFrame.
dogs
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
43 rows × 6 columns
dogs.loc[dogs['weight'] < 10]
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Chihuahua | toy | 26250.0 | 16.50 | small | 5.5 | 5.0 |
Maltese | toy | 19084.0 | 12.25 | small | 5.0 | 9.0 |
dogs.loc[dogs.index.str.contains('Retriever')]
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Labrador Retriever | sporting | 21299.0 | 12.04 | medium | 67.5 | 23.00 |
Golden Retriever | sporting | 21447.0 | 12.04 | medium | 60.0 | 22.75 |
Chesapeake Bay Retriever | sporting | 16697.0 | 9.48 | large | 67.5 | 23.50 |
# Because querying is so common, there's a shortcut:
dogs[dogs.index.str.contains('Retriever')]
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Labrador Retriever | sporting | 21299.0 | 12.04 | medium | 67.5 | 23.00 |
Golden Retriever | sporting | 21447.0 | 12.04 | medium | 60.0 | 22.75 |
Chesapeake Bay Retriever | sporting | 16697.0 | 9.48 | large | 67.5 | 23.50 |
# Empty DataFrame – not an error!
dogs.loc[dogs['kind'] == 'beaver']
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed |
Note that because we set the index to 'breed'
earlier, we can select rows based on dog breeds without having to query.
dogs
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
43 rows × 6 columns
# Series!
dogs.loc['Maltese']
kind toy lifetime_cost 19084.0 longevity 12.25 size small weight 5.0 height 9.0 Name: Maltese, dtype: object
If 'breed'
was instead a column, then we'd need to query to access information about a particular breed.
dogs_reset = dogs.reset_index()
dogs_reset
breed | kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|---|
0 | Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
1 | Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
2 | English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
... | ... | ... | ... | ... | ... | ... | ... |
40 | Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
41 | Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
42 | Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
43 rows × 7 columns
# DataFrame!
dogs_reset[dogs_reset['breed'] == 'Maltese']
breed | kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|---|
14 | Maltese | toy | 19084.0 | 12.25 | small | 5.0 | 9.0 |
Filtering with multiple conditions¶
Remember, you need parentheses around each condition. Also, you must use the bitwise operators &
and |
instead of the standard and
and or
keywords. pandas
makes weird decisions sometimes!
dogs
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
43 rows × 6 columns
dogs[(dogs['weight'] < 20) & (dogs['kind'] == 'terrier')]
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
Miniature Schnauzer | terrier | 20087.0 | 11.81 | small | 15.5 | 13.0 |
Norfolk Terrier | terrier | 24308.0 | 13.07 | small | 12.0 | 9.5 |
💡 Pro-Tip: Using .query
¶
.query
is a convenient way to query, since you don't need parentheses and you can use the and
and or
keywords.
dogs
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
43 rows × 6 columns
dogs.query('weight < 20 and kind == "terrier"')
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
Miniature Schnauzer | terrier | 20087.0 | 11.81 | small | 15.5 | 13.0 |
Norfolk Terrier | terrier | 24308.0 | 13.07 | small | 12.0 | 9.5 |
dogs.query('kind in ["sporting", "terrier"] and lifetime_cost < 20000')
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
Chesapeake Bay Retriever | sporting | 16697.0 | 9.48 | large | 67.5 | 23.5 |
Gordon Setter | sporting | 19605.0 | 11.10 | large | 62.5 | 25.0 |
Clumber Spaniel | sporting | 18084.0 | 10.00 | medium | 70.0 | 18.5 |
Scottish Terrier | terrier | 17525.0 | 10.69 | small | 20.0 | 10.0 |
Kerry Blue Terrier | terrier | 17240.0 | 9.40 | medium | 36.5 | 18.5 |
Bull Terrier | terrier | 18490.0 | 10.21 | medium | 60.0 | 21.5 |
Don't forget iloc
!¶
iloc
stands for "integer location."iloc
is likeloc
, but it selects rows and columns based off of integer positions only, just like with 2D arrays.
dogs
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
43 rows × 6 columns
dogs.iloc[1:15, :-2]
kind | lifetime_cost | longevity | size | |
---|---|---|---|---|
breed | ||||
Cairn Terrier | terrier | 21992.0 | 13.84 | small |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium |
Cocker Spaniel | sporting | 24330.0 | 12.50 | small |
... | ... | ... | ... | ... |
Tibetan Spaniel | non-sporting | 25549.0 | 14.42 | small |
Labrador Retriever | sporting | 21299.0 | 12.04 | medium |
Maltese | toy | 19084.0 | 12.25 | small |
14 rows × 4 columns
iloc
is often most useful when we sort first. For instance, to find the weight of the longest-living dog breed in the dataset:
dogs.sort_values('longevity', ascending=False)['weight'].iloc[0]
np.float64(5.5)
# Finding the breed itself involves sorting, but not iloc.
dogs.sort_values('longevity', ascending=False).index[0]
'Chihuahua'
Practice¶
Consider the DataFrame below.
jack = pd.DataFrame({1: ['fee', 'fi'],
'1': ['fo', 'fum']})
jack
1 | 1 | |
---|---|---|
0 | fee | fo |
1 | fi | fum |
For each of the following pieces of code, predict what the output will be. Then, uncomment the line of code and see for yourself. We may not be able to cover these all in class; if so, make sure to try them on your own. Here's a Pandas Tutor link to visualize these!
# jack[1]
# jack[[1]]
# jack['1']
# jack[[1, 1]]
# jack.loc[1]
# jack.loc[jack[1] == 'fo']
# jack[1, ['1', 1]]
# jack.loc[1,1]
Adding and modifying columns¶
Adding and modifying columns, using a copy¶
- To add a new column to a DataFrame, use the
assign
method.- To change the values in a column, add a new column with the same name as the existing column.
- Like most
pandas
methods,assign
returns a new DataFrame.- Pro ✅: This doesn't inadvertently change any existing variables.
- Con ❌: It is not very memory efficient, as it creates a new copy each time it is called.
dogs.assign(cost_per_year=dogs['lifetime_cost'] / dogs['longevity'])
kind | lifetime_cost | longevity | size | weight | height | cost_per_year | |
---|---|---|---|---|---|---|---|
breed | |||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 | 1748.37 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 | 1589.02 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 | 1628.90 |
... | ... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 | 1840.95 |
Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 | 2089.38 |
Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 | 2573.52 |
43 rows × 7 columns
dogs
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
43 rows × 6 columns
💡 Pro-Tip: Method chaining¶
Chain methods together instead of writing long, hard-to-read lines.
# Finds the rows corresponding to the five cheapest to own breeds on a per-year basis.
(dogs
.assign(cost_per_year=dogs['lifetime_cost'] / dogs['longevity'])
.sort_values('cost_per_year')
.iloc[:5]
)
kind | lifetime_cost | longevity | size | weight | height | cost_per_year | |
---|---|---|---|---|---|---|---|
breed | |||||||
Maltese | toy | 19084.0 | 12.25 | small | 5.0 | 9.00 | 1557.88 |
Lhasa Apso | non-sporting | 22031.0 | 13.92 | small | 15.0 | 10.50 | 1582.69 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.00 | 1589.02 |
Chihuahua | toy | 26250.0 | 16.50 | small | 5.5 | 5.00 | 1590.91 |
Shih Tzu | toy | 21152.0 | 13.20 | small | 12.5 | 9.75 | 1602.42 |
Adding and modifying columns, in-place¶
- You can assign a new column to a DataFrame in-place using
[]
.- This works like dictionary assignment.
- This modifies the underlying DataFrame, unlike
assign
, which returns a new DataFrame.
- This is the more "common" way of adding/modifying columns.
- ⚠️ Warning: Exercise caution when using this approach, since this approach changes the values of existing variables.
# By default, .copy() returns a deep copy of the object it is called on,
# meaning that if you change the copy the original remains unmodified.
dogs_copy = dogs.copy()
dogs_copy.head(2)
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
dogs_copy['cost_per_year'] = dogs_copy['lifetime_cost'] / dogs_copy['longevity']
dogs_copy
kind | lifetime_cost | longevity | size | weight | height | cost_per_year | |
---|---|---|---|---|---|---|---|
breed | |||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 | 1748.37 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 | 1589.02 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 | 1628.90 |
... | ... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 | 1840.95 |
Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 | 2089.38 |
Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 | 2573.52 |
43 rows × 7 columns
Note that we never reassigned dogs_copy
in the cell above – that is, we never wrote dogs_copy = ...
– though it was still modified.
Mutability¶
DataFrames, like lists, arrays, and dictionaries, are mutable. This means that they can be modified after being created. (For instance, the list .append
method mutates in-place.)
Not only does this explain the behavior on the previous slide, but it also explains the following:
dogs_copy
kind | lifetime_cost | longevity | size | weight | height | cost_per_year | |
---|---|---|---|---|---|---|---|
breed | |||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 | 1748.37 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 | 1589.02 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 | 1628.90 |
... | ... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 | 1840.95 |
Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 | 2089.38 |
Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 | 2573.52 |
43 rows × 7 columns
def cost_in_thousands():
dogs_copy['lifetime_cost'] = dogs_copy['lifetime_cost'] / 1000
# What happens when we run this twice?
cost_in_thousands()
dogs_copy
kind | lifetime_cost | longevity | size | weight | height | cost_per_year | |
---|---|---|---|---|---|---|---|
breed | |||||||
Brittany | sporting | 22.59 | 12.92 | medium | 35.0 | 19.0 | 1748.37 |
Cairn Terrier | terrier | 21.99 | 13.84 | small | 14.0 | 10.0 | 1589.02 |
English Cocker Spaniel | sporting | 18.99 | 11.66 | medium | 30.0 | 16.0 | 1628.90 |
... | ... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13.94 | 7.57 | large | 115.0 | 25.5 | 1840.95 |
Mastiff | working | 13.58 | 6.50 | large | 175.0 | 30.0 | 2089.38 |
Saint Bernard | working | 20.02 | 7.78 | large | 155.0 | 26.5 | 2573.52 |
43 rows × 7 columns
⚠️ Avoid mutation when possible¶
Note that dogs_copy
was modified, even though we didn't reassign it! These unintended consequences can influence the behavior of test cases on labs and projects, among other things!
To avoid this, it's a good idea to avoid mutation when possible. If you must use mutation, include df = df.copy()
as the first line in functions that take DataFrames as input.
Also, some methods let you use the inplace=True
argument to mutate the original. Don't use this argument, since future pandas
releases plan to remove it.
Example: What's in a name?¶
Lilith, Lilibet … Lucifer? How Baby Names Went to 'L'¶
This New York Times article claims that baby names beginning with "L" have become more popular over time.
Let's see if these claims are true, based on the data!
The data¶
What we're seeing below is a pandas
DataFrame. The DataFrame contains one row for every combination of 'Name'
, 'Sex'
, and 'Year'
.
baby = pd.read_csv('data/baby.csv')
baby
Name | Sex | Count | Year | |
---|---|---|---|---|
0 | Liam | M | 20456 | 2022 |
1 | Noah | M | 18621 | 2022 |
2 | Olivia | F | 16573 | 2022 |
... | ... | ... | ... | ... |
2085155 | Wright | M | 5 | 1880 |
2085156 | York | M | 5 | 1880 |
2085157 | Zachariah | M | 5 | 1880 |
2085158 rows × 4 columns
To get columns from a dataframe, using indexing syntax (like accessing a value from a Python dictionary).
baby['Count'].sum()
np.int64(365296191)
How many unique names were there per year?¶
baby.groupby('Year').size()
Year 1880 2000 1881 1934 1882 2127 ... 2020 31517 2021 31685 2022 31915 Length: 143, dtype: int64
A shortcut to the above is as follows:
baby['Year'].value_counts()
Year 2008 35094 2007 34966 2009 34724 ... 1883 2084 1880 2000 1881 1934 Name: count, Length: 143, dtype: int64
Why doesn't the above Series actually contain the number of unique names per year?
baby[baby['Year'] == 1880]
Name | Sex | Count | Year | |
---|---|---|---|---|
2083158 | John | M | 9655 | 1880 |
2083159 | William | M | 9532 | 1880 |
2083160 | Mary | F | 7065 | 1880 |
... | ... | ... | ... | ... |
2085155 | Wright | M | 5 | 1880 |
2085156 | York | M | 5 | 1880 |
2085157 | Zachariah | M | 5 | 1880 |
2000 rows × 4 columns
baby[baby['Year'] == 1880].value_counts('Name')
Name Grace 2 Emma 2 Clair 2 .. Evaline 1 Evalena 1 Zula 1 Name: count, Length: 1889, dtype: int64
How many babies were recorded per year?¶
baby.groupby('Year')['Count'].sum()
Year 1880 201484 1881 192690 1882 221533 ... 2020 3333981 2021 3379713 2022 3361896 Name: Count, Length: 143, dtype: int64
baby.groupby('Year')['Count'].sum().plot()
Don't worry about the code for now, we'll explain next time.
"'L' has to be like the consonant of the decade."¶
(baby
.assign(first_letter=baby['Name'].str[0])
.query('first_letter == "L"')
.groupby('Year')
['Count']
.sum()
.plot(title='Number of Babies Born with an "L" Name Per Year')
)
What about individual names?¶
(baby
.query('Name == "Siri"')
.groupby('Year')
['Count']
.sum()
.plot(title='Number of Babies Born Named "Siri" Per Year')
)
def name_graph(name):
return (baby
.query(f'Name == "{name}"')
.groupby('Year')
['Count']
.sum()
.plot(title=f'Number of Babies Born Named "{name}" Per Year')
)
name_graph('Samuel')
What about other names?¶
name_graph(...)