from dsc80_utils import *
Lecture 1 – Introduction, Data Science Lifecycle¶
Welcome to DSC 259R! 🎉
Agenda¶
- Who are we?
- What does a data scientist do?
- What is this course about, and how will it run?
- The data science lifecycle.
- Example: What's in a name?
Instructor: Samuel Lau (call me Sam)¶
Prof. Sam Lau¶
- Assistant Teaching Professor, HDSI, UCSD
- Website: https://lau.ucsd.edu/
I design curriculum and invent tools for teaching programming and data science.
Bio: Ph.D. UCSD (2023), M.S. UC Berkeley (2018), B.S. UC Berkeley (2017).
- My second year as a professor at UCSD 🎉
- Pandas Tutor visualizes pandas code 📊: https://pandastutor.com/
- Learning Data Science, a free textbook on data science 📚: https://learningds.org
- Outside the classroom 👨🏫: cooking, eating out, woodworking
Course staff¶
In addition to the instructor, we have several staff members who are here to help you in discussion, office hours, and on Ed:
- 1 graduate student TA: Mizuho Fukada.
- 7 undergraduate tutors: Andrew Yang, Anish Kasam, Gabriel Cha, Luran Zhang, Qirui Zheng, Sunan Xu, and Ylesia Wu.
What is data science? 🤔¶
One common definition¶
One view is that data science is about drawing useful conclusions from data using computation. For example, by:
- Using Python to explore and visualize data.
- Using simulation to make inferences about a population, given just a sample.
- Making predictions about the future given data from the past.
Let's look at a few more definitions of data science.
What is data science?¶
There isn't agreement on which "Venn Diagram" is correct!

- Why not? The field is new and rapidly developing.
- Make sure you're solid on the fundamentals, then find a niche that you enjoy.
- Read Taylor, Battle of the Data Science Venn Diagrams.
What does a data scientist do?¶
The chart below is taken from the 2016 Data Science Salary Survey, administered by O'Reilly. They asked respondents what they spend their time doing on a daily basis. What do you notice?

The chart below is taken from the followup 2021 Data/AI Salary Survey, also administered by O'Reilly. They asked respondents:
What technologies will have the biggest effect on compensation in the coming year?

What does a data scientist do?¶
Our take: we are training you to ask and answer questions using data.
As you take more courses, we're training you to answer questions whose answers are ambiguous – this uncertainty is what makes data science challenging!
Let's look at some examples of data science in practice.
An excerpt from the article:
Global warming is precisely the kind of threat humans are awful at dealing with: a problem with enormous consequences over the long term, but little that is sharply visible on a personal level in the short term. Humans are hard-wired for quick fight-or-flight reactions in the face of an imminent threat, but not highly motivated to act against slow-moving and somewhat abstract problems, even if the challenges that they pose are ultimately dire.
Data science involves people 🧍¶
The decisions that we make as data scientists have the potential to impact the livelihoods of other people.
- Flu case forecasting.
- Admissions and hiring.
- Hyper-personalized ad recommendations.
What is this course really about, then?¶
- Good data analysis is not:
  - A simple application of a statistics formula.
  - A simple application of computer programs.
- There are many tools out there for data science, but they are merely tools. They don’t do any of the important thinking – that's where you come in!
Course content¶
Course goals¶
We will teach you to think like a data scientist.
In this course, you will...
- Get a taste of the "life of a data scientist."
- Practice translating potentially vague questions into quantitative questions about measurable observations.
- Learn to reason about "black-box" processes (e.g. complicated models).
- Understand computational and statistical implications of working with data.
- Learn to use real data tools (and rely on documentation).
Course outcomes¶
After this course, you will...
- Be prepared for internships and data science "take home" interviews!
- Be ready to create your own portfolio of personal projects.
Topics¶
- Week 1: Introduction to pandas.
- Week 2: DataFrames.
- Week 3: Working with messy data, hypothesis and permutation testing.
- Week 4: Missing values.
- Week 5: HTML, Midterm Exam.
- Week 6: Web and text data.
- Week 7: Text data, modeling.
- Week 8: Feature engineering and generalization.
- Week 9: Modeling in sklearn.
- Week 10: Classifier evaluation, fairness, conclusion.
- Week 11: Final Exam
The data science lifecycle 🚴¶
The scientific method¶
You learned about the scientific method in elementary school.

However, it hides a lot of complexity.
- Where did the hypothesis come from?
- What data are you modeling? Is the data sufficient?
- Under which conditions are the conclusions valid?
The data science lifecycle¶
All steps lead to more questions! We'll refer back to the data science lifecycle repeatedly throughout the quarter.
DataFrame Fundamentals¶
Let's do a review of pandas fundamentals that I assume you've seen before (but may need a refresher!).
# You'll see the Path(...) / subpath syntax a lot.
# It creates the correct path to your file,
# whether you're using Windows, macOS, or Linux.
dog_path = Path('data') / 'dogs43.csv'
dogs = pd.read_csv(dog_path)
dogs
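To see what the `/` operator on `Path` objects produces without needing the CSV file itself, here is a minimal sketch. It assumes only the standard-library `pathlib` module (which is where `Path` comes from); nothing is read from disk.

```python
from pathlib import Path

# Joining with / builds a path object; the OS-specific separator
# is only chosen when the path is converted to a string.
dog_path = Path('data') / 'dogs43.csv'

parts = dog_path.parts      # the individual components of the path
file_name = dog_path.name   # the final component
extension = dog_path.suffix # the file extension, including the dot
```

Because the path is built from components rather than from a hard-coded `'data/dogs43.csv'` string, the same line works unchanged across operating systems.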
Review: head, tail, shape, index, and sort_values¶
To extract the first or last few rows of a DataFrame, use the head or tail methods.
dogs.head(3)
dogs.tail(2)
The shape attribute returns the DataFrame's number of rows and columns.
dogs.shape
# The default index of a DataFrame is 0, 1, 2, 3, ...
dogs.index
And lastly, remember that to sort by a column, use the sort_values method. Like most DataFrame and Series methods, sort_values returns a new DataFrame, and doesn't modify the original.
# Note that the index is no longer 0, 1, 2, ...!
dogs.sort_values('height', ascending=False)
# This sorts by 'height',
# then breaks ties by 'longevity'.
# Note the difference in the last three rows between
# this DataFrame and the one above.
dogs.sort_values(['height', 'longevity'],
ascending=False)
Note that dogs is not the DataFrame above. To save our changes, we'd need to say something like dogs = dogs.sort_values(...).
dogs
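As a small self-contained check that sort_values returns a new DataFrame, here is a sketch on a toy DataFrame. The name `toy` and its values are made up for illustration; they are not the real dogs43.csv data.

```python
import pandas as pd

# Made-up values for illustration; not the real dogs43.csv data.
toy = pd.DataFrame({
    'breed': ['Pug', 'Beagle', 'Maltese'],
    'height': [11.5, 14.0, 9.0],
})

sorted_toy = toy.sort_values('height', ascending=False)

# The original keeps its row order; only the returned DataFrame is sorted.
original_order = toy['breed'].tolist()
sorted_order = sorted_toy['breed'].tolist()

# The sorted result carries the old index labels along with each row,
# so its index is no longer 0, 1, 2.
sorted_index = sorted_toy.index.tolist()
```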
Setting the index¶
Think of each row's index as its unique identifier or name. Often, we like to set the index of a DataFrame to a unique identifier if we have one available. We can do so with the set_index method.
dogs.set_index('breed')
# The above cell didn't involve an assignment statement,
# so dogs was unchanged.
dogs
# By reassigning dogs, our changes will persist.
dogs = dogs.set_index('breed')
dogs
# There used to be 7 columns, but now there are only 6!
dogs.shape
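The drop in column count can be reproduced on a toy DataFrame. This is only a sketch: `toy` and its values are made up for illustration.

```python
import pandas as pd

# Made-up values for illustration; not the real dogs43.csv data.
toy = pd.DataFrame({
    'breed': ['Pug', 'Beagle', 'Maltese'],
    'kind': ['toy', 'hound', 'toy'],
    'weight': [16.0, 25.0, 5.0],
})

by_breed = toy.set_index('breed')

shape_before = toy.shape      # 'breed' still counts as a column here
shape_after = by_breed.shape  # 'breed' has moved into the index

has_breed_column = 'breed' in by_breed.columns
index_name = by_breed.index.name
```

Moving 'breed' into the index removes it from the columns, which is why the second entry of shape goes down by one.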
💡 Pro-Tip: Displaying more rows/columns¶
Sometimes, you just want pandas to display a lot of rows and columns. You can use this helper function to do that. It's not important to understand how it works; feel free to just use it!
from IPython.display import display
def display_df(df, rows=pd.options.display.max_rows, cols=pd.options.display.max_columns):
    """Displays n rows and cols from df."""
    with pd.option_context("display.max_rows", rows,
                           "display.max_columns", cols):
        display(df)
display_df(dogs.sort_values('weight', ascending=False),
           rows=43)
Subsetting¶
We use subsetting (also called slicing) to get contiguous rows and columns out of our DataFrame.
- The standard way to select a column in pandas is by using the [] operator.
- Specifying a column name returns the column as a Series.
- Specifying a list of column names returns a DataFrame.
# Returns a Series.
dogs['kind']
# Returns a DataFrame.
dogs[['kind', 'size']]
# 🤔
dogs[['kind']]
# Breeds are stored in the index, which is not a column!
dogs['breed']
dogs.index
dogs
# What are the unique kinds of dogs?
dogs['kind'].unique()
# How many unique kinds of dogs are there?
dogs['kind'].nunique()
# What's the distribution of kinds?
dogs['kind'].value_counts()
# What's the mean of the 'longevity' column?
dogs['longevity'].mean()
# Tell me more about the 'weight' column.
dogs['weight'].describe()
# Sort the 'lifetime_cost' column. Note that here we're using sort_values on a Series, not a DataFrame!
dogs['lifetime_cost'].sort_values()
# Gives us the index of the largest value, not the largest value itself.
dogs['lifetime_cost'].idxmax()
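The Series-versus-DataFrame distinction, and the difference between idxmax and max, can be checked on a toy DataFrame. This is a sketch with made-up values; `toy` is a hypothetical stand-in for dogs.

```python
import pandas as pd

# Made-up values for illustration; not the real dogs43.csv data.
toy = pd.DataFrame(
    {'kind': ['toy', 'hound', 'toy'],
     'lifetime_cost': [18000.0, 17000.0, 19000.0]},
    index=pd.Index(['Pug', 'Beagle', 'Maltese'], name='breed'),
)

as_series = toy['kind']    # a single column name gives a Series
as_frame = toy[['kind']]   # a list of names gives a DataFrame, even with one column

priciest_breed = toy['lifetime_cost'].idxmax()  # the index label of the largest value
priciest_cost = toy['lifetime_cost'].max()      # the largest value itself
```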
Use loc to slice rows and columns using labels¶
loc uses row labels and column labels.
dogs
# The first argument is the row label.
# ↓
dogs.loc['Pug', 'longevity']
# ↑
# The second argument is the column label.
As an aside, loc is not a method – it's an indexer.
type(dogs.loc)
type(dogs.sort_values)
💡 Pro-Tip: Using Pandas Tutor¶
Pandas Tutor (developed by Sam) is a tool that you can use to visualize pandas code. It comes with your DSC 80 environment.
You can load the extension by adding the following line at the top of your notebook:
%reload_ext pandas_tutor
After that, you can render visualizations with the %%pt cell magic 🪄:
%reload_ext pandas_tutor
%set_pandas_tutor_options {"maxDisplayCols": 8, "nohover": True, "projectorMode": True}
%%pt
dogs.loc['Pug', 'longevity']
.loc is flexible 🧘¶
You can provide a sequence (list, array, Series) as either argument to .loc.
dogs
dogs.loc[['Cocker Spaniel', 'Labrador Retriever'], 'size']
dogs.loc[['Cocker Spaniel', 'Labrador Retriever'], ['kind', 'size', 'height']]
# Note that the 'weight' column is included!
dogs.loc[['Cocker Spaniel', 'Labrador Retriever'], 'lifetime_cost': 'weight']
dogs.loc[['Cocker Spaniel', 'Labrador Retriever'], :]
# Shortcut for the line above.
dogs.loc[['Cocker Spaniel', 'Labrador Retriever']]
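One detail worth checking by hand is that label slices in .loc include their endpoint, unlike ordinary Python slices. The sketch below uses a made-up `toy` DataFrame (hypothetical values, not the real dogs data) to compare the two.

```python
import pandas as pd

# Made-up values for illustration; not the real dogs43.csv data.
toy = pd.DataFrame(
    {'kind': ['toy', 'hound', 'toy'],
     'lifetime_cost': [18000.0, 17000.0, 19000.0],
     'longevity': [11.0, 12.5, 12.0],
     'weight': [16.0, 25.0, 5.0]},
    index=pd.Index(['Pug', 'Beagle', 'Maltese'], name='breed'),
)

# Row label first, column label second.
single_value = toy.loc['Pug', 'longevity']

# A label slice keeps both endpoints, so 'weight' is included.
label_slice = toy.loc[['Pug', 'Maltese'], 'lifetime_cost':'weight']
label_cols = label_slice.columns.tolist()

# An ordinary Python slice over the same column names drops the endpoint.
python_slice = list(toy.columns)[1:3]
```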
Filtering (or Querying)¶
- Filtering (aka querying) is the act of selecting rows in a DataFrame that satisfy certain condition(s).
- Comparisons with arrays (or Series) result in Boolean arrays (or Series).
- We can use comparisons along with the loc operator to filter a DataFrame.
dogs
dogs.loc[dogs['weight'] < 10]
dogs.loc[dogs.index.str.contains('Retriever')]
# Because querying is so common, there's a shortcut:
dogs[dogs.index.str.contains('Retriever')]
# Empty DataFrame – not an error!
dogs.loc[dogs['kind'] == 'beaver']
Note that because we set the index to 'breed' earlier, we can select rows based on dog breeds without having to query.
dogs
# Series!
dogs.loc['Maltese']
If 'breed' were instead a column, then we'd need to query to access information about a particular breed.
dogs_reset = dogs.reset_index()
dogs_reset
# DataFrame!
dogs_reset[dogs_reset['breed'] == 'Maltese']
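To make the mechanics of filtering concrete, the sketch below builds the Boolean Series explicitly before using it. The `toy` DataFrame and its values are made up for illustration.

```python
import pandas as pd

# Made-up values for illustration; not the real dogs43.csv data.
toy = pd.DataFrame(
    {'kind': ['toy', 'hound', 'toy'],
     'weight': [16.0, 25.0, 5.0]},
    index=pd.Index(['Pug', 'Beagle', 'Maltese'], name='breed'),
)

# The comparison is evaluated element by element, giving one Boolean per row.
mask = toy['weight'] < 10
light = toy.loc[mask]

# A mask that is False everywhere gives an empty DataFrame, not an error.
none_found = toy.loc[toy['kind'] == 'beaver']

# Selecting by index label gives a Series; filtering a column gives a DataFrame.
maltese_row = toy.loc['Maltese']
toy_reset = toy.reset_index()
maltese_frame = toy_reset[toy_reset['breed'] == 'Maltese']
```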
Filtering with multiple conditions¶
Remember, you need parentheses around each condition. Also, you must use the bitwise operators & and | instead of the standard and and or keywords. pandas makes weird decisions sometimes!
dogs
dogs[(dogs['weight'] < 20) & (dogs['kind'] == 'terrier')]
💡 Pro-Tip: Using .query¶
.query is a convenient way to query, since you don't need parentheses and you can use the and and or keywords.
dogs
dogs.query('weight < 20 and kind == "terrier"')
dogs.query('kind in ["sporting", "terrier"] and lifetime_cost < 20000')
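The sketch below checks that the two filtering styles agree, and shows what goes wrong with the and keyword. The parentheses are needed because & binds more tightly than comparisons like <. The `toy` DataFrame and its values are made up for illustration.

```python
import pandas as pd

# Made-up values for illustration; not the real dogs43.csv data.
toy = pd.DataFrame(
    {'kind': ['toy', 'terrier', 'terrier', 'hound'],
     'weight': [16.0, 9.0, 55.0, 25.0]},
    index=pd.Index(['Pug', 'Cairn Terrier', 'Airedale Terrier', 'Beagle'],
                   name='breed'),
)

with_masks = toy[(toy['weight'] < 20) & (toy['kind'] == 'terrier')]
with_query = toy.query('weight < 20 and kind == "terrier"')

# `and` asks for a single True/False for an entire Series,
# which pandas refuses to provide, so this raises a ValueError.
try:
    toy[(toy['weight'] < 20) and (toy['kind'] == 'terrier')]
    and_failed = False
except ValueError:
    and_failed = True
```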
Don't forget iloc!¶
iloc stands for "integer location." iloc is like loc, but it selects rows and columns based off of integer positions only, just like with 2D arrays.
dogs
dogs.iloc[1:15, :-2]
iloc is often most useful when we sort first. For instance, to find the weight of the longest-living dog breed in the dataset:
dogs.sort_values('longevity', ascending=False)['weight'].iloc[0]
# Finding the breed itself involves sorting, but not iloc.
dogs.sort_values('longevity', ascending=False).index[0]
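The sort-then-iloc pattern can be traced on a toy DataFrame small enough to check by eye. This is a sketch with made-up values; `toy` is a hypothetical stand-in for dogs.

```python
import pandas as pd

# Made-up values for illustration; not the real dogs43.csv data.
toy = pd.DataFrame(
    {'longevity': [11.0, 12.5, 12.0],
     'weight': [16.0, 25.0, 5.0]},
    index=pd.Index(['Pug', 'Beagle', 'Maltese'], name='breed'),
)

by_longevity = toy.sort_values('longevity', ascending=False)

# Position 0 after sorting is the longest-lived breed, whatever its label is.
weight_of_longest_lived = by_longevity['weight'].iloc[0]
longest_lived_breed = by_longevity.index[0]

# Integer slices in iloc exclude the endpoint, like Python lists.
first_two_rows_first_col = toy.iloc[:2, :1]
```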
Practice¶
Consider the DataFrame below.
jack = pd.DataFrame({1: ['fee', 'fi'],
'1': ['fo', 'fum']})
jack
For each of the following pieces of code, predict what the output will be. Then, uncomment the line of code and see for yourself. We may not be able to cover these all in class; if so, make sure to try them on your own. Here's a Pandas Tutor link to visualize these!
# jack[1]
# jack[[1]]
# jack['1']
# jack[[1, 1]]
# jack.loc[1]
# jack.loc[jack[1] == 'fo']
# jack[1, ['1', 1]]
# jack.loc[1,1]
Adding and modifying columns¶
Adding and modifying columns, using a copy¶
- To add a new column to a DataFrame, use the assign method.
  - To change the values in a column, add a new column with the same name as the existing column.
- Like most pandas methods, assign returns a new DataFrame.
  - Pro ✅: This doesn't inadvertently change any existing variables.
  - Con ❌: It is not very memory efficient, as it creates a new copy each time it is called.
dogs.assign(cost_per_year=dogs['lifetime_cost'] / dogs['longevity'])
dogs
💡 Pro-Tip: Method chaining¶
Chain methods together instead of writing long, hard-to-read lines.
# Finds the rows corresponding to the five cheapest to own breeds on a per-year basis.
(dogs
.assign(cost_per_year=dogs['lifetime_cost'] / dogs['longevity'])
.sort_values('cost_per_year')
.iloc[:5]
)
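Here is the same chain on a toy DataFrame whose numbers are easy to verify by hand. The values are made up for illustration, and `toy` is a hypothetical name.

```python
import pandas as pd

# Made-up values for illustration; not the real dogs43.csv data.
toy = pd.DataFrame(
    {'lifetime_cost': [22000, 12000, 18000],
     'longevity': [11, 12, 12]},
    index=pd.Index(['Pug', 'Beagle', 'Maltese'], name='breed'),
)

# Each step returns a new DataFrame that the next step operates on.
cheapest = (toy
    .assign(cost_per_year=toy['lifetime_cost'] / toy['longevity'])
    .sort_values('cost_per_year')
    .iloc[:2]
)

# assign did not add the column to the original DataFrame.
original_columns = toy.columns.tolist()
```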
Adding and modifying columns, in-place¶
- You can assign a new column to a DataFrame in-place using [].
  - This works like dictionary assignment.
  - This modifies the underlying DataFrame, unlike assign, which returns a new DataFrame.
- This is the more "common" way of adding/modifying columns.
- ⚠️ Warning: Exercise caution when using this approach, since this approach changes the values of existing variables.
# By default, .copy() returns a deep copy of the object it is called on,
# meaning that if you change the copy the original remains unmodified.
dogs_copy = dogs.copy()
dogs_copy.head(2)
dogs_copy['cost_per_year'] = dogs_copy['lifetime_cost'] / dogs_copy['longevity']
dogs_copy
Note that we never reassigned dogs_copy in the cell above – that is, we never wrote dogs_copy = ... – though it was still modified.
Mutability¶
DataFrames, like lists, arrays, and dictionaries, are mutable. This means that they can be modified after being created. (For instance, the list .append method mutates in-place.)
Not only does this explain the behavior on the previous slide, but it also explains the following:
dogs_copy
def cost_in_thousands():
    dogs_copy['lifetime_cost'] = dogs_copy['lifetime_cost'] / 1000
# What happens when we run this twice?
cost_in_thousands()
dogs_copy
⚠️ Avoid mutation when possible¶
Note that dogs_copy was modified, even though we didn't reassign it! These unintended consequences can influence the behavior of test cases on labs and projects, among other things!
To avoid this, it's a good idea to avoid mutation when possible. If you must use mutation, include df = df.copy() as the first line in functions that take DataFrames as input.
Also, some methods let you use the inplace=True argument to mutate the original. Don't use this argument, since future pandas releases plan to remove it.
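The df = df.copy() advice can be sketched as two versions of the same function. Both function names and the `toy` DataFrame (with made-up values) are hypothetical; the point is only to compare what happens to the caller's DataFrame.

```python
import pandas as pd

def cost_in_thousands_mutating(df):
    # Assigning with [] changes the very DataFrame the caller passed in.
    df['lifetime_cost'] = df['lifetime_cost'] / 1000
    return df

def cost_in_thousands_safe(df):
    df = df.copy()  # work on a copy, leaving the caller's DataFrame alone
    df['lifetime_cost'] = df['lifetime_cost'] / 1000
    return df

# Made-up values for illustration; not the real dogs43.csv data.
toy = pd.DataFrame({'lifetime_cost': [22000.0, 12000.0]},
                   index=['Pug', 'Beagle'])

safe_result = cost_in_thousands_safe(toy)
after_safe = toy['lifetime_cost'].tolist()      # unchanged

cost_in_thousands_mutating(toy)
cost_in_thousands_mutating(toy)                 # running twice divides twice
after_mutating = toy['lifetime_cost'].tolist()  # changed, without any reassignment
```

With the safe version, running the function any number of times gives the same result, which is the property test cases usually rely on.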
Example: What's in a name?¶
Lilith, Lilibet … Lucifer? How Baby Names Went to 'L'¶
This New York Times article claims that baby names beginning with "L" have become more popular over time.
Let's see if these claims are true, based on the data!
The data¶
What we're seeing below is a pandas DataFrame. The DataFrame contains one row for every combination of 'Name', 'Sex', and 'Year'.
baby = pd.read_csv('data/baby.csv')
baby
To get a column from a DataFrame, use indexing syntax (like accessing a value from a Python dictionary).
baby['Count'].sum()
How many unique names were there per year?¶
baby.groupby('Year').size()
A shortcut to the above is as follows:
baby['Year'].value_counts()
Why doesn't the above Series actually contain the number of unique names per year?
baby[baby['Year'] == 1880]
baby[baby['Year'] == 1880].value_counts('Name')
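One way to see the gap between rows per year and unique names per year is a toy table in which a name appears once for each sex. The sketch below uses made-up counts in a hypothetical `toy_baby` DataFrame with the same columns as baby; using nunique here is one possible approach, not necessarily the one taken in lecture.

```python
import pandas as pd

# Made-up counts for illustration; not the real baby.csv data.
toy_baby = pd.DataFrame({
    'Name': ['Leslie', 'Leslie', 'Mary', 'John'],
    'Sex': ['F', 'M', 'F', 'M'],
    'Count': [10, 20, 30, 40],
    'Year': [1880, 1880, 1880, 1880],
})

# size counts rows, and 'Leslie' contributes two rows (one per sex).
rows_per_year = toy_baby.groupby('Year').size()

# nunique counts distinct names within each year.
names_per_year = toy_baby.groupby('Year')['Name'].nunique()
```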
How many babies were recorded per year?¶
baby.groupby('Year')['Count'].sum()
baby.groupby('Year')['Count'].sum().plot()
Don't worry about the code for now; we'll explain next time.
"'L' has to be like the consonant of the decade."¶
(baby
.assign(first_letter=baby['Name'].str[0])
.query('first_letter == "L"')
.groupby('Year')
['Count']
.sum()
.plot(title='Number of Babies Born with an "L" Name Per Year')
)
What about individual names?¶
(baby
.query('Name == "Siri"')
.groupby('Year')
['Count']
.sum()
.plot(title='Number of Babies Born Named "Siri" Per Year')
)
def name_graph(name):
    return (baby
        .query(f'Name == "{name}"')
        .groupby('Year')
        ['Count']
        .sum()
        .plot(title=f'Number of Babies Born Named "{name}" Per Year')
    )
name_graph('Samuel')
What about other names?¶
name_graph(...)