In [ ]:
from dsc80_utils import *

Lecture 1 – Introduction, Data Science Lifecycle¶

Welcome to DSC 259R! 🎉

Agenda¶

  • Who are we?
  • What does a data scientist do?
  • What is this course about, and how will it run?
  • The data science lifecycle.
  • Example: What's in a name?

Instructor: Samuel Lau (call me Sam)¶

Prof. Sam Lau¶

  • Assistant Teaching Professor, HDSI, UCSD
  • Website: https://lau.ucsd.edu/

I design curriculum and invent tools for teaching programming and data science.

Bio: Ph.D. UCSD (2023), M.S. UC Berkeley (2018), B.S. UC Berkeley (2017).

  • My second year as a professor at UCSD 🎉
  • Pandas Tutor visualizes pandas code 📊: https://pandastutor.com/
  • Learning Data Science, a free textbook on data science 📚: https://learningds.org
  • Outside the classroom 👨‍🏫: cooking, eating out, woodworking

Course staff¶

In addition to the instructor, we have several staff members who are here to help you in discussion, office hours, and on Ed:

  • 1 graduate student TA: Mizuho Fukada.
  • 7 undergraduate tutors: Andrew Yang, Anish Kasam, Gabriel Cha, Luran Zhang, Qirui Zheng, Sunan Xu, and Ylesia Wu.

What is data science? 🤔¶

What is data science?¶



Everyone seems to have their own definition of what data science is!

One common definition¶

One view is that data science is about drawing useful conclusions from data using computation. For example, by:

  • Using Python to explore and visualize data.
  • Using simulation to make inferences about a population, given just a sample.
  • Making predictions about the future given data from the past.

Let's look at a few more definitions of data science.

What is data science?¶


In 2010, Drew Conway published his famous Data Science Venn Diagram.

What is data science?¶

There isn't agreement on which "Venn Diagram" is correct!

  • Why not? The field is new and rapidly developing.
  • Make sure you're solid on the fundamentals, then find a niche that you enjoy.
  • Read Taylor, Battle of the Data Science Venn Diagrams.

What does a data scientist do?¶

The chart below is taken from the 2016 Data Science Salary Survey, administered by O'Reilly. They asked respondents what they spend their time doing on a daily basis. What do you notice?

[Image: chart from the 2016 Data Science Salary Survey]

The chart below is taken from the followup 2021 Data/AI Salary Survey, also administered by O'Reilly. They asked respondents:

What technologies will have the biggest effect on compensation in the coming year?

[Image: chart from the 2021 Data/AI Salary Survey]

What does a data scientist do?¶

Our take: we are training you to ask and answer questions using data.

As you take more courses, we're training you to answer questions whose answers are ambiguous – this uncertainty is what makes data science challenging!

Let's look at some examples of data science in practice.

Do people care about climate change?¶

From How Americans Think About Climate Change, in Six Maps.


Do people care about climate change?¶


An excerpt from the article:

Global warming is precisely the kind of threat humans are awful at dealing with: a problem with enormous consequences over the long term, but little that is sharply visible on a personal level in the short term. Humans are hard-wired for quick fight-or-flight reactions in the face of an imminent threat, but not highly motivated to act against slow-moving and somewhat abstract problems, even if the challenges that they pose are ultimately dire.

Data science involves people 🧍¶

The decisions that we make as data scientists have the potential to impact the livelihoods of other people.

  • Flu case forecasting.
  • Admissions and hiring.
  • Hyper-personalized ad recommendations.

What is this course really about, then?¶

  • Good data analysis is not:
    • A simple application of a statistics formula.
    • A simple application of computer programs.
  • There are many tools out there for data science, but they are merely tools. They don’t do any of the important thinking – that's where you come in!

Course content¶

Course goals¶

We will teach you to think like a data scientist.

In this course, you will...

  • Get a taste of the "life of a data scientist."
  • Practice translating potentially vague questions into quantitative questions about measurable observations.
  • Learn to reason about "black-box" processes (e.g. complicated models).
  • Understand computational and statistical implications of working with data.
  • Learn to use real data tools (and rely on documentation).

Course outcomes¶

After this course, you will...

  • Be prepared for internships and data science "take home" interviews!
  • Be ready to create your own portfolio of personal projects.

Topics¶

  • Week 1: Introduction to pandas.
  • Week 2: DataFrames.
  • Week 3: Working with messy data, hypothesis and permutation testing.
  • Week 4: Missing values.
  • Week 5: HTML, Midterm Exam.
  • Week 6: Web and text data.
  • Week 7: Text data, modeling.
  • Week 8: Feature engineering and generalization.
  • Week 9: Modeling in sklearn.
  • Week 10: Classifier evaluation, fairness, conclusion.
  • Week 11: Final Exam

The data science lifecycle 🚴¶

The scientific method¶

You learned about the scientific method in elementary school.


However, it hides a lot of complexity.

  • Where did the hypothesis come from?
  • What data are you modeling? Is the data sufficient?
  • Under which conditions are the conclusions valid?

The data science lifecycle¶


All steps lead to more questions! We'll refer back to the data science lifecycle repeatedly throughout the quarter.

DataFrame Fundamentals¶

Let's do a review of pandas fundamentals that I assume you've seen before (but may need a refresher!).

Example: Dog Breeds (woof!) 🐶¶

Let's take a look at some data about dogs that comes from the American Kennel Club. Here's a cool plot made using our dataset.

[Image: plot made using the dog breeds dataset]
In [ ]:
# You'll see the Path(...) / subpath syntax a lot.
# It creates the correct path to your file,
# whether you're using Windows, macOS, or Linux.
dog_path = Path('data') / 'dogs43.csv'
dogs = pd.read_csv(dog_path)
dogs
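
As a small illustration of the / syntax, the sketch below builds the same relative path and inspects it. It only constructs the path object and never reads the file, so it should run even where data/dogs43.csv is absent.

```python
from pathlib import Path

# Path overloads the / operator to join path components.
dog_path = Path('data') / 'dogs43.csv'

file_name = dog_path.name        # final component of the path
components = dog_path.parts      # tuple of the individual components
portable = dog_path.as_posix()   # forward-slash form, regardless of OS
```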

Review: head, tail, shape, index, and sort_values¶

To extract the first or last few rows of a DataFrame, use the head or tail methods.

In [ ]:
dogs.head(3)
In [ ]:
dogs.tail(2)

The shape attribute returns the DataFrame's number of rows and columns.

In [ ]:
dogs.shape
In [ ]:
# The default index of a DataFrame is 0, 1, 2, 3, ...
dogs.index

And lastly, remember that to sort by a column, use the sort_values method. Like most DataFrame and Series methods, sort_values returns a new DataFrame, and doesn't modify the original.

In [ ]:
# Note that the index is no longer 0, 1, 2, ...!
dogs.sort_values('height', ascending=False)
In [ ]:
# This sorts by 'height',
# then breaks ties by 'longevity'.
# Note the difference in the last three rows between
# this DataFrame and the one above.
dogs.sort_values(['height', 'longevity'],
                 ascending=False)

Note that dogs is not the DataFrame above. To save our changes, we'd need to say something like dogs = dogs.sort_values....

In [ ]:
dogs
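
To make the "returns a new DataFrame" point concrete, here is a minimal sketch on a made-up three-row DataFrame (the name toy and its values are hypothetical, not taken from dogs43.csv). It checks that the original keeps its order and that the sorted result carries the original index labels along.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame({'breed': ['A', 'B', 'C'],
                    'height': [10.0, 25.0, 18.0]})

# sort_values returns a new, sorted DataFrame.
by_height = toy.sort_values('height', ascending=False)

# The original is untouched...
original_order = toy['height'].tolist()
# ...and the sorted result keeps each row's original index label.
sorted_index = by_height.index.tolist()
```

Here original_order is still in the order the rows were created, while sorted_index shows the labels rearranged to follow the sort.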

Setting the index¶

Think of each row's index as its unique identifier or name. Often, we like to set the index of a DataFrame to a unique identifier if we have one available. We can do so with the set_index method.

In [ ]:
dogs.set_index('breed')
In [ ]:
# The above cell didn't involve an assignment statement,
# so dogs was unchanged.
dogs
In [ ]:
# By reassigning dogs, our changes will persist.
dogs = dogs.set_index('breed')
dogs
In [ ]:
# There used to be 7 columns, but now there are only 6!
dogs.shape

💡 Pro-Tip: Displaying more rows/columns¶

Sometimes, you just want pandas to display a lot of rows and columns. You can use this helper function to do that. It's not important to understand how it works; feel free to just use it!

In [ ]:
from IPython.display import display
def display_df(df, rows=pd.options.display.max_rows, cols=pd.options.display.max_columns):
    """Displays up to `rows` rows and `cols` columns of df."""
    with pd.option_context("display.max_rows", rows,
                           "display.max_columns", cols):
        display(df)
In [ ]:
display_df(dogs.sort_values('weight', ascending=False),
           rows=43)

Subsetting¶

We use subsetting (also called slicing) to extract specific rows and columns from our DataFrame.

  • The standard way to select a column in pandas is by using the [] operator.
  • Specifying a column name returns the column as a Series.
  • Specifying a list of column names returns a DataFrame.
In [ ]:
# Returns a Series.
dogs['kind']
In [ ]:
# Returns a DataFrame.
dogs[['kind', 'size']]
In [ ]:
# 🤔
dogs[['kind']]
In [ ]:
# Breeds are stored in the index, which is not a column!
dogs['breed']
In [ ]:
dogs.index
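
The cells above can be summarized in one self-contained sketch. The DataFrame toy and its values are hypothetical stand-ins for dogs; the point is only the type of each result, and that asking for the index name with [] raises a KeyError.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame({'breed': ['Pug', 'Maltese'],
                    'kind': ['toy', 'toy'],
                    'weight': [16.0, 5.0]}).set_index('breed')

as_series = toy['kind']     # a single column name gives a Series
as_frame = toy[['kind']]    # a list of names gives a DataFrame, even with one name

# 'breed' now lives in the index, so it is no longer a column.
try:
    toy['breed']
    breed_is_column = True
except KeyError:
    breed_is_column = False
```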

Useful Series methods¶

There are a variety of useful methods that work on Series. You can see the entire list here. Many methods that work on a Series will also work on DataFrames, as we'll soon see.

In [ ]:
dogs
In [ ]:
# What are the unique kinds of dogs?
dogs['kind'].unique()
In [ ]:
# How many unique kinds of dogs are there?
dogs['kind'].nunique()
In [ ]:
# What's the distribution of kinds?
dogs['kind'].value_counts()
In [ ]:
# What's the mean of the 'longevity' column?
dogs['longevity'].mean()
In [ ]:
# Tell me more about the 'weight' column.
dogs['weight'].describe()
In [ ]:
# Sort the 'lifetime_cost' column. Note that here we're using sort_values on a Series, not a DataFrame!
dogs['lifetime_cost'].sort_values()
In [ ]:
# Gives us the index of the largest value, not the largest value itself.
dogs['lifetime_cost'].idxmax()
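
A quick sketch of the difference between idxmax and max, using a made-up Series (the labels and values are hypothetical). Passing the label from idxmax back into loc recovers the largest value.

```python
import pandas as pd

# Hypothetical costs, invented for illustration only.
costs = pd.Series([100, 300, 200], index=['A', 'B', 'C'])

label_of_max = costs.idxmax()           # the index label of the largest value
largest = costs.max()                   # the largest value itself
looked_up = costs.loc[label_of_max]     # same value, found via its label
```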

Use loc to slice rows and columns using labels¶

loc uses row labels and column labels.

In [ ]:
dogs
In [ ]:
# The first argument is the row label.
#        ↓
dogs.loc['Pug', 'longevity']
#                  ↑
# The second argument is the column label.

As an aside, loc is not a method – it's an indexer.

In [ ]:
type(dogs.loc)
In [ ]:
type(dogs.sort_values)

💡 Pro-Tip: Using Pandas Tutor¶

Pandas Tutor (developed by Sam) is a tool that you can use to visualize pandas code. It comes with your DSC 80 environment.

You can load the extension by adding the following line at the top of your notebook:

%reload_ext pandas_tutor

After that, you can render visualizations with the %%pt cell magic 🪄:

In [ ]:
%reload_ext pandas_tutor
%set_pandas_tutor_options {"maxDisplayCols": 8, "nohover": True, "projectorMode": True}
In [ ]:
%%pt
dogs.loc['Pug', 'longevity']

.loc is flexible 🧘¶

You can provide a sequence (list, array, Series) as either argument to .loc.

In [ ]:
dogs
In [ ]:
dogs.loc[['Cocker Spaniel', 'Labrador Retriever'], 'size']
In [ ]:
dogs.loc[['Cocker Spaniel', 'Labrador Retriever'], ['kind', 'size', 'height']]
In [ ]:
# Note that the 'weight' column is included!
dogs.loc[['Cocker Spaniel', 'Labrador Retriever'], 'lifetime_cost': 'weight']
In [ ]:
dogs.loc[['Cocker Spaniel', 'Labrador Retriever'], :]
In [ ]:
# Shortcut for the line above.
dogs.loc[['Cocker Spaniel', 'Labrador Retriever']]

Filtering (or Querying)¶

  • Filtering (aka querying) is the act of selecting rows in a DataFrame that satisfy certain condition(s).
  • Comparisons with arrays (or Series) result in Boolean arrays (or Series).
  • We can use comparisons along with the loc indexer to filter a DataFrame.
In [ ]:
dogs
In [ ]:
dogs.loc[dogs['weight'] < 10]
In [ ]:
dogs.loc[dogs.index.str.contains('Retriever')]
In [ ]:
# Because querying is so common, there's a shortcut:
dogs[dogs.index.str.contains('Retriever')]
In [ ]:
# Empty DataFrame – not an error!
dogs.loc[dogs['kind'] == 'beaver']

Note that because we set the index to 'breed' earlier, we can select rows based on dog breeds without having to query.

In [ ]:
dogs
In [ ]:
# Series!
dogs.loc['Maltese']

If 'breed' were instead a column, then we'd need to query to access information about a particular breed.

In [ ]:
dogs_reset = dogs.reset_index()
dogs_reset
In [ ]:
# DataFrame!
dogs_reset[dogs_reset['breed'] == 'Maltese']

Filtering with multiple conditions¶

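Remember, you need parentheses around each condition, because & and | bind more tightly than comparisons like < and ==. Also, you must use the bitwise operators & and | instead of the standard and and or keywords, which don't work elementwise on a Series. pandas makes weird decisions sometimes!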

In [ ]:
dogs
In [ ]:
dogs[(dogs['weight'] < 20) & (dogs['kind'] == 'terrier')]
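
To see why the and keyword doesn't work here, the sketch below runs both versions on a made-up DataFrame (toy and its values are hypothetical). The & version filters row by row; the and version raises a ValueError, because Python tries to reduce an entire Boolean Series to a single True or False.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame({'kind': ['terrier', 'toy', 'terrier'],
                    'weight': [15.0, 6.0, 40.0]},
                   index=['A', 'B', 'C'])

# Elementwise "and": each condition is wrapped in parentheses.
mask = (toy['weight'] < 20) & (toy['kind'] == 'terrier')
small_terriers = toy[mask]

# The and keyword asks for the truth value of a whole Series, which is ambiguous.
try:
    toy[(toy['weight'] < 20) and (toy['kind'] == 'terrier')]
    and_error = None
except ValueError as err:
    and_error = str(err)
```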

💡 Pro-Tip: Using .query¶

.query is a convenient way to query, since you don't need parentheses and you can use the and and or keywords.

In [ ]:
dogs
In [ ]:
dogs.query('weight < 20 and kind == "terrier"')
In [ ]:
dogs.query('kind in ["sporting", "terrier"] and lifetime_cost < 20000')
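
As a sanity check, the sketch below compares .query against the equivalent Boolean-mask filter on a made-up DataFrame (toy, its values, and the variable max_weight are hypothetical). It also uses the @ prefix, which lets a query string refer to a Python variable.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame({'kind': ['terrier', 'toy', 'terrier'],
                    'weight': [15.0, 6.0, 40.0]},
                   index=['A', 'B', 'C'])

max_weight = 20

# @max_weight refers to the Python variable defined above.
via_query = toy.query('weight < @max_weight and kind == "terrier"')
via_mask = toy[(toy['weight'] < max_weight) & (toy['kind'] == 'terrier')]

same_result = via_query.equals(via_mask)
```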

Don't forget iloc!¶

  • iloc stands for "integer location."
  • iloc is like loc, but it selects rows and columns based on integer positions only, just like with 2D arrays.
In [ ]:
dogs
In [ ]:
dogs.iloc[1:15, :-2]

iloc is often most useful when we sort first. For instance, to find the weight of the longest-living dog breed in the dataset:

In [ ]:
dogs.sort_values('longevity', ascending=False)['weight'].iloc[0]
In [ ]:
# Finding the breed itself involves sorting, but not iloc.
dogs.sort_values('longevity', ascending=False).index[0]
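
One difference is worth seeing side by side: label slices with loc include both endpoints (as with 'lifetime_cost': 'weight' earlier), while integer slices with iloc exclude the stop position, just like Python lists. A minimal sketch on a made-up DataFrame (toy and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame(
    {'kind': ['toy', 'sporting', 'terrier', 'herding'],
     'size': ['small', 'large', 'medium', 'large'],
     'weight': [5.0, 65.0, 20.0, 50.0]},
    index=['A', 'B', 'C', 'D'],
)

# loc: label-based, and both endpoints are included.
by_label = toy.loc['A':'C', 'kind':'size']

# iloc: position-based, and the stop position is excluded.
by_position = toy.iloc[0:2, 0:1]
```

Here by_label has rows A through C and the columns kind and size, while by_position has only rows A and B and the column kind.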

Practice¶

Consider the DataFrame below.

In [ ]:
jack = pd.DataFrame({1: ['fee', 'fi'],
                     '1': ['fo', 'fum']})
jack

For each of the following pieces of code, predict what the output will be. Then, uncomment the line of code and see for yourself. We may not be able to cover these all in class; if so, make sure to try them on your own. Here's a Pandas Tutor link to visualize these!

In [ ]:
# jack[1]
In [ ]:
# jack[[1]]
In [ ]:
# jack['1']
In [ ]:
# jack[[1, 1]]
In [ ]:
# jack.loc[1]
In [ ]:
# jack.loc[jack[1] == 'fo']
In [ ]:
# jack[1, ['1', 1]]
In [ ]:
# jack.loc[1,1]

Adding and modifying columns¶

Adding and modifying columns, using a copy¶

  • To add a new column to a DataFrame, use the assign method.
    • To change the values in a column, add a new column with the same name as the existing column.
  • Like most pandas methods, assign returns a new DataFrame.
    • Pro ✅: This doesn't inadvertently change any existing variables.
    • Con ❌: It is not very memory efficient, as it creates a new copy each time it is called.
In [ ]:
dogs.assign(cost_per_year=dogs['lifetime_cost'] / dogs['longevity'])
In [ ]:
dogs
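
The cells above add a new column; the sketch below also covers the other bullet, changing an existing column by reusing its name. It uses a made-up DataFrame (toy and its values are hypothetical) and checks that the original is left alone in both cases.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame({'lifetime_cost': [20000.0, 15000.0],
                    'longevity': [10.0, 12.0]},
                   index=['A', 'B'])

# Adding a new column returns a new DataFrame.
with_cost = toy.assign(cost_per_year=toy['lifetime_cost'] / toy['longevity'])

# Reusing an existing column name replaces that column in the returned copy.
in_thousands = toy.assign(lifetime_cost=toy['lifetime_cost'] / 1000)

# toy itself still has its original columns and values.
original_columns = toy.columns.tolist()
```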

💡 Pro-Tip: Method chaining¶

Chain methods together instead of writing long, hard-to-read lines.

In [ ]:
# Finds the rows corresponding to the five cheapest to own breeds on a per-year basis.
(dogs
 .assign(cost_per_year=dogs['lifetime_cost'] / dogs['longevity'])
 .sort_values('cost_per_year')
 .iloc[:5]
)

Adding and modifying columns, in-place¶

  • You can assign a new column to a DataFrame in-place using [].
    • This works like dictionary assignment.
    • This modifies the underlying DataFrame, unlike assign, which returns a new DataFrame.
  • This is the more "common" way of adding/modifying columns.
    • ⚠️ Warning: Exercise caution when using this approach, since this approach changes the values of existing variables.
In [ ]:
# By default, .copy() returns a deep copy of the object it is called on,
# meaning that if you change the copy the original remains unmodified.
dogs_copy = dogs.copy()
dogs_copy.head(2)
In [ ]:
dogs_copy['cost_per_year'] = dogs_copy['lifetime_cost'] / dogs_copy['longevity']
dogs_copy

Note that we never reassigned dogs_copy in the cell above – that is, we never wrote dogs_copy = ... – though it was still modified.

Mutability¶

DataFrames, like lists, arrays, and dictionaries, are mutable. This means that they can be modified after being created. (For instance, the list .append method mutates in-place.)

Not only does this explain the behavior on the previous slide, but it also explains the following:

In [ ]:
dogs_copy
In [ ]:
def cost_in_thousands():
    dogs_copy['lifetime_cost'] = dogs_copy['lifetime_cost'] / 1000
In [ ]:
# What happens when we run this twice?
cost_in_thousands()
In [ ]:
dogs_copy

⚠️ Avoid mutation when possible¶

Note that dogs_copy was modified, even though we didn't reassign it! These unintended consequences can influence the behavior of test cases on labs and projects, among other things!

To avoid this, it's a good idea to avoid mutation when possible. If you must use mutation, include df = df.copy() as the first line in functions that take DataFrames as input.
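
Here is one way the df = df.copy() advice might look in practice. This is a sketch, not the version used above: it assumes a variant of cost_in_thousands that takes the DataFrame as a parameter and returns the result, and it runs on a made-up DataFrame (toy and its values are hypothetical).

```python
import pandas as pd

def cost_in_thousands(df):
    # Work on a copy so the caller's DataFrame is never modified.
    df = df.copy()
    df['lifetime_cost'] = df['lifetime_cost'] / 1000
    return df

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame({'lifetime_cost': [20000.0, 15000.0]}, index=['A', 'B'])

# Calling the function twice gives the same answer both times.
once = cost_in_thousands(toy)
twice = cost_in_thousands(toy)
```

Because the function never touches its input, running the cell again does not divide by 1000 a second time.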

Also, some methods let you use the inplace=True argument to mutate the original. Don't use this argument, since future pandas releases plan to remove it.

Example: What's in a name?¶

Lilith, Lilibet … Lucifer? How Baby Names Went to 'L'¶

This New York Times article claims that baby names beginning with "L" have become more popular over time.

Let's see if these claims are true, based on the data!

The data¶

What we're seeing below is a pandas DataFrame. The DataFrame contains one row for every combination of 'Name', 'Sex', and 'Year'.

In [ ]:
baby = pd.read_csv('data/baby.csv')
baby

To get columns from a DataFrame, use indexing syntax (like accessing a value from a Python dictionary).

In [ ]:
baby['Count'].sum()

How many unique names were there per year?¶

In [ ]:
baby.groupby('Year').size()

A shortcut to the above is as follows:

In [ ]:
baby['Year'].value_counts()

Why doesn't the above Series actually contain the number of unique names per year?

In [ ]:
baby[baby['Year'] == 1880]
In [ ]:
baby[baby['Year'] == 1880].value_counts('Name')
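
One way to see the issue is on a tiny made-up table (toy_baby and its counts are hypothetical, not taken from baby.csv). A name given to babies of both sexes appears in two rows, so counting rows per year overcounts names; nunique counts each name once.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy_baby = pd.DataFrame({
    'Name': ['Leslie', 'Leslie', 'Mary', 'John'],
    'Sex': ['F', 'M', 'F', 'M'],
    'Count': [5, 10, 20, 30],
    'Year': [1880, 1880, 1880, 1880],
})

# size counts rows, so 'Leslie' is counted once per sex.
rows_per_year = toy_baby.groupby('Year').size()

# nunique counts each distinct name once per year.
unique_names_per_year = toy_baby.groupby('Year')['Name'].nunique()
```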

How many babies were recorded per year?¶

In [ ]:
baby.groupby('Year')['Count'].sum()
In [ ]:
baby.groupby('Year')['Count'].sum().plot()

Don't worry about the code for now; we'll explain it next time.

"'L' has to be like the consonant of the decade."¶

In [ ]:
(baby
 .assign(first_letter=baby['Name'].str[0])
 .query('first_letter == "L"')
 .groupby('Year')
 ['Count']
 .sum()
 .plot(title='Number of Babies Born with an "L" Name Per Year')
)

What about individual names?¶

In [ ]:
(baby
 .query('Name == "Siri"')
 .groupby('Year')
 ['Count']
 .sum()
 .plot(title='Number of Babies Born Named "Siri" Per Year')
)
In [ ]:
def name_graph(name):
    return (baby
     .query(f'Name == "{name}"')
     .groupby('Year')
     ['Count']
     .sum()
     .plot(title=f'Number of Babies Born Named "{name}" Per Year')
    )
In [ ]:
name_graph('Samuel')

What about other names?¶

In [ ]:
name_graph(...)