from dsc80_utils import *
I build metrics for self-driving cars, and lecture in my spare time (sorry to make you come in at 5 pm!)
Bio: Ph.D. UCSD (2019), B.S. Penn State (2014).
In addition to the instructor, we have several staff members who are here to help you in discussion, office hours, and on Ed:
Learn more about them at dsc80.com/staff.
In DSC 10, we told you that data science is about drawing useful conclusions from data using computation. In DSC 10, you:
Let's look at a few more definitions of data science.
There isn't agreement on which "Venn Diagram" is correct!
The chart below is taken from the 2016 Data Science Salary Survey, administered by O'Reilly. They asked respondents what they spend their time doing on a daily basis. What do you notice?
The chart below is taken from the followup 2021 Data/AI Salary Survey, also administered by O'Reilly. They asked respondents:
What technologies will have the biggest effect on compensation in the coming year?
Our take: in DSC 80, and in the DSC major more broadly, we are training you to ask and answer questions using data.
As you take more courses, we're training you to answer questions whose answers are ambiguous – this uncertainty is what makes data science challenging!
Let's look at some examples of data science in practice.
An excerpt from the article:
Global warming is precisely the kind of threat humans are awful at dealing with: a problem with enormous consequences over the long term, but little that is sharply visible on a personal level in the short term. Humans are hard-wired for quick fight-or-flight reactions in the face of an imminent threat, but not highly motivated to act against slow-moving and somewhat abstract problems, even if the challenges that they pose are ultimately dire.
The decisions that we make as data scientists have the potential to impact the livelihoods of other people.
DSC 80 teaches you to think like a data scientist.
In this course, you will...
After this course, you will...
babypandas
to pandas
.sklearn
In addition, you must fill out our Welcome Survey.
You will access all course content by pulling the course GitHub repository:
We will post HTML versions of lecture notebooks on the course website, but otherwise you must git pull from this repository to access all course materials (including blank copies of assignments).
git pull the course repo every time a new assignment comes out.
New for this quarter: Assignment deadlines are fairly flexible (as I'll explain soon). To help yourself stay on track with material, you can opt into lecture attendance. If you do, lecture attendance is worth 5% of your overall grade (instead of 0%) and the midterm and final are worth 2.5% less.
To get credit for a class, attend and participate in the activities for both lectures. Lowest two classes dropped.
In this course, you will learn by doing!
In DSC 80, assignments will usually consist of both a Jupyter Notebook and a .py file. You will write your code in the .py file; the Jupyter Notebook will contain problem descriptions and test cases. Lab 1 will explain the workflow.
Monday: Lab due
Tuesday: Lecture
Wednesday: Discussion, Project due
Thursday: Lecture
Friday: Lab due
🏃♂️💨💨💨
It is no secret that this course requires a lot of work – becoming fluent in working with data is hard!
Once you've tried to solve problems on your own, we're glad to help.
You learned about the scientific method in elementary school.
However, it hides a lot of complexity.
All steps lead to more questions! We'll refer back to the data science lifecycle repeatedly throughout the quarter.
This New York Times article claims that baby names beginning with "L" have become more popular over time.
Let's see if these claims are true, based on the data!
What we're seeing below is a pandas DataFrame. The DataFrame contains one row for every combination of 'Name', 'Sex', and 'Year'.
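As a rough way to check the "one row per combination" claim, the sketch below tests whether any ('Name', 'Sex', 'Year') combination repeats. The toy DataFrame is a hypothetical stand-in for data/baby.csv with made-up values; the same call should work on the full baby DataFrame.

```python
import pandas as pd

# Hypothetical miniature stand-in for data/baby.csv (values are made up).
toy = pd.DataFrame({
    'Name': ['Liam', 'Olivia', 'Jordan', 'Jordan', 'Liam'],
    'Sex': ['M', 'F', 'M', 'F', 'M'],
    'Count': [120, 110, 40, 35, 95],
    'Year': [2022, 2022, 2022, 2022, 2021],
})

# True if some (Name, Sex, Year) combination appears in more than one row.
has_repeats = toy.duplicated(subset=['Name', 'Sex', 'Year']).any()

# Dropping 'Sex' from the key can produce repeats: here 'Jordan' is
# recorded for both sexes in 2022.
name_year_repeats = toy.duplicated(subset=['Name', 'Year']).any()

print(has_repeats, name_year_repeats)
```

If has_repeats comes out False on the real data, each row is uniquely identified by those three columns.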
baby = pd.read_csv('data/baby.csv')
baby
 | Name | Sex | Count | Year |
---|---|---|---|---|
0 | Liam | M | 20456 | 2022 |
1 | Noah | M | 18621 | 2022 |
2 | Olivia | F | 16573 | 2022 |
... | ... | ... | ... | ... |
2085155 | Wright | M | 5 | 1880 |
2085156 | York | M | 5 | 1880 |
2085157 | Zachariah | M | 5 | 1880 |
2085158 rows × 4 columns
Recall from DSC 10 that to access a column in a DataFrame, you used the .get method.
baby.get('Count').sum()
365296191
Everything you learned in babypandas translates to pandas. However, the more common way of accessing a column in pandas involves dictionary syntax:
baby['Count'].sum()
365296191
You'll learn more about this in Thursday's lecture.
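To make the comparison concrete, here is a small sketch on a hypothetical toy DataFrame (made-up values, not the course data). It shows that both access styles return the same Series in pandas, and one place where they differ: asking for a column that does not exist.

```python
import pandas as pd

# Hypothetical toy data; the column names mirror the baby DataFrame.
toy = pd.DataFrame({
    'Name': ['Liam', 'Noah', 'Olivia'],
    'Count': [120, 110, 100],
})

via_get = toy.get('Count')      # babypandas-style access
via_brackets = toy['Count']     # dictionary-style access

# Both approaches return the same Series.
same = via_get.equals(via_brackets)

# In pandas, .get returns None for a missing column,
# while bracket access raises a KeyError.
missing = toy.get('Height')
try:
    toy['Height']
    raised = False
except KeyError:
    raised = True

print(same, missing, raised)
```

The behavior for missing columns described here is that of pandas itself; babypandas may handle that case differently.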
baby.groupby('Year').count()
Year | Name | Sex | Count |
---|---|---|---|
1880 | 2000 | 2000 | 2000 |
1881 | 1934 | 1934 | 1934 |
1882 | 2127 | 2127 | 2127 |
... | ... | ... | ... |
2020 | 31517 | 31517 | 31517 |
2021 | 31685 | 31685 | 31685 |
2022 | 31915 | 31915 | 31915 |
143 rows × 3 columns
A shortcut to the above is as follows:
baby['Year'].value_counts()
2008    35094
2007    34966
2009    34724
        ...
1883     2084
1880     2000
1881     1934
Name: Year, Length: 143, dtype: int64
Why doesn't the above Series actually contain the number of unique names per year?
baby[(baby['Year'] == 1880)]
 | Name | Sex | Count | Year |
---|---|---|---|---|
2083158 | John | M | 9655 | 1880 |
2083159 | William | M | 9532 | 1880 |
2083160 | Mary | F | 7065 | 1880 |
... | ... | ... | ... | ... |
2085155 | Wright | M | 5 | 1880 |
2085156 | York | M | 5 | 1880 |
2085157 | Zachariah | M | 5 | 1880 |
2000 rows × 4 columns
baby[(baby['Year'] == 1880)].value_counts('Name')
Name
Grace      2
Emma       2
Clair      2
          ..
Evaline    1
Evalena    1
Zula       1
Length: 1889, dtype: int64
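The output above suggests why row counts overstate the number of unique names: 1880 has 2000 rows but only 1889 distinct names, because some names are recorded under both sexes. One possible way to count distinct names per year is nunique, sketched here on a hypothetical toy DataFrame with made-up values.

```python
import pandas as pd

# Hypothetical toy data in which 'Jordan' appears under both sexes in 2022.
toy = pd.DataFrame({
    'Name': ['Liam', 'Olivia', 'Jordan', 'Jordan', 'Liam'],
    'Sex': ['M', 'F', 'M', 'F', 'M'],
    'Count': [120, 110, 40, 35, 95],
    'Year': [2022, 2022, 2022, 2022, 2021],
})

# Number of rows per year (what value_counts on 'Year' reports).
rows_per_year = toy['Year'].value_counts().sort_index()

# Number of distinct names per year.
unique_names_per_year = toy.groupby('Year')['Name'].nunique()

print(rows_per_year)
print(unique_names_per_year)
```

On the full data, the analogous call would be baby.groupby('Year')['Name'].nunique().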
baby.groupby('Year').sum()
/var/folders/02/pnxb037d1s55wxx4bpzhn2s00000gn/T/ipykernel_25735/2954294721.py:1: FutureWarning: The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function. baby.groupby('Year').sum()
Year | Count |
---|---|
1880 | 201484 |
1881 | 192690 |
1882 | 221533 |
... | ... |
2020 | 3333981 |
2021 | 3379713 |
2022 | 3361896 |
143 rows × 1 columns
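The FutureWarning above notes that the default for numeric_only is changing. A minimal sketch of two ways to make the intent explicit follows, using a hypothetical toy DataFrame with made-up values: either pass numeric_only=True, or select the 'Count' column before summing.

```python
import pandas as pd

# Hypothetical toy data with the same columns as the baby DataFrame.
toy = pd.DataFrame({
    'Name': ['Liam', 'Olivia', 'Jordan', 'Jordan', 'Liam'],
    'Sex': ['M', 'F', 'M', 'F', 'M'],
    'Count': [120, 110, 40, 35, 95],
    'Year': [2022, 2022, 2022, 2022, 2021],
})

# Option 1: sum only the numeric columns; the result is a DataFrame.
totals = toy.groupby('Year').sum(numeric_only=True)

# Option 2: pick the column first; the result is a Series.
totals_alt = toy.groupby('Year')['Count'].sum()

print(totals)
print(totals_alt)
```

Either form avoids relying on the default, so the result should not change across pandas versions.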
baby.groupby('Year').sum().plot()
(baby
 .assign(first_letter=baby['Name'].str[0])
 .query('first_letter == "L"')
 .groupby('Year')
 .sum()
 .plot(title='Number of Babies Born with an "L" Name Per Year')
)
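Raw counts can rise simply because more births are recorded each year, so a claim about popularity may be easier to judge as a share of all births. The sketch below is not part of the lecture's analysis; it is one possible normalization, shown on a hypothetical toy DataFrame with made-up values.

```python
import pandas as pd

# Hypothetical toy data (values are made up for illustration).
toy = pd.DataFrame({
    'Name': ['Liam', 'Olivia', 'Luna', 'Noah', 'Liam', 'Mary'],
    'Sex': ['M', 'F', 'F', 'M', 'M', 'F'],
    'Count': [100, 100, 50, 150, 20, 180],
    'Year': [2022, 2022, 2022, 2022, 1990, 1990],
})

# Total recorded births per year.
total_per_year = toy.groupby('Year')['Count'].sum()

# Births per year among names that start with "L".
starts_with_l = toy['Name'].str.startswith('L')
l_per_year = toy[starts_with_l].groupby('Year')['Count'].sum()

# Share of births with an "L" name; years with no "L" names become 0.
l_share = (l_per_year / total_per_year).fillna(0)

print(l_share)
```

Calling l_share.plot() on the full data would give a proportion curve to compare against the raw-count plot.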
(baby
 .query('Name == "Siri"')
 .groupby('Year')
 .sum()
 .plot(title='Number of Babies Born Named "Siri" Per Year')
)
def name_graph(name):
    return (baby
            .query(f'Name == "{name}"')
            .groupby('Year')
            .sum()
            .plot(title=f'Number of Babies Born Named "{name}" Per Year')
    )
name_graph('Brendan')
Visit http://q.dsc80.com/ to respond.
name_graph(...)
pandas
.