In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib_inline.backend_inline import set_matplotlib_formats

set_matplotlib_formats("svg")
sns.set_context("poster")
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (10, 5)
pd.set_option("display.max_rows", 7)
pd.set_option("display.max_columns", 8)
pd.set_option("display.precision", 2)

Lecture 1 – Introduction, Data Science Lifecycle¶

DSC 80, Winter 2024¶

Welcome to DSC 80! 🎉

Agenda¶

Who are we?
What does a data scientist do?
What is this course about, and how will it run?
The data science lifecycle.
Example: What's in a name?

Instructor: Suraj Rampure (call me Suraj, pronounced "sooh-rudge")¶

Originally from Windsor, ON, Canada 🇨🇦.
BS and MS in Electrical Engineering and Computer Sciences from UC Berkeley 🐻.
Third year teaching in the Halıcıoğlu Data Science Institute at UCSD.
- 3rd time teaching DSC 80.
- Also running the senior capstone program for the second time.
- Previously taught DSC 10, 40A, 90, and 95.
Outside interests: traveling, hiking, eating out, watching basketball, FaceTiming my dog 🐶, etc.

Course staff¶

In addition to the instructor, we have several staff members who are here to help you in discussion, office hours, and on Ed:

1 graduate TA: Dylan Stockard.
9 undergraduate tutors: Gabriel Cha, Aritra Das, Weiyue Li, Jasmine Lo, Harshita Saha, Ethan Shapiro, Yutian Shi, Tiffany Yu, Diego Zavalza.

Learn more about them at dsc80.com/staff.

What is data science? 🤔¶

Everyone seems to have their own definition of what data science is!

The DSC 10 approach¶

In DSC 10, we told you that data science is about drawing useful conclusions from data using computation. In DSC 10, you:

Used Python to explore and visualize data.

Used simulation to make inferences about a population, given just a sample.

Made predictions about the future given data from the past.

Let's look at a few more definitions of data science.

What is data science?¶

In 2010, Drew Conway published his famous Data Science Venn Diagram.

What is data science?¶

There isn't agreement on which "Venn Diagram" is correct!

Why not? The field is new and rapidly developing.
Make sure you're solid on the fundamentals, then find a niche that you enjoy.
Read Taylor, Battle of the Data Science Venn Diagrams.

What does a data scientist do?¶

The chart below is taken from the 2016 Data Science Salary Survey, administered by O'Reilly. They asked respondents what they spend their time doing on a daily basis. What do you notice?

The chart below is taken from the followup 2021 Data/AI Salary Survey, also administered by O'Reilly. They asked respondents:

What technologies will have the biggest effect on compensation in the coming year?

What does a data scientist do?¶

My take: in DSC 80, and in the DSC major more broadly, we are training you to ask and answer questions using data.

As you take more courses, we're training you to answer questions whose answers are ambiguous – this uncertainty is what makes data science challenging!

Let's look at some examples of data science in practice.

Do people care about climate change?¶

From How Americans Think About Climate Change, in Six Maps.

Do people care about climate change?¶

An excerpt from the article:

Global warming is precisely the kind of threat humans are awful at dealing with: a problem with enormous consequences over the long term, but little that is sharply visible on a personal level in the short term. Humans are hard-wired for quick fight-or-flight reactions in the face of an imminent threat, but not highly motivated to act against slow-moving and somewhat abstract problems, even if the challenges that they pose are ultimately dire.

Data science involves people 🧍¶

The decisions that we make as data scientists have the potential to impact the livelihoods of other people.

Flu case forecasting.
Admissions and hiring.
Hyper-personalized ad recommendations.

What is this course really about, then?¶

Good data analysis is not:
- A simple application of a statistics formula.
- A simple application of computer programs.

There are many tools out there for data science, but they are merely tools. They don’t do any of the important thinking – that's where you come in!

Course content¶

Course goals¶

DSC 80 teaches you to think like a data scientist.

In this course, you will...

Get a taste of the "life of a data scientist."
Practice translating potentially vague questions into quantitative questions about measurable observations.
Learn to reason about "black-box" processes (e.g. complicated models).
Understand computational and statistical implications of working with data.
Learn to use real data tools (and rely on documentation).

Course outcomes¶

After this course, you will...

Be prepared for internships and data science "take home" interviews!
Be ready to create your own portfolio of personal projects.
Have the background and maturity to succeed in upper-division courses.

Topics¶

Week 1: From babypandas to pandas.
Week 2: DataFrames.
Week 3: Working with messy data, hypothesis and permutation testing.
Week 4: Missing values.
Week 5: HTML, Midterm Exam.
Week 6: Web and text data.
Week 7: Text data, modeling.
Week 8: Feature engineering and generalization.
Week 9: Modeling in sklearn.
Week 10: Classifier evaluation, fairness, conclusion.
Week 11: Final Exam

Course logistics¶

Course website¶

The course website is your one-stop shop for all things related to the course.

dsc80.com

Make sure to read the brand-new syllabus!

Getting set up¶

Ed: Q&A forum. You must be active here, since this is where all announcements will be made.
Gradescope: Where you will submit all assignments for autograding, and where all of your grades will live.
Canvas: No ❌.

In addition, you must fill out our Welcome Survey.

Accessing course content on GitHub¶

You will access all course content by pulling the course GitHub repository:

github.com/dsc-courses/dsc80-2024-wi

We will post HTML versions of lecture notebooks on the course website, but otherwise you must git pull from this repository to access all course materials (including blank copies of assignments).

Environment setup¶

You're required to set up a Python environment on your own computer.
To do so, follow the instructions on the Tech Support page of the course website.
Once you set up your environment, you will git pull the course repo every time a new assignment comes out.
Note: You will submit your work to Gradescope directly, without using Git.
We will try to post a demo video with Lab 1, and we'll help you with this in Discussion 1 tomorrow.

Lectures¶

Lectures are held in-person on Tuesdays and Thursdays from 3:30-4:50PM in Pepper Canyon Hall 109. Attendance is not required, but is encouraged. Lectures are podcasted.

New for this quarter: Some lectures, like this Thursday's lecture, will have pre-lecture readings that we release before lecture.
- They should only take ~20 minutes to complete.
- The idea is that by getting the introductory material out of the way, we can spend valuable class time on problem solving.
- We'll make an Ed announcement any time there's a pre-lecture reading, and will expect you to complete it.
- This is new and experimental; we'll tweak things as the quarter goes on if necessary.

Also: This Thursday's lecture will be on Zoom, since I'll be out of town presenting at a conference.

Assignments¶

In this course, you will learn by doing!

Labs (25%): 9 total, lowest score dropped. Usually due weekly on Mondays at 11:59PM, except when Monday is a holiday (like next week).
Projects (30% + 5% checkpoints): 5 total, no drops. Usually due on Thursdays at 11:59PM, and usually have a "checkpoint."

In DSC 80, assignments will usually consist of both a Jupyter Notebook and a .py file. You will write your code in the .py file; the Jupyter Notebook will contain problem descriptions and test cases. Lab 1 will explain the workflow.

Discussions and lab reflections¶

In order to have you reflect on your lab work, we will offer extra credit each week if you do all 3 of the following:

Submit the lab.
Attend discussion in-person on Wednesdays from 7-7:50PM in Pepper Canyon Hall 109, where we discuss solutions to the most recent lab.
Submit a lab reflection form to Gradescope by Thursday at 11:59PM.

Each week you do all 3, you'll earn 0.2% of extra credit – this could total 2% of extra credit.

Just for tomorrow's discussion, since there hasn't been a lab due yet, we'll give everyone who attends discussion the 0.2% extra credit.

Exams¶

Midterm Exam (15%): Thursday, February 8th, in-person during lecture.
Final Exam (25%): Tuesday, March 19th, 3-6PM, in-person (location TBD).
Your final exam score can redeem your midterm score (see the Syllabus for details).
Let us know on the Welcome Survey if you have a conflict.

A typical week in DSC 80¶

Resources¶

Your main resource will be lecture notebooks.
Most lectures also have supplemental readings that come from our course textbook, Learning Data Science. These are not required, but are highly recommended.

Support 🫂¶

It is no secret that this course requires a lot of work – becoming fluent in working with data is hard!

You will learn how to solve problems independently – documentation and the internet will be your friends.
Learning how to effectively check your work and debug is extremely useful.
Learning to stick with a problem (tenacity) is a very valuable skill, but don't be afraid to ask for help.

Once you've tried to solve problems on your own, we're glad to help.

We have several office hours in person each week. See the Calendar 📆 for details.
Ed is your friend too. Make your conceptual questions public, and make your debugging questions private.

Generative Artificial Intelligence¶

We know that tools like ChatGPT and GitHub Copilot can write code for you.
Feel free to use such tools with caution. Refer to the Generative AI section of the syllabus for details.
We trust that you're here to learn and do the work for yourself.
You won't be able to use ChatGPT on the exams, which are in-person and on paper, so make sure you understand how your code actually works.

You'll have to work a lot, but we'll make the time spent worth it.

The data science lifecycle 🚴¶

The scientific method¶

You learned about the scientific method in elementary school.

However, it hides a lot of complexity.

Where did the hypothesis come from?
What data are you modeling? Is the data sufficient?
Under which conditions are the conclusions valid?

The data science lifecycle¶

All steps lead to more questions! We'll refer back to the data science lifecycle repeatedly throughout the quarter.

Example: What's in a name?¶

Lilith, Lilibet … Lucifer? How Baby Names Went to 'L'¶

This New York Times article claims that baby names beginning with "L" have become more popular over time.

Let's see if these claims are true, based on the data!

The data¶

What we're seeing below is a pandas DataFrame. The DataFrame contains one row for every combination of 'Name', 'Sex', and 'Year'.

In [2]:

baby = pd.read_csv('data/baby.csv')
baby

Out[2]:

	Name	Sex	Count	Year
0	Liam	M	20456	2022
1	Noah	M	18621	2022
2	Olivia	F	16573	2022
...	...	...	...	...
2085155	Wright	M	5	1880
2085156	York	M	5	1880
2085157	Zachariah	M	5	1880

2085158 rows × 4 columns

Recall from DSC 10 that, to access a column in a DataFrame, you used the .get method.

In [3]:

baby.get('Count').sum()

Out[3]:

365296191

Everything you learned in babypandas translates to pandas. However, the more common way of accessing a column in pandas involves dictionary syntax:

In [4]:

baby['Count'].sum()

Out[4]:

365296191

You'll learn more about this in Thursday's lecture and pre-lecture reading.
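To make the comparison concrete, here is a minimal sketch of both ways of accessing a column. It assumes a tiny stand-in DataFrame, `baby_small`, built from the three 2022 rows shown in the output above, since `data/baby.csv` may not be on your machine. The column `'Height'` is hypothetical and deliberately missing.

```python
import pandas as pd

# Stand-in for data/baby.csv: just the first three rows shown above.
baby_small = pd.DataFrame({
    "Name": ["Liam", "Noah", "Olivia"],
    "Sex": ["M", "M", "F"],
    "Count": [20456, 18621, 16573],
    "Year": [2022, 2022, 2022],
})

via_get = baby_small.get("Count")    # babypandas-style access
via_brackets = baby_small["Count"]   # dictionary-style access

# Both approaches return the same pandas Series.
assert via_get.equals(via_brackets)

# One difference in pandas: .get returns None for a missing column,
# whereas baby_small["Height"] would raise a KeyError.
missing = baby_small.get("Height")
print(via_brackets.sum(), missing)
```

In pandas, a silent `None` from `.get` can hide a typo in a column name, which is one reason the bracket syntax is often preferred.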

How many unique names were there per year?¶

In [5]:

baby.groupby('Year').count()

Out[5]:

	Name	Sex	Count
Year
1880	2000	2000	2000
1881	1934	1934	1934
1882	2127	2127	2127
...	...	...	...
2020	31517	31517	31517
2021	31685	31685	31685
2022	31915	31915	31915

143 rows × 3 columns

A shortcut to the above is as follows:

In [6]:

baby['Year'].value_counts()

Out[6]:

2008    35094
2007    34966
2009    34724
        ...  
1883     2084
1880     2000
1881     1934
Name: Year, Length: 143, dtype: int64

Why doesn't the above Series actually contain the number of unique names per year?

In [7]:

baby[(baby['Year'] == 1880)]

Out[7]:

	Name	Sex	Count	Year
2083158	John	M	9655	1880
2083159	William	M	9532	1880
2083160	Mary	F	7065	1880
...	...	...	...	...
2085155	Wright	M	5	1880
2085156	York	M	5	1880
2085157	Zachariah	M	5	1880

2000 rows × 4 columns

In [8]:

baby[(baby['Year'] == 1880)].value_counts('Name')

Out[8]:

Name
Grace      2
Emma       2
Clair      2
          ..
Evaline    1
Evalena    1
Zula       1
Length: 1889, dtype: int64
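The output above suggests an answer: some names, like Grace, appear in two rows for 1880, presumably once for each 'Sex', so counting rows overcounts names. One possible fix, sketched below on a made-up miniature DataFrame (`baby_small`, with invented counts), is to count distinct values of 'Name' within each year using `nunique`.

```python
import pandas as pd

# Made-up miniature of the baby DataFrame; the counts are invented.
# "Grace" appears twice in 1880, once per sex.
baby_small = pd.DataFrame({
    "Name": ["Grace", "Grace", "Emma", "John", "Liam"],
    "Sex": ["F", "M", "F", "M", "M"],
    "Count": [10, 5, 20, 30, 40],
    "Year": [1880, 1880, 1880, 1880, 2022],
})

# Number of rows per year: counts (Name, Sex) combinations, not names.
rows_per_year = baby_small["Year"].value_counts().sort_index()

# Number of distinct names per year.
unique_names_per_year = baby_small.groupby("Year")["Name"].nunique()

print(rows_per_year)
print(unique_names_per_year)
```

In this toy example, 1880 has 4 rows but only 3 distinct names. That mirrors the full data, where 1880 has 2000 rows but the `value_counts('Name')` output above has length 1889.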

How many babies were recorded per year?¶

In [9]:

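# Select 'Count' first: newer pandas versions no longer drop string columns when summing.
baby.groupby('Year')[['Count']].sum()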

Out[9]:

	Count
Year
1880	201484
1881	192690
1882	221533
...	...
2020	3333981
2021	3379713
2022	3361896

143 rows × 1 columns

In [10]:

baby.groupby('Year')[['Count']].sum().plot();

"'L' has to be like the consonant of the decade."¶

In [11]:

(baby
 .assign(first_letter=baby['Name'].str[0])
 .query('first_letter == "L"')
 .groupby('Year')[['Count']]  # keep only the numeric column before summing
 .sum()
 .plot(title='Number of Babies Born with an "L" Name Per Year')
);
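A caveat about raw counts: the total number of babies recorded per year also changes a lot over time (201484 in 1880 versus 3361896 in 2022, per the output above). A rise in "L" names could therefore partly reflect more recorded births overall. One possible refinement, sketched here on made-up toy data, is to look at the share of babies with an "L" name each year. The names `baby_small` and `l_share` are hypothetical.

```python
import pandas as pd

# Made-up toy data standing in for the baby DataFrame.
baby_small = pd.DataFrame({
    "Name": ["Liam", "Noah", "Luna", "Mary", "Lena", "John"],
    "Sex": ["M", "M", "F", "F", "F", "M"],
    "Count": [30, 50, 20, 60, 10, 30],
    "Year": [2022, 2022, 2022, 1880, 1880, 1880],
})

# Total babies recorded in each year.
total_per_year = baby_small.groupby("Year")["Count"].sum()

# Babies whose name starts with "L" in each year.
l_per_year = (
    baby_small[baby_small["Name"].str.startswith("L")]
    .groupby("Year")["Count"]
    .sum()
)

# Division aligns on the Year index; years with no "L" names become 0.
l_share = (l_per_year / total_per_year).fillna(0)
print(l_share)
```

Calling `l_share.plot()` on the real data would show whether "L" names grew as a proportion of all babies, not just in absolute numbers.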

What about individual names?¶

In [12]:

(baby
 .query('Name == "Siri"')
 .groupby('Year')[['Count']]  # keep only the numeric column before summing
 .sum()
 .plot(title='Number of Babies Born Named "Siri" Per Year')
);

In [13]:

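def name_graph(name):
    # Plot the number of babies with the given name in each year.
    (baby
     .query(f'Name == "{name}"')
     .groupby('Year')[['Count']]  # keep only the numeric column before summing
     .sum()
     .plot(title=f'Number of Babies Born Named "{name}" Per Year')
    )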

In [14]:

name_graph('Suraj')
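As a side note, `name_graph` builds its query string with an f-string. A possible alternative, sketched below, uses the `@` syntax of pandas' `query`, which refers to a Python variable directly, and returns the per-year counts instead of plotting them. The helper name `counts_for_name` and the toy data are hypothetical, not part of the course code.

```python
import pandas as pd

# Made-up toy data standing in for the baby DataFrame.
baby_small = pd.DataFrame({
    "Name": ["Siri", "Siri", "Siri", "Noah"],
    "Sex": ["F", "F", "M", "M"],
    "Count": [10, 7, 5, 50],
    "Year": [2010, 2011, 2011, 2011],
})

def counts_for_name(df, name):
    """Return the total Count per Year for one name (hypothetical helper)."""
    # @name refers to the Python variable `name`, so no manual quoting is needed.
    return df.query("Name == @name").groupby("Year")["Count"].sum()

siri_counts = counts_for_name(baby_small, "Siri")
print(siri_counts)
```

Separating the computation from the plot makes the numbers easy to inspect. Calling `siri_counts.plot(title=...)` afterwards would draw the same kind of line chart.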

What about our names?¶

If this code doesn't run on your machine, run mamba install ipywidgets in your Terminal after setting up your environment.

In [15]:

from ipywidgets import widgets
from IPython.display import clear_output, display  # display is called explicitly below

In [16]:

# The first names of everyone in the class!
class_first = np.load('data/names.npy', allow_pickle=True)
class_first

Out[16]:

array(['Aakash', 'Aile', 'Ailinna', 'Akshay', 'Allen', 'Amelia', 'Andrea',
       'Angel', 'Angela', 'Anish', 'Aritra', 'Bingyan', 'Brandon',
       'Brendan', 'Cecilia', 'Chengxi', 'Chenlong', 'Chenxi', 'Chi-En',
       'Colin', 'Daniel', 'David', 'Dawson', 'Deepika', 'Diego', 'Diya',
       'Dylan', 'Ellie', 'Eshaan', 'Ethan', 'Gabriel', 'Hana', 'Hannah',
       'Harshita', 'Idhant', 'Ishaan', 'Jack', 'Jasmine', 'Jason',
       'Jesse', 'Jessica', 'Jialin', 'Jiaye', 'Jimmy', 'Jonathan',
       'Jordan', 'Jovanna', 'Joyce', 'Kailey', 'Kaitly', 'Kaiwen',
       'Katelyn', 'Kening', 'Krish', 'Krystal', 'Kyra', 'Lacha', 'Laura',
       'Leena', 'Levy', 'Lin', 'Liuyang', 'Marcus', 'Mark', 'Matilda',
       'Max', 'Megha', 'Mehak', 'Mihir', 'Minghan', 'Monica', 'Nadine',
       'Nancy', 'Natasha', 'Nian-Nian', 'Nida', 'Niha', 'Noah', 'Pansy',
       'Raymond', 'Rishaant', 'Rukun', 'Rushil', 'Samuel', 'Sarah',
       'Seanna', 'Sebastian', 'Shir', 'Sia', 'Sivaram', 'Stephanie',
       'Subika', 'Suhani', 'Suraj', 'Tatiana', 'Tianqi', 'Tiffany',
       'Utkarsh', 'Varun', 'Wan-Rong', 'Weijie', 'Weiyue', 'Yash',
       'Yeogyeong', 'Yihui', 'Yiran', 'Yujin', 'Yujun', 'Yutian',
       'Zening', 'Zhenghao', 'Zhihan', 'Zoey'], dtype=object)

In [17]:

dropdown_names = widgets.Dropdown(options=class_first, value='Suraj')

def dropdown_names_handler(change):
    if change['name'] == 'value' and (change['new'] != change['old']):
        clear_output()
        display(dropdown_names)
        name_graph(change['new'])
        
display(dropdown_names)
name_graph('Suraj')
dropdown_names.observe(dropdown_names_handler)

This week...¶

On Thursday, we'll do a deep dive into pandas.
- Be on the lookout for Thursday's pre-lecture reading.
- Remember, Thursday's lecture is on Zoom. I'll send the Zoom link on Ed tomorrow.
Lab 1 will be released by tomorrow.
Come to discussion tomorrow for help setting up your environment, which you'll need to do before working on Lab 1.
Also fill out the Welcome Survey and read the Syllabus!