import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib_inline.backend_inline import set_matplotlib_formats
set_matplotlib_formats("svg")
sns.set_context("poster")
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (10, 5)
pd.set_option("display.max_rows", 7)
pd.set_option("display.max_columns", 8)
pd.set_option("display.precision", 2)
Welcome to DSC 80! 🎉
Agenda¶
- Who are we?
- What does a data scientist do?
- What is this course about, and how will it run?
- The data science lifecycle.
- A fun example.
About the instructor¶
Prof. Sam Lau¶
- Assistant Teaching Professor, HDSI, UCSD
- Personal: https://www.samlau.me/
I design curriculum and invent tools for teaching programming and data science.
Bio: Ph.D. UCSD (2023), M.S. UC Berkeley (2018), B.S. UC Berkeley (2017).
- My first quarter as a professor at UCSD 🎉
- Pandas Tutor visualizes pandas code 📊: https://pandastutor.com/
- Learning Data Science, a free textbook on data science 📚: https://learningds.org
- Outside the classroom 👨🏫: cooking, eating out, woodworking
Course staff¶
In addition to the instructor, we have 2 TAs and 7 Tutors, who are here to help you in discussion, office hours, and on Ed:
TAs:
Giorgia Nicolaou, Dylan Stockard
Tutors:
Gabriel Cha, Jiayu (John) Chen, Doris Gao, Zelong (Alan) Wang, Sunan Xu, Tiffany Yu, Luran (Lauren) Zhang
Learn more about them at dsc80.com/staff.
What is data science? 🤔¶
The DSC 10 approach¶
In DSC 10, we told you that data science is about drawing useful conclusions from data using computation.
In DSC 10, you:
- Used Python to explore and visualize data.
- Used simulation to make inferences about a population, given just a sample.
- Made predictions about the future given data from the past.
Let's look at a few more definitions of data science.
What is data science?¶
There isn't agreement on which "Venn Diagram" is correct!
- Why not? The field is new and rapidly developing.
- Make sure you're solid on the fundamentals, then find a niche that you enjoy.
- Read Kolassa, Battle of the Data Science Venn Diagrams.
What does a data scientist do?¶
The chart below is taken from the 2016 Data Science Salary Survey, administered by O'Reilly. They asked respondents what they spend their time doing on a daily basis. What do you notice?
The chart below is taken from the followup 2021 Data/AI Salary Survey, also administered by O'Reilly. They asked respondents:
What technologies will have the biggest effect on compensation in the coming year?
What does a data scientist do?¶
My take: in DSC 80, and in the DSC major more broadly, we are training you to ask and answer questions using data.
As you take more courses, we're training you to answer questions whose answers are ambiguous – this uncertainty is what makes data science challenging!
Let's look at some examples of data science in practice.
Data science involves people 🧍¶
The decisions that we make as data scientists have the potential to impact the livelihoods of other people.
- Flu case forecasting.
- Admissions and hiring.
- Hyper-personalized ad recommendations.
What this course is about:¶
Good data analysis is not:
- A simple application of a statistics formula.
- A simple application of computer programs.
There are many tools out there for data science, but they are merely tools. They don’t do any of the important thinking – that's where you come in!
Course content¶
Course goals¶
DSC 80 teaches you to think like a data scientist.
In this course, you will...
- Practice translating potentially vague questions into quantitative questions about measurable observations.
- Learn to reason about "black-box" processes (e.g. complicated models).
- Understand computational and statistical implications of working with data.
- Learn to use real data tools (and rely on documentation).
- Get a taste of the "life of a data scientist."
Course outcomes¶
After this course, you will...
- Be prepared for internships and data science "take home" interviews!
- Be ready to create your own portfolio of personal projects.
- Have the background and maturity to succeed in upper-division courses.
Topics¶
- Week 1: From babypandas to pandas
- Week 2: DataFrames
- Week 3: Working with messy data, hypothesis testing
- Week 4: Missing values
- Week 5: HTML, Midterm Exam
- Week 6: Web data
- Week 7: Text data, modeling
- Week 8: Feature engineering, sklearn basics
- Week 9: sklearn pipelines and model evaluation
- Week 10: Classifier evaluation, fairness
- Week 11: Final Exam
Course logistics¶
Enrollment¶
146 enrolled, 17 on waitlist.
I can't help waitlisted people! But if you stick around, typically 5-10% of people drop the course within the first two weeks.
Getting set up¶
- Ed: Q&A forum. Must be active here, since this is where all announcements will be made.
- Gradescope: Where you will submit all assignments for autograding, and where all of your grades will live.
- Canvas: no.
In addition, you must fill out our Welcome Survey.
Accessing course content on GitHub¶
You will access all course content by pulling the course GitHub repository:
We will post HTML versions of lecture notebooks on the course website, but otherwise you must git pull from this repository to access all course materials (including blank copies of assignments).
Environment setup¶
- You have two choices:
- Set up your own Python environment (strongly recommended).
- Use DataHub.
- Either way, follow the instructions on the Tech Support page of the course website.
- Once you set up your environment, you will pull the course repo every time a new assignment comes out.
- Note: You will submit your work to Gradescope directly, without using Git.
- We will post a demo video with Lab 1.
Assignments¶
In this course, you will learn by doing!
- Labs (30%): 9 total. Due weekly on Mondays.
- Projects (35% + 5% checkpoints): 5 total. Due on Wednesdays, and usually have a "checkpoint."
In DSC 80, assignments will usually consist of both a Jupyter Notebook and a .py file. You will write your code in the .py file; the Jupyter Notebook will contain problem descriptions and test cases. Lab 1 will explain the workflow.
Discussions and lab reflections¶
In order to have you reflect on your lab work, we will offer extra credit each week if you do all 3 of the following:
- Submit the lab.
- Attend discussion in-person (Fridays 10-10:50AM in Center Hall 212), where we discuss solutions to the most recent lab.
- Submit a lab reflection form to Gradescope by Saturday.
Each week you do all 3, you'll earn 0.25% of extra credit – this could total 2%.
This scheme starts this week. Discussion will be podcasted.
Exams¶
- Midterm Exam (10%): Thursday, Nov 2, in-person during lecture.
- Final Exam (20%): Monday, Dec 11, 3-6PM, in-person (location TBD).
- Your final exam score can redeem your midterm score (see Syllabus for details).
- Let us know on the Exam Accommodations Form if you have a conflict.
A typical week in DSC 80¶
Resources¶
- Your main resource will be lecture notebooks.
- Most lectures also have supplemental readings that come from our course textbook, Learning Data Science.
Using Other People's Code¶
- Lots of code is available online: Stack Overflow, documentation, ChatGPT, etc.
- Use it all! I trust that you're here to learn.
- (Just don't copy off a friend.)
Support 🫂¶
It is no secret that this course requires a lot of work - becoming fluent with working with data is hard!
- You will learn how to solve problems independently – documentation and the internet will be your friends.
- Learning how to effectively check your work and debug is extremely useful.
- Learning to stick with a problem (tenacity) is a very valuable skill; but don't be afraid to ask for help.
Once you've tried to solve problems on your own, we're glad to help.
- Office hours are offered – most are in-person, but a few are remote. See the Calendar 📆 for details.
- Ed is your friend too. Make your conceptual questions public, and make your debugging questions private.
The data science lifecycle 🚴¶
The scientific method¶
You learned about the scientific method in elementary school.
However, it hides a lot of complexity.
- Where did the hypothesis come from?
- What data are you modeling? Is the data sufficient?
- Under which conditions are the conclusions valid?
The data science lifecycle¶
All steps lead to more questions! We'll refer back to the data science lifecycle repeatedly throughout the quarter.
Example: What's in a name?¶
Lilith, Lilibet … Lucifer? How Baby Names Went to ‘L’
https://www.nytimes.com/2021/06/12/style/lilibet-popular-baby-names.html?amp=&smid=fb-nytimes
Goal: See if claims are true, based on the data.
The Data¶
baby = pd.read_csv('data/baby.csv')
baby
| | Name | Sex | Count | Year |
|---|---|---|---|---|
| 0 | Liam | M | 20456 | 2022 |
| 1 | Noah | M | 18621 | 2022 |
| 2 | Olivia | F | 16573 | 2022 |
| ... | ... | ... | ... | ... |
| 2085155 | Wright | M | 5 | 1880 |
| 2085156 | York | M | 5 | 1880 |
| 2085157 | Zachariah | M | 5 | 1880 |
2085158 rows × 4 columns
# Total recorded babies per year; select 'Count' so the string columns aren't summed.
baby.groupby('Year')[['Count']].sum().plot();
"'L' has to be like the consonant of the decade."¶
# Total babies per year whose names start with "L".
(baby
    .assign(first_letter=baby['Name'].str[0])
    .query('first_letter == "L"')
    .groupby('Year')[['Count']]
    .sum()
    .plot()
);
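The plot above shows raw counts, which also rise and fall with the total number of recorded births. One possible refinement, sketched here and not part of the original analysis, is to look at the share of each year's babies whose names start with "L". The `toy` DataFrame and the `letter_share` helper are hypothetical stand-ins with invented counts; with the real data you would pass `baby` instead.

```python
import pandas as pd

# Tiny made-up stand-in for data/baby.csv (same columns, invented counts).
toy = pd.DataFrame({
    "Name": ["Liam", "Noah", "Luna", "Olivia", "Linda", "Mary", "Lisa", "John"],
    "Sex": ["M", "M", "F", "F", "F", "F", "F", "M"],
    "Count": [200, 100, 50, 150, 300, 500, 100, 100],
    "Year": [2022, 2022, 2022, 2022, 1950, 1950, 1950, 1950],
})

def letter_share(df, letter):
    """Fraction of each year's recorded babies whose name starts with `letter`."""
    total = df.groupby("Year")["Count"].sum()
    starts = df[df["Name"].str.startswith(letter)]
    with_letter = starts.groupby("Year")["Count"].sum()
    # Years with no matching names get a share of 0 rather than NaN.
    return (with_letter / total).fillna(0)

l_share = letter_share(toy, "L")
print(l_share)
```

Calling `letter_share(baby, "L").plot()` on the full dataset would then show whether "L" names are gaining ground relative to all names, not just in absolute terms.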
What about individual names?¶
# Babies named "Luna" per year, summed across sexes.
(baby
    .query('Name == "Luna"')
    .groupby('Year')[['Count']]
    .sum()
    .plot()
);
# Babies named "Siri" per year, summed across sexes.
(baby
    .query('Name == "Siri"')
    .groupby('Year')[['Count']]
    .sum()
    .plot()
);
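The two cells above repeat the same steps with a different name. As a minimal sketch, those steps could be wrapped in a small helper; `name_counts` and the `toy` DataFrame below are hypothetical, and the counts are invented for illustration.

```python
import pandas as pd

# Made-up rows shaped like data/baby.csv; the counts are invented.
toy = pd.DataFrame({
    "Name": ["Luna", "Luna", "Luna", "Siri", "Siri"],
    "Sex": ["F", "M", "F", "F", "F"],
    "Count": [10, 5, 40, 7, 3],
    "Year": [2000, 2000, 2020, 2000, 2020],
})

def name_counts(df, name):
    """Total babies per year with the given name, summed across sexes."""
    return df[df["Name"] == name].groupby("Year")["Count"].sum()

luna = name_counts(toy, "Luna")
siri = name_counts(toy, "Siri")
print(luna)
print(siri)
```

With the real data, `name_counts(baby, "Luna").plot()` should draw the same kind of line chart as the cells above.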
Things that are due¶
- OSD / Exam Conflict requests due tonight
- Welcome survey due Thursday
- Lab 1 released, due Monday!
Next time¶
- A deep dive into pandas.