Lecture 1 – Introduction, Data Science Lifecycle

DSC 80, Spring 2022

Welcome to DSC 80! 🎉

Agenda

About the instructor

Suraj Rampure (call me Suraj, pronounced “soo-rudge”)

Me with my mom, my dog, and my friend over spring break.

Course staff

In addition to the instructor, we have several other course staff members who are here to support you in discussion, office hours, and Campuswire.

Learn more about them at dsc80.com/staff.

What is data science? 🤔



Everyone seems to have their own definition of what data science is.

The DSC 10 definition

In DSC 10, we told you that data science is about drawing useful conclusions from data using computation.

Let's look at some other definitions.

What is data science?

In 2010, Drew Conway published his famous Data Science Venn Diagram.

What is data science?

There isn't agreement on which "Venn Diagram" is correct!

What does a data scientist do?

In 2016, O'Reilly administered a Data Science Salary Survey. Part of the survey asked self-identified data scientists which tasks they do on a regular basis.

What do you notice?

What does a data scientist do?

My take: in DSC 80, and in the DSC major more broadly, we are equipping you to ask and answer questions using data.

Let's look at some examples of data science in practice.

Moving average of the average number of guesses taken for each Wordle word, based on patterns shared on Twitter. (source)

Number of Wordle patterns shared per day on Twitter. (source)

Forecasting COVID cases

Results of the UCSD_NEU-DeepGLEAM COVID cases forecasting model for the upcoming week (source).

Forecasting COVID cases

Evaluation of case forecasts showed that more reported cases than expected fell outside the forecast prediction intervals for extended periods of time. Given this low reliability, COVID-19 case forecasts will no longer be posted by the Centers for Disease Control and Prevention. - CDC.gov

Depixelizer

A "Face Depixelizer" released in 2020 takes pixelated images and generates images that are perceptually realistic and downscale correctly.

What happened here? Why do you think this happened?

Data science involves people 🧍

The decisions that we make as data scientists have the potential to impact the livelihoods of other people.

Warning! ⚠️

“The purpose of computing is insight, not numbers.” - R. Hamming. Numerical Methods for Scientists and Engineers (1962).

Course content

Course goals

In this course, you will...

Course outcomes

After this course, you will...

Topics

This course was designed by a former data scientist at Amazon (Aaron Fraenkel). As such, you'll be learning skills that you need to know as a data scientist.

Course logistics

Course website

The course website is your one-stop shop for all things related to the course.


dsc80.com


Make sure to read the syllabus!

Getting set up

In addition, you must fill out our Welcome + Alternate Exams Form.

Accessing course content on GitHub

You will access all course content by pulling the course GitHub repository:


github.com/dsc-courses/dsc80-2022-sp


We will post HTML versions of lecture notebooks on the course website, but otherwise you must pull from this repository to access all course materials (including blank copies of assignments).

Environment setup

Course meetings

Assignments

In this course, you will learn by doing!

In DSC 80, assignments will usually consist of both a Jupyter Notebook and a .py file. You will write your code in the .py file; the Jupyter Notebook will contain problem descriptions and test cases. Lab 1 will explain the workflow.

Exams

Resources

Support 🫂

It is no secret that this course requires a lot of work - becoming fluent in working with data is hard!

Once you've tried to solve problems on your own, we're glad to help.

The data science lifecycle 🚴

The scientific method

You learned about the scientific method in elementary school.

However, it hides a lot of complexity.

The data science lifecycle

All steps lead to more questions!

Example: San Diego employee salaries

Research Domain and Questions

We have our domain – City of San Diego employee salaries. What are some questions we might want to ask?

Context

Why is this dataset relevant?

Find and Clean Data

Initial look at the data

Aside on privacy and ethics

Data cleaning

Empirical distribution of salaries

Let's plot the distribution of salaries.

Discussion Question

Which of the following best describes the distribution of San Diego employee salaries?

🙋 To answer, go to yellkey.com/job.

Empirical distribution of salaries

Let's draw the distribution of salaries separately for part-time and full-time employees.

Question: Does gender influence pay?

Social Security Administration baby names 👶

We began compiling the baby name list in 1997, with names dating back to 1880. At the time of a child’s birth, parents supply the name to us when applying for a child’s Social Security card, thus making Social Security America’s source for the most popular baby names. Please share this with your friends and family—and help us spread the word on social media. - Social Security’s Top Baby Names for 2020

Exploring names

Data Modeling

Determining the most common gender for each name
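
One way to find the most common gender for each name is sketched below. This is a minimal illustration rather than the lecture's own code: the tiny `names` table and its column names ('firstname', 'gender', 'count') are assumptions made for the example, not the actual schema of the Social Security Administration data.

```python
import pandas as pd

# Hypothetical miniature of the baby names data. The column names
# 'firstname', 'gender', and 'count' are assumptions for this sketch.
names = pd.DataFrame({
    'firstname': ['Avery', 'Avery', 'John', 'Mary', 'Mary'],
    'gender': ['F', 'M', 'M', 'F', 'M'],
    'count': [900, 300, 5000, 7000, 20],
})

# Total number of babies per name, with one column for each gender.
# Names never recorded with a given gender get a count of 0.
counts = names.pivot_table(
    index='firstname', columns='gender', values='count',
    aggfunc='sum', fill_value=0,
)

# For each name, keep the gender with the larger total.
genders = counts.idxmax(axis=1).rename('gender').reset_index()
print(genders)
```

Note that `idxmax` breaks ties by taking the first column, so a name with equal counts for both genders would be labeled with whichever gender column comes first.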


Adding a 'gender' column to salaries

This involves two steps:

  1. Extracting just the first name from 'Employee Name'.
  2. Merging salaries and genders.
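
The two steps above can be sketched as follows. Only the 'Employee Name' column name comes from the lecture; the example rows, the 'Total Pay' and 'firstname' columns, and the small `genders` lookup table are hypothetical stand-ins for the real data.

```python
import pandas as pd

# Hypothetical rows. Only the 'Employee Name' column name is from the lecture.
salaries = pd.DataFrame({
    'Employee Name': ['Mary Smith', 'John A Doe', 'Zyx Lee'],
    'Total Pay': [85000, 62000, 71000],
})

# Hypothetical name-to-gender lookup table.
genders = pd.DataFrame({
    'firstname': ['Mary', 'John'],
    'gender': ['F', 'M'],
})

# Step 1: treat everything before the first space as the first name.
salaries['firstname'] = salaries['Employee Name'].str.split(' ').str[0]

# Step 2: a left merge keeps every employee, including those whose
# first name does not appear in the lookup table (their gender is NaN).
with_gender = salaries.merge(genders, on='firstname', how='left')
print(with_gender)
```

A left merge is used here so that employees with unmatched names stay visible as missing values; an inner merge would silently drop them instead.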

Predictions and Inference

Question: Does gender influence pay?

This was our original question. Let's find out!

A hypothesis test

Strategy:

Running the hypothesis test
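
As a rough illustration of what running such a test could look like, here is a sketch of a permutation test. The choice of a permutation test, the test statistic (difference in median pay), the one-sided alternative, and the made-up salary numbers are all assumptions for this example; they are not necessarily what the lecture uses.

```python
import numpy as np

rng = np.random.default_rng(80)

# Hypothetical data: these salaries and labels are made up for illustration.
pay = np.array([52000, 61000, 58000, 75000, 49000, 66000, 70000, 83000], dtype=float)
is_female = np.array([True, True, True, False, True, False, False, False])

def diff_in_medians(pay, is_female):
    """Median pay of the female group minus median pay of the male group."""
    return np.median(pay[is_female]) - np.median(pay[~is_female])

observed = diff_in_medians(pay, is_female)

# Under the null hypothesis, gender labels are unrelated to pay, so we
# shuffle the labels and recompute the statistic many times.
n_repetitions = 2000
simulated = np.array([
    diff_in_medians(pay, rng.permutation(is_female))
    for _ in range(n_repetitions)
])

# One-sided p-value: the fraction of shuffles with a difference at least
# as extreme (as negative) as the one observed.
p_value = np.mean(simulated <= observed)
print(observed, p_value)
```

A small p-value would suggest that the observed difference is unlikely under random labeling alone; it would not by itself show that gender causes the difference.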

Next time