In [1]:
# Imports
import babypandas as bpd
import numpy as np

import plotly.express as px
import matplotlib.pyplot as plt
plt.style.use('ggplot')

Lecture 1 – Introduction¶

DSC 10, Spring 2023¶

Welcome to DSC 10! 👋¶

  • A guided tour of data science.
    • Developed by UC Berkeley in 2015.
    • Adapted by UC San Diego in 2017.
  • Learn just enough programming and statistics to do data science.
    • Statistics without too much math, mostly simulation.
    • Lays the foundation for all other courses in the DSC major.

Agenda¶

  • Course staff.
  • What is data science?
  • How will this course run?
  • Fun demo.

Course staff¶

Instructor: Suraj Rampure (call me Suraj, pronounced "sooh-rudge")¶

  • Originally from Windsor, ON, Canada 🇨🇦.
  • BS (’20) and MS (’21) in Electrical Engineering and Computer Sciences from UC Berkeley 🐻.
  • Second year teaching in the Halıcıoğlu Data Science Institute at UCSD.
    • 4th time teaching DSC 10.
    • Also running the senior capstone program and teaching DSC 95.
    • Previously taught DSC 40A, 80, and 90.
  • Outside the classroom: travelling, watching basketball, learning to cook, watching TikTok, FaceTiming my dog 🐶, etc.

TAs and tutors¶

In addition, we have several other course staff members who are here to support you in discussion, office hours, and online.

  • 1 graduate TA: Teresa Rexin.
  • 15 undergraduate tutors: Gabriel Cha, Oren Ciolli, Sophia Fang, Doris Gao, Charlie Gillet, Raine Hoang, Vanessa Hu, Anthony Li, Jasmine Lo, Zoe Ludena, Arjun Malleswaran, Gina Roberg, Abel Seyoum, Costin Smilovici, and Suhani Sharma.

Learn more about them at dsc10.com/staff.

What is "data science"? 🤔¶

Everyone seems to have their own definition of data science.

What is "data science"?¶

Data science is about drawing useful conclusions from data using computation. Throughout the quarter, we'll touch on several aspects of data science:

  • First 4 weeks: use Python to explore data.
    • Lots of visualization 📈📊 and "data manipulation", using industry-standard tools.
  • Next 4 weeks: use data to infer about a population, given just a sample.
    • Rely heavily on simulation, rather than formulas.
  • Last 2 weeks: use data from the past to predict what may happen in the future.
    • A taste of machine learning 🤖.
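
To give a feel for what "simulation, rather than formulas" can look like, here is a minimal sketch that is not part of the course materials. It estimates the chance of seeing at least one six in four rolls of a fair die by repeating the experiment many times, then compares the estimate to the formula-based answer. The function name, the number of repetitions, and the seed are assumptions chosen for illustration, and the sketch uses Python's built-in `random` module rather than the tools introduced later in the course.

```python
import random

def estimate_at_least_one_six(num_rolls=4, repetitions=100_000, seed=10):
    """Estimate P(at least one six in num_rolls rolls) by simulation."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the sketch is repeatable
    successes = 0
    for _ in range(repetitions):
        rolls = [rng.randint(1, 6) for _ in range(num_rolls)]
        if 6 in rolls:
            successes += 1
    return successes / repetitions

estimate = estimate_at_least_one_six()
exact = 1 - (5 / 6) ** 4  # the formula-based answer, for comparison
print(estimate, exact)
```

The two printed numbers should be close; more repetitions generally bring the estimate closer to the exact value.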

Data science is more relevant than ever 🤧¶

We've spent the last three years looking at graphs like this:

As of March 2023, both the New York Times and Johns Hopkins have stopped updating their COVID dashboards.

It can be fun, too!¶

The site The Pudding is home to several interactive data-rich articles.

(source)
(source)

The above map is called a choropleth. You will create a choropleth of your own this quarter on the Midterm Project!

Course logistics¶

Course website¶

The course website is your one-stop shop for all things related to the course.

dsc10.com

This is where lectures, homeworks, labs, discussions, and all other content will be posted. Check it often, and read the syllabus!

Getting set up¶

  • Ed: Q&A forum. All announcements will be made here. You should have gotten an email invitation; if not, there's a link on the syllabus.
  • Gradescope: Where you will submit all assignments for autograding, and where all of your grades will live. You should have been automatically added; contact us if not.
  • DataHub: Where you will access and run all code in this class. Access at datahub.ucsd.edu. More on Wednesday.
  • We will not be using Canvas for anything!

In addition, you must fill out the Welcome Survey.

Lecture¶

  • Lectures will be in-person and recorded for viewing afterwards.
    • You can attend either lecture section, as long as there is space for the students officially enrolled in that section.
    • Recordings can be found at podcast.ucsd.edu and on the course website.
  • Slides/code from lecture will be linked on the course website, both in a "runnable" code format and as an HTML file (✏️), which you can save as a PDF and annotate on your tablet.
  • We will try to make lectures engaging. Bring your laptop or tablet, if you have one.

Concept Check ✅ – Answer at cc.dsc10.com¶

What song was just played?

A. "Don't Look Down (feat. Usher)" by Martin Garrix

B. "Down (feat. Lil Wayne)" by Jay Sean

C. "Sky Is Falling" by Miguel

D. "Down (feat. Kanye West)" by Chris Brown

E. "Coming Down" by The Weeknd

(We are always going to use the same link for Concept Checks, so you should bookmark it.)

Discussion¶

  • Discussion sections are designed to give you practice with the conceptual ideas in the course.
    • All assignments in this class will be done on the computer using code, but the exams are on-paper and in-person.
    • For each discussion, we've prepared a problem set, made up of old exam problems (see practice.dsc10.com).
    • Problem sets are posted online, so bring a computer or tablet to access them. But like exams, you will answer the problems on paper.
    • Discussion problem sets aren't submitted anywhere.
  • Attendance is not required, but extra credit is given for attending.
    • If you do attend, you can attend either section, as long as there is space for the students officially enrolled in that section.
    • Like lectures, discussions will be podcasted.

Discussion schedule¶

There are only two discussion sections:

  • A: Wednesdays, 2-2:50PM, Center Hall 214
  • B: Wednesdays, 3-3:50PM, Center Hall 214

In the Schedule of Classes, this course has both a discussion section (DI) and a lab section (LA). We are using the listed lab section for discussion, and we are not using the discussion section listed on WebReg for anything. In short, ignore what WebReg says and follow the information above.

Discussion starts this Wednesday. Discussion 1 will be focused on how to use DataHub, rather than on problem solving.

Labs¶

  • Labs refer to lab assignments, which are a required part of the course and help you develop fluency in Python and working with data.
  • While working on labs, you'll be able to run autograder tests which tell you if your answers are correct.
    • For labs, if you pass all autograder tests, you will get 100%!
  • You must submit labs individually, but you can discuss ideas with others (no sharing code).
  • Labs are usually due on Saturdays at 11:59PM and submitted to Gradescope. The first lab (due Tuesday, April 11th) will have submission instructions.

Homeworks and projects¶

  • Weekly homework assignments build on the skills you develop in labs.
  • A key difference between homeworks and labs is that passing autograder tests does not guarantee a perfect score!
    • In homeworks, we have "hidden tests" that are only run after you submit the assignment.
    • The tests that are available to you within the assignment itself only verify that your answer is reasonable/on the right track.
  • Again, you must work on homeworks yourself, but you can discuss ideas with other students (no sharing code).
  • Homeworks are usually due on Tuesdays at 11:59PM and submitted to Gradescope.
  • In the Midterm Project and Final Project, you will do a deep dive into a dataset! Projects are longer than homeworks, so we give you more time to work on them.
    • This quarter's projects: UCSD Admissions 💯 and Meteorite Landings ☄️.
    • You can work on projects with partners, following these project partner guidelines. Both of you should actively contribute to all parts of the project.

Exams¶

We will have two exams this quarter.

  • Midterm Exam: Friday, May 5th, during your scheduled lecture time.
  • Final Exam: Saturday, June 10th, 7-10PM.
  • Both exams will be conducted in person and on paper. If you have a conflict, let us know in the Welcome Survey.

Readings and resources¶

  • We will draw readings from two sources. Readings for each lecture will be posted on the course homepage.
    • Computational and Inferential Thinking (CIT), the textbook created for Berkeley's version of this course.
    • babypandas notes, written specifically for the first part of DSC 10.
  • The Resources tab of the course website contains links to helpful resources that you'll want to use throughout the course (e.g. DSC 10 Reference Sheet, programming tutorials, supplemental videos).
  • The Debugging tab of the course website has answers to many common technical issues.

A typical week in DSC 10¶

Sunday      | Monday  | Tuesday | Wednesday           | Thursday | Friday  | Saturday
Nothing! 😎 | Lecture | HW due  | Lecture, Discussion |          | Lecture | Lab due

See the Syllabus for more details.

First assignment¶

  • Lab 0 is due Tuesday, April 11th at 11:59PM.
    • Will be released tomorrow.
  • 🚨 Important: Start early and submit often.

Getting help¶

This is a tough, fast-paced course, but we're here to help you – here's how:

  • Office Hours (OH).
    • Not held in an office – rather, held in a large open study space (San Diego Supercomputer Center, 2nd floor).
    • Come with questions, or just to work!
    • See the schedule and instructions on the Calendar 📆.
    • There are a few remote office hours too, but we encourage you to come in-person.
  • Ed.
    • Post here with any logistical or conceptual questions (please don't email).
    • No code or solutions in public posts. Such posts should be private to course staff.
    • Otherwise, post publicly (anonymously, if you'd like).
  • 🚨 Important: Use these to your advantage!

Advice from previous students¶

At the end of each quarter, we ask DSC 10 students to give advice to future students in the course. Here are some responses from Winter 2023:

Start the assignments early, every time that I started an assignment the day or even night of, I always struggled and the added pressure of not getting it in on time didn't help me one bit. The times that I started a day or two in advance, even if it was just completing a couple problems in advance, I felt way more relaxed and in turn I learned and retained a lot more.

Pay attention in lectures and to begin both labs and homework early because they will pile up. The lectures are very helpful references to use if you’re stuck during labs and homework’s and office hours are incredibly useful so go!!!

Use TA's and office hours as much as possible, also the reference sheet was crucial.

Office hours are really helpful, all the tutors knew what they were doing and could were able to help me work through any of the problems I got stuck on

Collaboration¶

Asking questions is highly encouraged!¶

  • Discuss all questions with each other (except exams).
  • Submit lab assignments individually, but you can work with others (no sharing code).
  • Submit homeworks individually, but you can discuss problem-solving strategies with others (no sharing code).
  • Submit projects individually or in pairs.

The limits of collaboration:¶

  • Don't share solutions with each other or look at someone’s code.
  • Project partners should both contribute to all parts of the project. Don't split up the project.
  • Don't use ChatGPT or GitHub Copilot – all work you submit should be written by you.
  • Academic integrity violations usually result in failing the course.

We're here for you!¶

Regardless of your background, you can succeed in this course. No prior programming or statistics experience will be assumed!

Watch on YouTube: We’re All Data Scientists | Rebecca Nugent | TEDxCMU.

Campus resources¶

Counseling and Psychological Services (CAPS) is a campus unit that offers “short term counseling for academic, career, and personal issues and also offers psychiatry services for circumstances when medication can help with counseling.” If you or anyone you know is ever in need of mental health care, you should contact CAPS.

caps.ucsd.edu

Opportunity: Python bootcamp¶

Diversity in Data Science, a student organization, is running a one-week Python bootcamp specifically for students in DSC 10 with no prior programming experience. The bootcamp is this week. Sign up here.


Demo¶

Little Women (1868)¶

  • Little Women, by Louisa May Alcott, is a novel that follows the lives of four sisters – Meg, Jo, Beth, and Amy.
    • A movie based on the novel was released in 2019, starring Emma Watson (Meg) and Timothée Chalamet (Laurie).
  • Using tools from this class, we'll learn (a bit) about the plot of the book, without reading it.
  • Do not worry about any of this code – we'll cover the necessary pieces in the weeks to come. Sit back and relax!
In [2]:
# Read in 'lw.txt' to a variable called little_women_text.
little_women_text = open('data/lw.txt').read()
In [3]:
# See the first three thousand characters.
little_women_text[:3000]
Out[3]:
'The Project Gutenberg EBook of Little Women, by Louisa May Alcott\n\nThis eBook is for the use of anyone anywhere at no cost and with\nalmost no restrictions whatsoever.  You may copy it, give it away or\nre-use it under the terms of the Project Gutenberg License included\nwith this eBook or online at www.gutenberg.net\n\n\nTitle: Little Women\n\nAuthor: Louisa May Alcott\n\nPosting Date: September 13, 2008 [EBook #514]\nRelease Date: May, 1996\n[This file last updated on August 19, 2010]\n\nLanguage: English\n\n\n*** START OF THIS PROJECT GUTENBERG EBOOK LITTLE WOMEN ***\n\n\n\n\nLITTLE WOMEN\n\n\nby\n\nLouisa May Alcott\n\n\n\n\nCONTENTS\n\n\nPART 1\n\n          ONE  PLAYING PILGRIMS\n          TWO  A MERRY CHRISTMAS\n        THREE  THE LAURENCE BOY\n         FOUR  BURDENS\n         FIVE  BEING NEIGHBORLY\n          SIX  BETH FINDS THE PALACE BEAUTIFUL\n        SEVEN  AMY\'S VALLEY OF HUMILIATION\n        EIGHT  JO MEETS APOLLYON\n         NINE  MEG GOES TO VANITY FAIR\n          TEN  THE P.C. 
AND P.O.\n       ELEVEN  EXPERIMENTS\n       TWELVE  CAMP LAURENCE\n     THIRTEEN  CASTLES IN THE AIR\n     FOURTEEN  SECRETS\n      FIFTEEN  A TELEGRAM\n      SIXTEEN  LETTERS\n    SEVENTEEN  LITTLE FAITHFUL\n     EIGHTEEN  DARK DAYS\n     NINETEEN  AMY\'S WILL\n       TWENTY  CONFIDENTIAL\n   TWENTY-ONE  LAURIE MAKES MISCHIEF, AND JO MAKES PEACE\n   TWENTY-TWO  PLEASANT MEADOWS\n TWENTY-THREE  AUNT MARCH SETTLES THE QUESTION\n\n\nPART 2\n\n  TWENTY-FOUR  GOSSIP\n  TWENTY-FIVE  THE FIRST WEDDING\n   TWENTY-SIX  ARTISTIC ATTEMPTS\n TWENTY-SEVEN  LITERARY LESSONS\n TWENTY-EIGHT  DOMESTIC EXPERIENCES\n  TWENTY-NINE  CALLS\n       THIRTY  CONSEQUENCES\n   THIRTY-ONE  OUR FOREIGN CORRESPONDENT\n   THIRTY-TWO  TENDER TROUBLES\n THIRTY-THREE  JO\'S JOURNAL\n  THIRTY-FOUR  FRIEND\n  THIRTY-FIVE  HEARTACHE\n   THIRTY-SIX  BETH\'S SECRET\n THIRTY-SEVEN  NEW IMPRESSIONS\n THIRTY-EIGHT  ON THE SHELF\n  THIRTY-NINE  LAZY LAURENCE\n        FORTY  THE VALLEY OF THE SHADOW\n    FORTY-ONE  LEARNING TO FORGET\n    FORTY-TWO  ALL ALONE\n  FORTY-THREE  SURPRISES\n   FORTY-FOUR  MY LORD AND LADY\n   FORTY-FIVE  DAISY AND DEMI\n    FORTY-SIX  UNDER THE UMBRELLA\n  FORTY-SEVEN  HARVEST TIME\n\n\n\nCHAPTER ONE\n\nPLAYING PILGRIMS\n\n"Christmas won\'t be Christmas without any presents," grumbled Jo, lying\non the rug.\n\n"It\'s so dreadful to be poor!" sighed Meg, looking down at her old\ndress.\n\n"I don\'t think it\'s fair for some girls to have plenty of pretty\nthings, and other girls nothing at all," added little Amy, with an\ninjured sniff.\n\n"We\'ve got Father and Mother, and each other," said Beth contentedly\nfrom her corner.\n\nThe four young faces on which the firelight shone brightened at the\ncheerful words, but darkened again as Jo said sadly, "We haven\'t got\nFather, and shall not have him for a long time." 
She didn\'t say\n"perhaps never," but each silently added it, thinking of Father far\naway, where the fighting was.\n\nNobody spoke for a minute; then Meg said in an altered tone, "You know\nthe reason Mother proposed not having any presents this Christmas was\nbecause it is going to b'
In [4]:
# Print the first three thousand characters.
print(little_women_text[:3000])
The Project Gutenberg EBook of Little Women, by Louisa May Alcott

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.net


Title: Little Women

Author: Louisa May Alcott

Posting Date: September 13, 2008 [EBook #514]
Release Date: May, 1996
[This file last updated on August 19, 2010]

Language: English


*** START OF THIS PROJECT GUTENBERG EBOOK LITTLE WOMEN ***




LITTLE WOMEN


by

Louisa May Alcott




CONTENTS


PART 1

          ONE  PLAYING PILGRIMS
          TWO  A MERRY CHRISTMAS
        THREE  THE LAURENCE BOY
         FOUR  BURDENS
         FIVE  BEING NEIGHBORLY
          SIX  BETH FINDS THE PALACE BEAUTIFUL
        SEVEN  AMY'S VALLEY OF HUMILIATION
        EIGHT  JO MEETS APOLLYON
         NINE  MEG GOES TO VANITY FAIR
          TEN  THE P.C. AND P.O.
       ELEVEN  EXPERIMENTS
       TWELVE  CAMP LAURENCE
     THIRTEEN  CASTLES IN THE AIR
     FOURTEEN  SECRETS
      FIFTEEN  A TELEGRAM
      SIXTEEN  LETTERS
    SEVENTEEN  LITTLE FAITHFUL
     EIGHTEEN  DARK DAYS
     NINETEEN  AMY'S WILL
       TWENTY  CONFIDENTIAL
   TWENTY-ONE  LAURIE MAKES MISCHIEF, AND JO MAKES PEACE
   TWENTY-TWO  PLEASANT MEADOWS
 TWENTY-THREE  AUNT MARCH SETTLES THE QUESTION


PART 2

  TWENTY-FOUR  GOSSIP
  TWENTY-FIVE  THE FIRST WEDDING
   TWENTY-SIX  ARTISTIC ATTEMPTS
 TWENTY-SEVEN  LITERARY LESSONS
 TWENTY-EIGHT  DOMESTIC EXPERIENCES
  TWENTY-NINE  CALLS
       THIRTY  CONSEQUENCES
   THIRTY-ONE  OUR FOREIGN CORRESPONDENT
   THIRTY-TWO  TENDER TROUBLES
 THIRTY-THREE  JO'S JOURNAL
  THIRTY-FOUR  FRIEND
  THIRTY-FIVE  HEARTACHE
   THIRTY-SIX  BETH'S SECRET
 THIRTY-SEVEN  NEW IMPRESSIONS
 THIRTY-EIGHT  ON THE SHELF
  THIRTY-NINE  LAZY LAURENCE
        FORTY  THE VALLEY OF THE SHADOW
    FORTY-ONE  LEARNING TO FORGET
    FORTY-TWO  ALL ALONE
  FORTY-THREE  SURPRISES
   FORTY-FOUR  MY LORD AND LADY
   FORTY-FIVE  DAISY AND DEMI
    FORTY-SIX  UNDER THE UMBRELLA
  FORTY-SEVEN  HARVEST TIME



CHAPTER ONE

PLAYING PILGRIMS

"Christmas won't be Christmas without any presents," grumbled Jo, lying
on the rug.

"It's so dreadful to be poor!" sighed Meg, looking down at her old
dress.

"I don't think it's fair for some girls to have plenty of pretty
things, and other girls nothing at all," added little Amy, with an
injured sniff.

"We've got Father and Mother, and each other," said Beth contentedly
from her corner.

The four young faces on which the firelight shone brightened at the
cheerful words, but darkened again as Jo said sadly, "We haven't got
Father, and shall not have him for a long time." She didn't say
"perhaps never," but each silently added it, thinking of Father far
away, where the fighting was.

Nobody spoke for a minute; then Meg said in an altered tone, "You know
the reason Mother proposed not having any presents this Christmas was
because it is going to b
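
The two cells above show the same 3,000 characters in two ways. Ending a cell with the bare expression displays the string's `repr`, wrapped in quotes with each newline written as `\n`, while `print` renders each newline as an actual line break. Here is a small sketch of the difference, using a short string modeled on the title page (`sample` is made up for illustration and is not read from `lw.txt`):

```python
sample = 'LITTLE WOMEN\n\nby\n\nLouisa May Alcott'

# What a notebook shows when a cell ends with the bare expression.
as_displayed = repr(sample)

# What print() writes: the characters themselves, one line per '\n'.
as_printed_lines = sample.splitlines()

print(as_displayed)
print(sample)
```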
In [5]:
# Create a variable "chapters" by splitting the text on 'CHAPTER '.
chapters = little_women_text.split('CHAPTER ') 

# Create a DataFrame with one column - the text of each chapter.
bpd.DataFrame().assign(chapters=chapters)
Out[5]:
chapters
0 The Project Gutenberg EBook of Little Women, b...
1 ONE\n\nPLAYING PILGRIMS\n\n"Christmas won't be...
2 TWO\n\nA MERRY CHRISTMAS\n\nJo was the first t...
3 THREE\n\nTHE LAURENCE BOY\n\n"Jo! Jo! Where ...
4 FOUR\n\nBURDENS\n\n"Oh, dear, how hard it does...
... ...
43 FORTY-THREE\n\nSURPRISES\n\nJo was alone in th...
44 FORTY-FOUR\n\nMY LORD AND LADY\n\n"Please, Mad...
45 FORTY-FIVE\n\nDAISY AND DEMI\n\nI cannot feel ...
46 FORTY-SIX\n\nUNDER THE UMBRELLA\n\nWhile Lauri...
47 FORTY-SEVEN\n\nHARVEST TIME\n\nFor a year Jo a...

48 rows × 1 columns
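
The DataFrame has 48 rows even though the table of contents lists 47 chapters. That is because `str.split` also returns whatever comes before the first separator; here, that is the Project Gutenberg front matter in row 0. A minimal sketch on a made-up miniature text (`tiny_book` is an assumption for illustration, not the real file):

```python
tiny_book = 'Front matter\nCHAPTER ONE\nJo spoke.\nCHAPTER TWO\nMeg replied.'

pieces = tiny_book.split('CHAPTER ')

# pieces[0] is the text before the first 'CHAPTER ', not a chapter.
front_matter = pieces[0]
chapter_texts = pieces[1:]

print(len(pieces))         # number of pieces, including the front matter
print(len(chapter_texts))  # number of actual chapters
```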

In [6]:
# Number of occurrences of each name in each chapter.

counts = bpd.DataFrame().assign(
    Amy=np.char.count(chapters, 'Amy'),
    Beth=np.char.count(chapters, 'Beth'),
    Jo=np.char.count(chapters, 'Jo'),
    Meg=np.char.count(chapters, 'Meg'),
    Laurie=np.char.count(chapters, 'Laurie'),
)
counts
Out[6]:
Amy Beth Jo Meg Laurie
0 0 0 0 0 0
1 23 26 44 26 0
2 13 12 21 20 0
3 2 2 62 36 16
4 14 18 34 17 0
... ... ... ... ... ...
43 31 8 61 3 29
44 13 0 9 0 10
45 1 2 6 2 0
46 2 1 56 4 2
47 10 3 37 6 13

48 rows × 5 columns
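
One caveat about the counts above: `np.char.count` counts occurrences of a substring, so `'Jo'` also matches the start of longer words such as `'John'`. If whole-word counts were wanted instead, one possible approach is a regular expression with word boundaries. The helper `count_whole_word` and the sentence below are hypothetical, written only for this illustration; this is a sketch, not the method used in the demo.

```python
import re

def count_whole_word(text, name):
    """Count occurrences of name that are not part of a longer word."""
    pattern = r'\b' + re.escape(name) + r'\b'
    return len(re.findall(pattern, text))

sentence = 'Jo laughed, and John told Jo that Josephine was her full name.'

# Substring counting, the same idea np.char.count applies to each chapter.
substring_count = sentence.count('Jo')

# Whole-word counting ignores 'John' and 'Josephine'.
whole_word_count = count_whole_word(sentence, 'Jo')

print(substring_count, whole_word_count)
```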

In [7]:
# Cumulative number of times each name appears.

cumulative_counts = bpd.DataFrame().assign(
    Amy=np.cumsum(counts.get('Amy')),
    Beth=np.cumsum(counts.get('Beth')),
    Jo=np.cumsum(counts.get('Jo')),
    Meg=np.cumsum(counts.get('Meg')),
    Laurie=np.cumsum(counts.get('Laurie')),
    # Row 0 is the front matter, so each label is the book's chapter number plus 1.
    Chapter=np.arange(1, 49, 1)
)

cumulative_counts
Out[7]:
Amy Beth Jo Meg Laurie Chapter
0 0 0 0 0 0 1
1 23 26 44 26 0 2
2 36 38 65 46 0 3
3 38 40 127 82 16 4
4 52 58 161 99 16 5
... ... ... ... ... ... ...
43 619 459 1435 673 571 44
44 632 459 1444 673 581 45
45 633 461 1450 675 581 46
46 635 462 1506 679 583 47
47 645 465 1543 685 596 48

48 rows × 6 columns
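
A cumulative sum is a running total: each entry is the sum of all values up to and including that row. The sketch below reproduces the first five `Amy` values shown above, using Python's built-in `itertools.accumulate` as a plain-Python stand-in for `np.cumsum`; the variable names are chosen for illustration.

```python
from itertools import accumulate

# Per-chapter counts for 'Amy', copied from the first five rows of `counts`.
amy_counts = [0, 23, 13, 2, 14]

# Running total: each entry is the sum of all counts up to that row.
# np.cumsum(amy_counts) computes the same values as a NumPy array.
amy_running_total = list(accumulate(amy_counts))

print(amy_running_total)
```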

In [8]:
# Putting it all together, we get a helpful visualization.
# Reshape to long format: one row per (name, chapter) pair.
cumulative_counts_df = (cumulative_counts.drop(columns=['Chapter']).to_df()
                        .melt().rename(columns={'variable': 'name', 'value': 'Count'}))
cumulative_counts_df = cumulative_counts_df.assign(Chapter=list(range(1, 49)) * 5)
px.line(cumulative_counts_df, x='Chapter', y='Count', color='name', width=900, height=600,
        title='Cumulative Number of Times Each Name Appears', template='ggplot2')
  • In Chapter 32, Jo moves to New York alone. Her relationship with which sister suffers the most from this faraway move?
  • Laurie is a man who marries one of the sisters at the end. Which one?
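
Before drawing the line plot, the plotting cell above reshapes `cumulative_counts` from one column per name into one row per name and chapter (the `melt` step), so that `px.line` can draw one colored line per name. Here is a plain-Python sketch of that reshaping on a tiny slice of the data; `wide`, `long_rows`, and `chapter_labels` are hypothetical names, and the real cell uses the DataFrame's `melt` method rather than this loop.

```python
# Two name columns and three rows, copied from the cumulative counts above.
wide = {'Amy': [0, 23, 36], 'Jo': [0, 44, 65]}

# Stack the columns: one (name, Count) row per original cell, column by column.
long_rows = [
    {'name': name, 'Count': value}
    for name, values in wide.items()
    for value in values
]

# Repeat the chapter labels once per name (the cell above uses * 5 for five names).
chapter_labels = [1, 2, 3] * len(wide)
for row, chapter in zip(long_rows, chapter_labels):
    row['Chapter'] = chapter

for row in long_rows:
    print(row)
```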

Next time¶

On Wednesday, we'll start programming in Python 🐍. Remember to bring a laptop or tablet if you have one.

Discussion sections start on Wednesday as well.