
Overview

This course covers the principles of computing systems and tools for scaling data analytics to large datasets. Scalable analytics systems are a central part of modern data science in numerous application domains spanning enterprise business intelligence, Web search, e-commerce, social media, natural and social sciences, healthcare, digital humanities, e-governance, Internet of Things, and more.

Topics include basics of computer organization, memory hierarchy, operating systems, and cloud computing; principles of scalable and parallel data-intensive computing; design and use of parallel dataflow systems (MapReduce/Hadoop and Spark); and scaling of end-to-end machine learning (ML) workloads. It will cover how relational algebra, SQL, linear algebra, and more general dataflow operations in such systems can be used to perform data preparation and feature engineering for ML at scale, how to scale ML model building, and how to handle data heterogeneity.

A major component of this course is hands-on Python programming to implement data exploration, data preparation, and model selection pipelines on large real-world data using scalable analytics tools and cloud resources, both Amazon Web Services (AWS) public cloud and SDSC’s private cloud.

 

Course Format

  • The class meets 2 times a week for 80-minute lectures in person.
    • All lectures will be automatically podcast here afterward.
    • Attending the lectures is not mandatory, but the Peer Instruction activities, which involve discussing questions with your peers, take place in class only (details below). There will be other interactive activities as well.
    • We will use Campuswire for asynchronous discussions and questions.
  • Three Programming Assignments (PAs).
    • See the PAs page for the PA schedule and details.
    • There are no late days for the PAs. Plan your work accordingly.
  • 6 in-class activities via GradeScope.
    • They will be held in class using GradeScope, spread randomly across the quarter.
    • Each activity will have 2 multiple-choice questions (MCQs). Some problems may be quantitative, but you will only need to select the final answer. There is no partial credit.
    • For each question, you must first answer individually. Then you can discuss the question with your neighbor(s). After that, you can answer the question again.
    • These activities are open books/notes/electronics/Web.
    • Grading is based on earnest participation in the whole activity.
    • If you miss an activity, you can still submit on the same day by 11:59pm PST to recoup up to 60% of the score for that activity, unless specified otherwise on Campuswire.
    • You can miss up to 2 activities out of the 6 without losing credit.
    • If you complete all of the activities, we will use the best 4 scores.
  • Midterm exam and cumulative final exam
    • The midterm exam will be held in person only. The final exam will be held in person as well. The dates and logistics are listed below.
    • The exams will consist primarily of multiple-choice questions. There will be quantitative/longer problems, but you may only need to select one final answer. Some questions will have partial credit.
    • The guideline for time per question is a max of 1 minute per point. The points of each question will be calibrated accordingly.
    • If you miss an exam, you will get no credit for it, unless you notify the instructor in advance with a university approved reason and receive a makeup exam slot.
    • The midterm and final are closed books/notes/electronics/web. You are allowed to keep with you two A4-sized sheets (four sides) with any content you want for the midterm, and four A4-sized sheets (eight sides) for the final. You may also use pocket calculators (i.e. the ones that don’t have non-volatile memory) as there will be some light math involved.
  • There will be three extra credit assignments delivered via Canvas only. I will announce more details on these in due course.

  • The discussion slots will be used by the TAs to give talks about the PAs. We may also use them to review discussions before the two exams.
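
The in-class activity rules above (six activities, the best four counted, same-day late submissions recouping up to 60%) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the official grading script: the function name `activity_component` is hypothetical, each activity is assumed to be scored as a fraction between 0 and 1, a late submission is assumed to earn 60% of what it would otherwise earn, and the counted scores are assumed to be averaged.

```python
def activity_component(activities, counted=4, late_factor=0.6):
    """Return the fraction of the activity credit earned (0.0 to 1.0).

    activities: list of (score, late) pairs, one per activity, where score
    is a fraction in [0, 1] and late is True for a same-day late submission.
    A missed activity is recorded as (0.0, False).

    Assumptions (not stated in the syllabus): late work is scaled by
    late_factor, and the best `counted` scores are averaged.
    """
    # Scale late submissions down to at most 60% of their score.
    effective = [score * late_factor if late else score
                 for score, late in activities]
    # Keep only the best `counted` scores, so two misses cost nothing.
    best = sorted(effective, reverse=True)[:counted]
    return sum(best) / counted


# Four on-time activities and two missed ones still earn full credit.
full = activity_component([(1.0, False)] * 4 + [(0.0, False)] * 2)

# Three on-time, one late, two missed: (1 + 1 + 1 + 0.6) / 4.
one_late = activity_component(
    [(1.0, False), (1.0, False), (1.0, False), (1.0, True),
     (0.0, False), (0.0, False)]
)
```

Under these assumptions, missing two activities costs nothing, while replacing one of the four counted activities with a late submission lowers the component to 90% of its weight.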

 

Prerequisites

  • DSC 100 (Introduction to Data Management); or substantial practical experience with scalable data systems and ML, subject to the consent of the instructor.

  • Proficiency in Python programming.

 

Suggested Textbooks

  • Computer Organization and Design (5th edition), by David Patterson and John Hennessy (aka the “CompOrg Book”).

  • Operating Systems: Three Easy Pieces, by Remzi and Andrea Arpaci-Dusseau (aka the “Comet Book”).

  • Database Management Systems (3rd edition), by Raghu Ramakrishnan and Johannes Gehrke (aka the “Cow Book”).

  • Spark: The Definitive Guide (1st edition), by Bill Chambers and Matei Zaharia (aka the “Spark Book”).

  • Data Management in Machine Learning Systems, by Matthias Boehm, Arun Kumar, and Jun Yang (aka the “MLSys Book”).

 

Grading Components

  • midterm exam: 15%
  • programming assignments: 8% + 16% + 16%
  • in-class peer instruction activities: 10%
  • cumulative final: 35%
  • extra credit: 4% (likely)
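
As a rough illustration of how the components above combine into a total score, here is a minimal sketch. It assumes that each component is reported as a percentage (0 to 100), that the three PA weights apply to the PAs in the order listed (the keys `pa1`, `pa2`, `pa3` are hypothetical names), and that extra credit is added on top of the 100 points and capped at 4. None of these details are confirmed by the syllabus.

```python
# Component weights in percentage points, copied from the list above.
WEIGHTS = {
    "midterm": 15,
    "pa1": 8,       # assumed order of the three PA weights
    "pa2": 16,
    "pa3": 16,
    "activities": 10,
    "final": 35,
}


def total_score(percent_scores, extra_credit=0.0, extra_cap=4.0):
    """Weighted total out of 100, plus capped extra credit.

    percent_scores maps each component name to a score in [0, 100].
    extra_credit is in percentage points; the cap of 4 is the "likely"
    value from the syllabus and is treated here as an assumption.
    """
    base = sum(weight * percent_scores[name] / 100
               for name, weight in WEIGHTS.items())
    return base + min(extra_credit, extra_cap)


example = total_score(
    {"midterm": 80, "pa1": 90, "pa2": 90, "pa3": 90,
     "activities": 100, "final": 70}
)
# 12 + 7.2 + 14.4 + 14.4 + 10 + 24.5 = 82.5
```

The regular components sum to 100 points, so extra credit can only raise the total above what the exams, PAs, and activities provide.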

 

Grading Cutoffs

The grading scheme is a hybrid of absolute and relative grading. The absolute cutoffs are based on your absolute total score. The relative bins are based on your position in the total score distribution of the class. The better grade among the two (absolute-based and relative-based) will be your final grade.

Grade | Absolute Cutoff (>=) | Relative Bin (Use strictest)
A+    | 95                   | Highest 5%
A     | 90                   | Next 10% (5-15)
A-    | 85                   | Next 15% (15-30)
B+    | 80                   | Next 15% (30-45)
B     | 75                   | Next 15% (45-60)
B-    | 70                   | Next 15% (60-75)
C+    | 65                   | Next 5% (75-80)
C     | 60                   | Next 5% (80-85)
C-    | 55                   | Next 5% (85-90)
D     | 50                   | Next 5% (90-95)
F     | <50                  | Lowest 5%

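Example: Suppose your total score is 82 and you are at the 33rd percentile of the class, i.e., 67% of the way down from the top of the score distribution. Your position falls in the 60-75 bin, so the relative grade is B-. Your score is at least 80 but below 85, so the absolute grade is B+. The final grade is the better of the two: B+.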
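
The hybrid scheme can also be expressed as a short sketch that reproduces the example above. The function names are hypothetical. The code assumes that the percentile is counted from the bottom of the class (so the 33rd percentile means 67% from the top) and that a position exactly on a bin boundary gets the higher grade. The table's "Use strictest" note may mean the opposite, so treat the boundary handling as a guess.

```python
# Grades from best to worst, with cutoffs copied from the table above.
ORDER = ["A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D", "F"]
ABSOLUTE_CUTOFFS = [95, 90, 85, 80, 75, 70, 65, 60, 55, 50]  # below 50 is F
RELATIVE_UPPER = [5, 15, 30, 45, 60, 75, 80, 85, 90, 95]     # beyond 95 is F


def absolute_grade(total_score):
    """Grade from the absolute cutoffs (score >= cutoff)."""
    for grade, cutoff in zip(ORDER, ABSOLUTE_CUTOFFS):
        if total_score >= cutoff:
            return grade
    return "F"


def relative_grade(percentile):
    """Grade from the relative bins.

    Assumption: percentile is measured from the bottom of the class, so
    the position from the top is 100 - percentile. Boundary positions are
    given the higher grade here, which is also an assumption.
    """
    from_top = 100 - percentile
    for grade, upper in zip(ORDER, RELATIVE_UPPER):
        if from_top <= upper:
            return grade
    return "F"


def final_grade(total_score, percentile):
    """The better of the absolute and relative grades."""
    candidates = [absolute_grade(total_score), relative_grade(percentile)]
    return min(candidates, key=ORDER.index)


# The syllabus example: score 82 at the 33rd percentile.
example = final_grade(82, 33)  # absolute B+, relative B-, final B+
```

Because the final grade is the better of the two, the relative bins can only raise a grade above what the absolute cutoffs give, never lower it.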

Non-Letter Grade Options: You have the option of taking this course for a non-letter grade. A P in the P/F option requires a letter grade of C- or better; an S in the S/U option requires a letter grade of B- or better.

 

Exam Dates and Format

Midterm Exam: Thursday, 5/11/2023, in class

Cumulative Final Exam: Thursday, 6/15/2023, 8:00am-10:59am; in person

Classroom Rules

  • No late days for submitting the PAs. No extensions on the final exam time window. Plan all your work well in advance.

  • Students are encouraged to ask questions and participate in discussions in class and on Campuswire. Please raise your hand before speaking, and the instructor will call on you.

  • Please review UCSD’s honor code and policies and procedures on academic integrity here. If plagiarism is detected in your code, or if we detect collusion on the graded quizzes or exams, or if any other form of academic integrity violation is identified, you will get zero for that component of your score and get downgraded substantially. I will also notify the University authorities for appropriate disciplinary action to be taken, up to and including expulsion from the University.

  • Please review UCSD’s principles of community and our commitment to creating an inclusive learning environment on this website.

  • Harassment, discrimination, or intimidation of any form against any student will not be tolerated in class or on Campuswire. Please review UCSD’s policies on dealing with harassment and discrimination on this website.