from dsc80_utils import *
Announcements 📣¶
- Last day for all redemptions is tomorrow.
- If at least 80% of the class fills out both SETs and the End-of-Quarter Survey by tomorrow, June 7th at 11:59PM, then everyone will earn an extra 2% on the Final Exam.
- The Final Exam is on Saturday, June 8th from 8AM-11AM in CENTER 216.
- Practice by working through old exams at practice.dsc80.com.
- You can bring two double-sided notes sheets (you can bring your midterm notes sheet, if you want).
- Check Ed for more details.
- The Final Project is due on Wednesday, June 12th.
- No slip days allowed!
Building a Data Science Career¶
Do Grades Matter?¶
- Not as much as you probably think.
What you might imagine:
- Student works hard to get As in their DSC classes.
- High GPA looks good on resume.
- Resume leads to interview, student aces interview.
- Good job!
But this is what I actually see:
- Student works hard to get As in ALL OF THEIR classes and takes a minor / double major.
- High GPA looks decent on resume, but lots of students have high GPAs.
- Resume doesn't stand out very much, and student has trouble finding a job after graduation.
An Alternate Route¶
- Student takes fewer classes in order to do AMAZING work in one class (e.g. really interested in data visualization).
- Professor: Wow, that was the best class work in YEARS, you should totally join my research group!
- Student works with PhD students and professor on cutting-edge research, years ahead of what's available in industry.
- Student's resume is packed with interesting projects that VERY FEW other students have, leading to an AWESOME JOB after graduation.
- Or, student applies to grad school and their professor can write them a super good letter.
There are many ways to be excellent¶
- Take more classes, get good grades
- Highly structured -- professor gives you problems.
Less-structured efforts where you pick the problem to work on:
- Go beyond the requirements in a course you really like.
- Help redesign a course.
- Conduct an independent study.
- Do research.
- Create a startup / non-profit.
Higher risk of failure, but you'll learn a lot more than if you just take classes!
An Email¶
Hey Sam, I'm glad you're talking about this. When I look at resumes, I skip past the GPA and courses and look at their project-based courses and what they did in those courses. Bonus points if they have independent projects, and if there's a link to their project on their resume, I will always click on it.
My Advice for People Starting Out¶
- Next quarter, take fewer courses, and be okay with getting Bs in some courses if it means you get to invest lots of time into the one course you really like.
- Really stand out in one course you like. Here are some examples:
- Go 2x above the project requirements, then talk to your professor about your work.
- Ask lots of good questions after every lecture.
- Come to office hours regularly with questions about topics beyond the course.
- At the end of the quarter, ask your professor if you can help with their research group.
In Other Words...¶
There are "grades" you can get that are (much, much) higher than A's!
(From Dave Eckhardt)
Let's Be Pragmatic¶
- Yes, you need to work hard to develop strong technical skills!
- But the best jobs come through people who know you, not from submitting your resume into a pool of thousands of applicants.
- You should spend time thinking about how you can make use of being at UCSD to grow your network.
- PhD students and professors are a good place to start, but not the only way.
How to cold-email and actually get responses¶
- Pick a topic of interest.
- Do an independent project around that topic.
- Small and punchy is better than big! See some of Simon Willison's posts for examples.
- Put your project online as a publicly viewable webpage (like what you're doing for your final project!).
- Write an email that looks like this:
Hi {name},
My name is {name} and I'm an undergrad at UCSD studying {major}.
I'm really interested in learning more about your research in {topic} since I'm also working on projects in that area. For example, I wrote about my latest work on {my project} here: {url}.
Would you have some time in the next few weeks for a 30-min chat?
Key points:
- Email PhD students, not professors!
- Most professors I know get lots of low-quality cold emails every day, so many of them just ignore them. But PhD students are less famous, so they'll be happy that someone is interested in their work.
- Keep it short. If they can't read it in 10 seconds, it's too long!
- Share your URL. That shows that you actually have interest in the topic.
- Ask for a 30-min chat (not 1 hour). Again, people are busy!
- If they agree to meet, don't ask to join their lab right away. Talk to them about their work, then at the end, say something like, "Thanks again for taking the time to meet with me! This work fits really well with my interests, so I was wondering whether there might be opportunities to work with you as part of your research group."
- If they say yes, great! If they say no, ask them for other people they know who might also be a good fit.
Shifting your mentality¶
- What you want to avoid:
- "Please give me a research position, I'll do anything."
- Even though this might be truly how you feel!
- What you want to think instead:
- I'm a rising professional with lots of highly valuable skills.
- I also have specific interests.
- I'm looking for the right position that aligns with my interests so that I can contribute in a productive and meaningful way.
Final Exam Review¶
Okay, enough with the armchair advice, let's review for the final!
Course Topics¶
- Working with pandas
- Exploring and cleaning data
- Hypothesis and permutation testing
- Missingness and imputation
- HTTP and HTML
- Regular expressions
- Text features
- Linear regression
- Feature engineering
- Generalization and cross-validation
- Decision trees
- Classifier evaluation and fairness
Imputation¶
2, 100, 100, 2, 50, NA, NA
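As a rough illustration of two imputation strategies on the values above, here is a minimal sketch using pandas. The choice of strategies (mean imputation and probabilistic imputation) and the seed are assumptions for the example; the notes above only list the values.

```python
import numpy as np
import pandas as pd

# The values from the example above; NA is represented as np.nan.
values = pd.Series([2, 100, 100, 2, 50, np.nan, np.nan])
observed = values.dropna()

# Mean imputation: every missing value becomes the mean of the observed values.
mean_imputed = values.fillna(observed.mean())

# Probabilistic imputation: each missing value is drawn at random
# (with replacement) from the observed values.
rng = np.random.default_rng(80)  # arbitrary seed, only for reproducibility
n_missing = values.isna().sum()
prob_imputed = values.copy()
prob_imputed[values.isna()] = rng.choice(observed.to_numpy(), size=n_missing)

print(mean_imputed.tolist())
print(prob_imputed.tolist())
```

Mean imputation keeps the mean of the column unchanged but shrinks its variance, since the filled-in values all sit at the center. Probabilistic imputation tends to preserve the shape of the observed distribution instead.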
Hypothesis Testing¶
Null hypothesis (H0): A and B have the same distribution.
Alternative hypothesis (H1), one of:
- A and B do not have the same distribution (two-sided).
- mean(A) > mean(B) (one-sided).
Do you know the distribution of B? Then use a multinomial (simulate samples from the known distribution):
- You know your population has a 50/50 split between M/F.
- You know your population has a given distribution of ethnicities.
- You know your population has a uniform distribution.
Don't know the distribution, but want to compare B to A? Then use a permutation test:
- You want to compare baby weights for smoking mothers and non-smoking mothers.
Test statistics: difference in means, abs difference in means, TVD, K-S, ...
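The known-distribution case (for instance, a population with a 50/50 split between M/F) can be sketched as follows. The observed counts, the number of simulations, and the seed are made up for illustration; TVD is used as the test statistic because the data are categorical.

```python
import numpy as np

# Known population distribution: a 50/50 split between two categories.
expected = np.array([0.5, 0.5])

# Hypothetical observed sample of 100 people (made-up counts).
observed_counts = np.array([62, 38])
n = observed_counts.sum()

# Test statistic: total variation distance (TVD) between the
# observed proportions and the known distribution.
observed_tvd = np.abs(observed_counts / n - expected).sum() / 2

# Simulate many samples of size n under the null hypothesis.
rng = np.random.default_rng(80)  # arbitrary seed
sim_counts = rng.multinomial(n, expected, size=10_000)  # shape (10000, 2)
sim_tvds = np.abs(sim_counts / n - expected).sum(axis=1) / 2

# p-value: fraction of simulated TVDs at least as large as the observed one.
p_value = (sim_tvds >= observed_tvd).mean()
print(observed_tvd, p_value)
```

No shuffling is needed here, because samples can be drawn directly from the known distribution.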
Numeric distribution (histogram):
- difference in means (one-sided)
- abs difference in means (two-sided)
- K-S (two-sided; compares the shape of the distributions, not the difference in means)
Categorical distribution (bar plot):
- TVD (two-sided)
- Special case: if two categories, then abs diff in means can also work
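To make the choice of test statistic concrete, here is a minimal permutation-test sketch with one statistic for numeric data and one for categorical data. The function names, the baby-weight numbers, the number of repetitions, and the seed are all hypothetical; both statistics assume exactly two groups.

```python
import numpy as np
import pandas as pd

def abs_diff_in_means(groups, values):
    """Numeric values, two groups: absolute difference in group means."""
    first, second = np.unique(groups)  # assumes exactly two group labels
    return abs(values[groups == first].mean() - values[groups == second].mean())

def tvd(groups, values):
    """Categorical values, two groups: total variation distance."""
    # One column per group, each column a distribution over categories.
    dists = pd.crosstab(values, groups, normalize="columns")
    return dists.diff(axis=1).iloc[:, -1].abs().sum() / 2

def permutation_test(groups, values, stat, n_repetitions=2000, seed=80):
    """Shuffle the group labels to simulate the statistic under the null."""
    rng = np.random.default_rng(seed)
    observed = stat(groups, values)
    simulated = np.array([
        stat(rng.permutation(groups), values) for _ in range(n_repetitions)
    ])
    # Larger statistic = more evidence against the null hypothesis.
    p_value = (simulated >= observed).mean()
    return observed, p_value

# Made-up baby weights for smoking and non-smoking mothers.
smoker = np.array([True] * 6 + [False] * 6)
weight = np.array([101, 98, 110, 105, 95, 99, 120, 125, 118, 130, 122, 127])

observed, p_value = permutation_test(smoker, weight, abs_diff_in_means)
print(observed, p_value)
```

For a one-sided alternative, the same `permutation_test` could be given a signed difference in means instead of the absolute difference.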
Evaluation / Fairness¶
You should be able to compute precision and recall:
- given confusion matrix, is precision > recall?
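As a quick refresher, both metrics can be computed directly from the cells of a confusion matrix. The counts below are made up for illustration, and `precision_recall` is a hypothetical helper name.

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from confusion matrix counts."""
    precision = tp / (tp + fp)  # of everything predicted positive, how much was right?
    recall = tp / (tp + fn)     # of everything actually positive, how much was found?
    return precision, recall

# Hypothetical confusion matrix: TP = 30, FP = 10, FN = 20, TN = 40.
precision, recall = precision_recall(tp=30, fp=10, fn=20)
print(precision, recall)
```

Since both fractions share the numerator TP, precision is greater than recall exactly when there are fewer false positives than false negatives (assuming TP > 0). True negatives do not appear in either formula.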
Draw the precision-recall curve given:
predicted probabilities | observed labels |
---|---|
0.2 | 0 |
0.7 | 1 |
0.3 | 1 |
0.4 | 0 |
For any threshold strictly between 0.4 and 0.7, the predicted labels are 0, 1, 0, 0, so precision = 1 and recall = 1/2.
0.35
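Each point on the precision-recall curve comes from picking one threshold and recomputing both metrics. The sketch below does this for the four rows in the table above. It assumes a prediction of 1 whenever the probability is at least the threshold, and it treats 0.35 as one of the thresholds to try; the other threshold values are arbitrary choices, and `precision_recall_at` is a hypothetical helper name.

```python
import numpy as np

probs = np.array([0.2, 0.7, 0.3, 0.4])   # predicted probabilities
labels = np.array([0, 1, 1, 0])           # observed labels

def precision_recall_at(threshold, probs, labels):
    """Predict 1 when prob >= threshold, then compute precision and recall."""
    preds = (probs >= threshold).astype(int)
    tp = ((preds == 1) & (labels == 1)).sum()
    fp = ((preds == 1) & (labels == 0)).sum()
    fn = ((preds == 0) & (labels == 1)).sum()
    # Precision is undefined if nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) > 0 else np.nan
    recall = tp / (tp + fn)
    return preds, precision, recall

# One (recall, precision) point per threshold.
for threshold in [0.1, 0.25, 0.35, 0.5]:
    preds, precision, recall = precision_recall_at(threshold, probs, labels)
    print(threshold, preds.tolist(), precision, recall)
```

Plotting recall on the x-axis and precision on the y-axis for these points gives the curve.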
Parting Thoughts¶
Course goals ✅¶
In this course, you...
- Practiced translating potentially vague questions into quantitative questions about measurable observations.
- Learned to reason about 'black-box' processes (e.g. complicated models).
- Understood computational and statistical implications of working with data.
- Learned to use real data tools (e.g. love the documentation!).
- Got a taste of the "life of a data scientist".
Course outcomes ✅¶
Now, you...
- Are prepared for internships and data science "take home" interviews!
- Are ready to create your own portfolio of personal projects.
- Have the background and maturity to succeed in upper-division courses.
Topics covered ✅¶
We learnt a lot this quarter.
- Week 1: From BabyPandas to Pandas
- Week 2: DataFrames
- Week 3: Messy Data, Hypothesis Testing
- Week 4: Missing Values and Imputation
- Week 5: HTTP, Midterm Exam
- Week 6: Web Scraping, Regex
- Week 7: Text Features, Regression
- Week 8: Feature Engineering
- Week 9: Generalization, CV, Decision Trees
- Week 10: Random Forests, Classifier Evaluation
Thank you!¶
This course would not have been possible without our 11 tutors and 1 TA: Praveen Nair, Gabriel Cha, Aritra Das, Mizuho Fukuda, Jasmine Lo, Ylesia Wu, Guoxuan (Jason) Xu, Sunan Xu, Andrew Yang, Diego Zavalza, Luran (Lauren) Zhang, and Qirui (Sara) Zheng.
Don't be a stranger – our contact information is at dsc80.com/staff!
- This quarter's course website will remain online permanently at dsc-courses.github.io.
Apply to be a tutor in the future! Learn more here.
Good luck on the Final Exam, and enjoy your summer break! 🎉