In [1]:
from dsc80_utils import *
Announcements 📣¶
- Final Project Checkpoint 2 due tomorrow.
- Lab 9 (the last one!) due Wednesday.
Final Exam 📝¶
- Saturday, June 8, 8-11am. Location TBD.
- The exam is written to take about 2 hours, so you'll have plenty of time to double-check your work.
- Two 8.5"x11" cheat sheets of your own creation are allowed (handwritten on a tablet, then printed, is okay).
- Covers every lecture, lab, and project.
- Similar format to the midterm: mix of fill-in-the-blank, multiple choice, and free response.
- I use `pandas` fill-in-the-blank questions to test your ability to read and write code, not just write code from scratch, which is why they can feel trickier.
- Questions on the final about pre-Midterm material will be marked as "M". Your Midterm grade will be the higher of your (z-score adjusted) grades on the Midterm and the questions marked as "M" on the final.
Agenda¶
- Grid search
- Random forests
- Modeling with text features
- Classifier evaluation
Decision Trees¶
Question 🤔 (Answer at q.dsc80.com)
(Fa23 Final 10.1)
Suppose we fit decision trees of varying depths to predict 'y' using 'x1' and 'x2'. The entire training set is shown in the table below.
What is:
- The entropy of a node containing all the training points.
- The lowest possible entropy of a node in a fitted tree with depth 1 (two leaf nodes).
- The lowest possible entropy of a node in a fitted tree with depth 2 (four leaf nodes).
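To check your answers, entropy can be computed with a short helper. This is a minimal sketch; the training table isn't reproduced here, so the node passed in below is a hypothetical example, not the actual data from the problem.

```python
import numpy as np

def entropy(labels):
    """Entropy (in bits) of a node containing the given class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Hypothetical node with three points of class 0 and one of class 1:
entropy([0, 0, 0, 1])  # ≈ 0.811 bits
```

A pure node (all one class) has entropy 0, and a perfectly mixed two-class node has entropy 1 bit, which bounds the answers above.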
Example: Predicting Diabetes¶
In [2]:
from sklearn.model_selection import train_test_split
diabetes = pd.read_csv(Path('data') / 'diabetes.csv')
X_train, X_test, y_train, y_test = (
train_test_split(diabetes[['Glucose', 'BMI']], diabetes['Outcome'], random_state=1)
)
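After splitting, the natural next step is to fit a decision tree to the training set. The sketch below is a hypothetical continuation: since `diabetes.csv` isn't bundled here, it uses a small synthetic stand-in with the same column names, and `max_depth=2` is an illustrative choice, not a value from the lecture.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for diabetes.csv (same columns, made-up values).
diabetes = pd.DataFrame({
    'Glucose': [85, 168, 89, 137, 116, 78, 197, 110],
    'BMI':     [26.6, 43.1, 28.1, 31.0, 25.6, 31.0, 30.5, 37.6],
    'Outcome': [0, 1, 0, 1, 0, 0, 1, 0],
})

X_train, X_test, y_train, y_test = train_test_split(
    diabetes[['Glucose', 'BMI']], diabetes['Outcome'], random_state=1
)

# Fit a shallow tree; limiting depth guards against overfitting.
tree = DecisionTreeClassifier(max_depth=2, random_state=1)
tree.fit(X_train, y_train)
train_acc = tree.score(X_train, y_train)
```

With the real dataset, you would compare `tree.score(X_train, y_train)` against `tree.score(X_test, y_test)` to gauge overfitting as depth grows.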
In [3]:
fig = (
X_train.assign(Outcome=y_train.astype(str))
.plot(kind='scatter', x='Glucose', y='BMI', color='Outcome',
color_discrete_map={'0': 'orange', '1': 'blue'},
title='Relationship between Glucose, BMI, and Diabetes')
)
fig