In [1]:
from dsc80_utils import *
Announcements 📣¶
- Lab 8 due tomorrow.
- Final Project Checkpoint 2 due Tuesday.
Agenda¶
- Grid search
- Random forests
- Modeling with text features
- Classifier evaluation
Generalization¶
Question 🤔 (Answer at dsc80.com/q)
Code: 10cv
- Suppose you have a training dataset with 1000 rows.
- You want to decide between 20 candidate hyperparameter values for a particular model.
- To do so, you perform 10-fold cross-validation.
- How many times is the first row in the training dataset (X.iloc[0]) used for training a model?
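One way to check your answer empirically is to simulate the splits. Here's a minimal sketch using sklearn's KFold; the 1000 rows and 20 candidates are stand-ins for the setup described above:
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical setup matching the question: 1000 rows, 20 candidate
# hyperparameter values, 10-fold cross-validation.
n_rows, n_candidates = 1000, 20
kf = KFold(n_splits=10)

# Row 0 lands in the training portion of every fold except the one
# fold in which it is held out for validation.
folds_training_on_row_0 = sum(
    0 in train_idx for train_idx, _ in kf.split(np.zeros((n_rows, 1)))
)

# The same 10 splits are repeated for each of the 20 candidates.
print(folds_training_on_row_0 * n_candidates)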
Summary: Generalization¶
1. Split the data into two sets: training and test.
2. Use only the training data when designing, training, and tuning the model.
    - Use $k$-fold cross-validation to choose hyperparameters and estimate the model's ability to generalize.
    - Do not ❌ look at the test data in this step!
3. Commit to your final model and train it using the entire training set.
4. Test the final model using the test data. If the performance (e.g., RMSE) is not acceptable, return to step 2.
5. Finally, train on all available data and ship the model to production! 🛳 (These steps are sketched in code below.)
🚨 This is the process you should always use! 🚨
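Here's a minimal sklearn sketch of these steps, assuming a hypothetical feature DataFrame X, label Series y, and a decision tree whose max_depth is being tuned:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Steps 1-2: split, then tune using cross-validation on the training set only.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
searcher = GridSearchCV(
    DecisionTreeClassifier(),
    param_grid={'max_depth': list(range(1, 21))},  # hypothetical grid
    cv=5,
)
searcher.fit(X_train, y_train)

# Step 3 happens automatically: GridSearchCV refits the best model
# on all of X_train.

# Step 4: evaluate the committed model on the test set.
test_accuracy = searcher.score(X_test, y_test)

# Step 5: if test_accuracy is acceptable, retrain on all data and ship.
final_model = DecisionTreeClassifier(**searcher.best_params_).fit(X, y)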
Decision trees 🌲¶
Decision trees can be used for both regression and classification. We'll start by using them for classification.
Example: Should I get groceries?¶
Decision trees make classifications by answering a series of yes/no questions.
Should I go to Trader Joe's to buy groceries today?
Internal nodes of the tree ask questions; leaf nodes make predictions $H(x)$.
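Conceptually, a fitted decision tree behaves like nested if/else statements. Here's a hypothetical sketch (the specific questions below are made up for illustration; they aren't from the actual grocery tree):
def should_go_to_trader_joes(fridge_is_empty, is_weekend):
    # Internal nodes ask yes/no questions about the input x...
    if fridge_is_empty:
        return True    # ...and leaf nodes return the prediction H(x).
    if is_weekend:
        return True
    return False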
Example: Predicting diabetes¶
In [2]:
diabetes = pd.read_csv(Path('data') / 'diabetes.csv')
display_df(diabetes, cols=9)
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.63 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.35 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.67 | 32 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.24 | 30 | 0 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.35 | 47 | 1 |
| 767 | 1 | 93 | 70 | 31 | 0 | 30.4 | 0.32 | 23 | 0 |

768 rows × 9 columns
In [3]:
# 0 means no diabetes, 1 means yes diabetes.
diabetes['Outcome'].value_counts()
Out[3]:
Outcome
0    500
1    268
Name: count, dtype: int64
- 'Glucose' is measured in mg/dL (milligrams per deciliter).
- 'BMI' is calculated as $\text{BMI} = \frac{\text{weight (kg)}}{\left[ \text{height (m)} \right]^2}$ (a quick sanity check appears after this list).
- Let's use 'Glucose' and 'BMI' to predict whether or not a patient has diabetes ('Outcome').
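As a quick sanity check of the formula, with hypothetical measurements:
weight_kg, height_m = 70, 1.75   # hypothetical measurements
bmi = weight_kg / height_m ** 2  # 70 / 1.75² ≈ 22.9
bmi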
Exploring the dataset¶
First, a train-test split:
In [4]:
from sklearn.model_selection import train_test_split
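# By default, train_test_split reserves 25% of the rows for the test set.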
X_train, X_test, y_train, y_test = (
train_test_split(diabetes[['Glucose', 'BMI']], diabetes['Outcome'], random_state=1)
)
Class 0 (orange) is "no diabetes" and class 1 (blue) is "diabetes".
In [5]:
fig = (
X_train.assign(Outcome=y_train.astype(str))
.plot(kind='scatter', x='Glucose', y='BMI', color='Outcome',
color_discrete_map={'0': 'orange', '1': 'blue'},
title='Relationship between Glucose, BMI, and Diabetes')
)
fig