from dsc80_utils import *
import lec16_util as util
📣 Announcements 📣¶
- Project 4 due tomorrow!
- Lab 9 out, due Dec 4.
- Final Exam on Mon, Dec 11, 3-6pm in WLH 2005 (our usual lecture room).
- Two cheat sheets allowed (feel free to reuse your midterm sheet).
- More details to come.
📆 Agenda¶
Practice Exam Question 🤔¶
- Suppose you have a training dataset with 1000 rows.
- You want to decide between 20 hyperparameters for a particular model.
- To do so, you perform 10-fold cross-validation.
- How many times is the first row in the training dataset (`X.iloc[0]`) used for training a model?
Review: Bias and Variance¶
np.random.seed(23) # For reproducibility.
def sample_dgp(n=100):
x = np.linspace(-2, 3, n)
y = x ** 3 + (np.random.normal(0, 3, size=n))
return pd.DataFrame({'x': x, 'y': y})
sample_1 = sample_dgp()
sample_2 = sample_dgp()
# Look at the definition of train_and_plot in lec16_util.py if you're curious as to how the plotting works.
fig = util.train_and_plot(train_sample=sample_1, test_sample=sample_1, degs=[1, 3, 25])
fig.update_layout(title='Trained on Sample 1, Performance on Sample 1')
Bias and variance¶
The training data we have access to is a sample from the DGP. We are concerned with our model's ability to generalize and work well on different datasets drawn from the same DGP.
Suppose we fit a model $H$ (e.g. a degree 3 polynomial) on several different datasets from a DGP. There are three sources of error that arise:
- ⭐️ Model Bias: The expected deviation between a predicted value and an actual value.
- In other words, for a given $x_i$, how far is $H(x_i)$ from the true $y_i$, on average?
- Low bias is good! ✅
- High bias is a sign of underfitting, i.e. that our model is too basic to capture the relationship between our features and response.
- ⭐️ Model variance ("variance"): The variance of a model's predictions.
- In other words, for a given $x_i$, what is the variance of $H(x_i)$ across all datasets?
- Low model variance is good! ✅
- High model variance is a sign of overfitting, i.e. that our model is too complicated and is prone to fitting to the noise in our training data.
- Observation variance: The variance due to the random noise in the process we are trying to model (e.g. measurement error). We can't control this, even by collecting more data!
(See hand-written notes from lecture for more detail.)
Implications of Bias and Variance¶
- Risk: $ R(H) = \text{bias}^2 + \text{variance} + \text{irreducible error} $
Model Fit:
- Underfitting = too much bias
- Most overfitting = too much variance
- Training error reflects bias but not variance.
- Test error reflects both bias and variance.
As $n$ increases:
- Generally, $ n\uparrow $ means variance $ \downarrow $.
- If $ H(x) $ can fit the true DGP exactly, then $ n\uparrow $ means bias $ \downarrow $.
- For certain loss functions (e.g. MSE), bias will be 0 if $ H(x) $ can fit the true DGP exactly.
- If $ H(x) $ cannot fit the true DGP well, then bias will be large for most points.
As we add more features:
- Adding a useful feature reduces bias.
- Adding a useless feature doesn't change bias.
- Adding a feature generally increases variance, even if it's useless.
In real life:
- We don't usually know the true DGP, so we can't put actual numbers to the bias-variance decomposition.
- We use a train-test split so that we can estimate $ R(H) $ using the test set.
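Here's a minimal simulation sketch of this decomposition, reusing the sample_dgp function from above: it estimates the bias² and variance of polynomial predictions at the single point $x_0 = 2$, where the true (noiseless) value is $x_0^3$. We'd expect the degree 1 model to be dominated by bias and the degree 25 model by variance.
x0 = 2
true_y0 = x0 ** 3  # The noiseless value of the DGP at x0.
preds = {1: [], 3: [], 25: []}
for _ in range(500):
    sample = sample_dgp()
    for deg in preds:
        # np.polyfit may warn that the degree 25 fit is poorly conditioned.
        coeffs = np.polyfit(sample['x'], sample['y'], deg)
        preds[deg].append(np.polyval(coeffs, x0))
for deg, p in preds.items():
    p = np.array(p)
    print(f'degree {deg}: bias² ≈ {(p.mean() - true_y0) ** 2:.2f}, variance ≈ {p.var():.2f}')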
Example: Linear Regression¶
- If actual DGP is a linear model:
- Bias = 0.
- Variance $ \propto \frac{d}{n} $, where $ d $ is the dimension (number of features) per sample point.
- $ n \uparrow $ = variance $ \downarrow $
- $ d \uparrow $ = variance $ \uparrow $
Summary: Generalization¶
1. Split the data into two sets: training and test.
2. Use only the training data when designing, training, and tuning the model.
    - Use $k$-fold cross-validation to choose hyperparameters and estimate the model's ability to generalize.
    - Do not ❌ look at the test data in this step!
3. Commit to your final model and train it using the entire training set.
4. Test the model using the test data. If the performance (e.g. RMSE) is not acceptable, return to step 2.
5. Finally, train on all available data and ship the model to production! 🛳
🚨 This is the process you should always use! 🚨
Decision trees 🌲¶
Although decision trees can be used for both regression and classification, we'll be using them for classification.
Example: Should I get groceries?¶
- Internal nodes of tree check feature values.
- Leaf nodes of tree specify class $H(x)$.
Example: Predicting diabetes¶
diabetes = pd.read_csv('data/diabetes.csv')
display_df(diabetes, cols=9)
 | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
---|---|---|---|---|---|---|---|---|---|
0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.63 | 50 | 1 |
1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.35 | 31 | 0 |
2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.67 | 32 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.24 | 30 | 0 |
766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.35 | 47 | 1 |
767 | 1 | 93 | 70 | 31 | 0 | 30.4 | 0.32 | 23 | 0 |
768 rows × 9 columns
# 0 means no diabetes, 1 means yes diabetes.
diabetes['Outcome'].value_counts()
0    500
1    268
Name: Outcome, dtype: int64
- `'Glucose'` is measured in mg/dL (milligrams per deciliter).
- `'BMI'` is calculated as $\text{BMI} = \frac{\text{weight (kg)}}{\left[ \text{height (m)} \right]^2}$.
- Let's use `'Glucose'` and `'BMI'` to predict whether or not a patient has diabetes (`'Outcome'`).
Exploring the dataset¶
First, a train-test split:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = (
train_test_split(diabetes[['Glucose', 'BMI']], diabetes['Outcome'], random_state=1)
)
Class 0 (orange) is "no diabetes" and class 1 (blue) is "diabetes".
fig = (
X_train.assign(Outcome=y_train.astype(str))
.plot(kind='scatter', x='Glucose', y='BMI', color='Outcome',
color_discrete_map={'0': 'orange', '1': 'blue'},
title='Relationship between Glucose, BMI, and Diabetes')
)
fig
Building a decision tree¶
Let's build a decision tree and interpret the results.
The relevant class is `DecisionTreeClassifier`, from `sklearn.tree`.
from sklearn.tree import DecisionTreeClassifier
Note that we `fit` it the same way we `fit` earlier estimators.
You may wonder what `max_depth` and `criterion` do – more on this soon!
dt = DecisionTreeClassifier(max_depth=2, criterion='entropy')
dt.fit(X_train, y_train)
DecisionTreeClassifier(criterion='entropy', max_depth=2)
Visualizing decision trees¶
Our fit decision tree is like a "flowchart", made up of a series of questions.
As before, orange is "no diabetes" and blue is "diabetes".
from sklearn.tree import plot_tree
plt.figure(figsize=(15, 5))
plot_tree(dt, feature_names=X_train.columns, class_names=['no db', 'yes db'],
filled=True, fontsize=15, impurity=False);
- To classify a new data point, we start at the top and answer the first question (i.e. "Glucose <= 129.5").
- If the answer is "Yes", we move to the left branch, otherwise we move to the right branch.
- We repeat this process until we end up at a leaf node, at which point we predict the most common class in that node.
- Note that each node has a `value` attribute, which describes the number of training individuals of each class that fell in that node.
# Note that the left node at depth 2 has a `value` of [304, 78].
y_train[X_train.query('Glucose <= 129.5').index].value_counts()
0    304
1     78
Name: Outcome, dtype: int64
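To make the traversal concrete, here's a quick sketch with a hypothetical patient (the feature values below are made up); the prediction follows the flowchart above.
# A hypothetical patient with a Glucose of 140 mg/dL and a BMI of 25.
patient = pd.DataFrame({'Glucose': [140], 'BMI': [25]})
# Since Glucose <= 129.5 is False here, the tree takes the right branch first.
dt.predict(patient)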
Evaluating classifiers¶
The most common evaluation metric in classification is accuracy:
$$\text{accuracy} = \frac{\text{# data points classified correctly}}{\text{# data points}}$$
(dt.predict(X_train) == y_train).mean()
0.765625
The `score` method of a classifier computes accuracy by default (just like the `score` method of a regressor computes $R^2$ by default). We want our classifiers to have high accuracy.
# Training accuracy – same number as above
dt.score(X_train, y_train)
0.765625
# Testing accuracy
dt.score(X_test, y_test)
0.7760416666666666
About decision trees¶
- Can work with categorical features, without one-hot encoding.
- Interpretable predictions.
- Decision boundary can be arbitrarily complicated.
- Works with multi-class classification (i.e. more than 2 possible outcomes).
How do we train?¶
Pseudocode:
def make_tree(X, y):
if all points in y have the same label C:
return Leaf(C)
f = best splitting feature # e.g. Glucose or BMI
v = best splitting value # e.g. 129.5
X_left, y_left = X, y where (X[f] < v)
X_right, y_right = X, y where (X[f] >= v)
left = make_tree(X_left, y_left)
right = make_tree(X_right, y_right)
return Node(f, v, left, right)
make_tree(X_train, y_train)
How do we decide on the best split?¶
- Choose a loss function $ L(X, y) $.
- Try all splits, then pick the one that minimizes $ L(X_{\text{left}}, y_{\text{left}}) + L(X_{\text{right}}, y_{\text{right}}) $.
- What's a good $ L(X, y) $?
Intuition: Suppose the distribution within a node looks like this (colors represent classes):
Split A:
- "Yes": 🟠🟠🟠🔵🔵🔵
- "No": 🟠🟠🟠🔵🔵🔵🔵
Split B:
- "Yes": 🔵🔵🔵🔵🔵🔵
- "No": 🔵🟠🟠🟠🟠🟠🟠
Which split is "better"?
Split B, because there is "less uncertainty" in the resulting nodes in split B than there is in split A.
One (bad) idea:¶
- Label a node with the majority class $ C $.
- $ L(X, y) $ = number of points where $ y \neq C $.
Why is this bad? Suppose we have:
Split A:
- "Yes": 🟠🟠🟠🟠🟠🟠🔵
- "No": 🟠🟠🟠🟠🟠🟠🔵🔵🔵🔵🔵
Split B:
- "Yes": 🟠🟠🟠🟠🟠🟠🔵🔵🔵
- "No": 🟠🟠🟠🟠🟠🟠🔵🔵🔵
We prefer Split A, but $ L(X, y) = 6 $ for both.
A better idea: entropy¶
- For each label $C$ within a node, define $p_C$ as the proportion of points with the label.
- The surprise of drawing a point from the node at random and having it be class $C$ is:
$$-\log_2 p_C$$
- And the entropy of a node is the average surprise over all classes:
$$-\sum_C p_C \log_2 p_C$$
- The entropy of 🟠🟠🟠🟠🟠🟠🟠🟠 is $ -1 \log_2(1) = 0 $.
- The entropy of 🟠🟠🟠🟠🔵🔵🔵🔵 is $ -0.5 \log_2(0.5) - 0.5 \log_2(0.5) = 1 $.
- The entropy of 🟠🔵🟢🟡🟣 is $ -\log_2 \frac{1}{5} = \log_2(5) $.
- In general, if there are $n$ points, all with different labels, the entropy is $ \log_2(n) $.
Entropy Example¶
Suppose we have:
Split A:
- "Yes": 🟠🟠🟠🟠🟠🟠🔵
- "No": 🟠🟠🟠🟠🟠🟠🔵🔵🔵🔵🔵
Split B:
- "Yes": 🟠🟠🟠🟠🟠🟠🔵🔵🔵
- "No": 🟠🟠🟠🟠🟠🟠🔵🔵🔵
def entropy(labels):
props = pd.Series(list(labels)).value_counts() / len(labels)
return -sum(props * np.log2(props))
split_a = entropy("🟠🟠🟠🟠🟠🟠🔵") + entropy("🟠🟠🟠🟠🟠🟠🔵🔵🔵🔵🔵")
split_b = entropy("🟠🟠🟠🟠🟠🟠🔵🔵🔵") + entropy("🟠🟠🟠🟠🟠🟠🔵🔵🔵")
split_a, split_b
(1.5857029900592838, 1.8365916681089791)
Split A has lower entropy, so we'll pick it.
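Here's a rough sketch of the "try all splits" idea for a single numeric feature, reusing the entropy function above. It minimizes the unweighted sum $ L(X_{\text{left}}, y_{\text{left}}) + L(X_{\text{right}}, y_{\text{right}}) $ from earlier; note that sklearn additionally weights each side by its fraction of points, so the threshold found here may differ from the 129.5 it chose.
def best_split(x, y):
    # Candidate thresholds: midpoints between consecutive unique feature values.
    values = np.sort(x.unique())
    candidates = (values[:-1] + values[1:]) / 2
    best_v, best_loss = None, np.inf
    for v in candidates:
        # Unweighted sum of entropies, as in the slides.
        loss = entropy(y[x <= v]) + entropy(y[x > v])
        if loss < best_loss:
            best_v, best_loss = v, loss
    return best_v, best_loss
best_split(X_train['Glucose'], y_train)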
Runtime (optional)¶
Predict a point: traverse tree until leaf.
- Runtime is $ O(\text{tree depth}) $.
- If all features are binary (two categories), then tree depth ≤ $d$ (number of features).
- Usually depth is $O(\log n)$, but not always.
Training:
- For binary features, we need to try $ O(d) $ splits at each node.
- For numeric features, there's a way to check all splits in $ O(n') $ time, where $ n' $ is the number of points in the node. Since there can be $ d $ numeric features, overall runtime is $ O(n'd) $ for each node.
- Each point is used in $ O(\text{depth}) $ nodes, so overall runtime to fit is $ O(nd \cdot \text{depth}) $.
- Since depth is often logarithmic, runtime is pretty fast!
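As an optional, rough empirical check of the training runtime, we can time fits on random data as $n$ grows; with the depth capped, fit times should grow roughly linearly in $n$ (up to log factors).
import time
for n in [1_000, 10_000, 100_000]:
    X_rand = np.random.rand(n, 5)          # d = 5 random features.
    y_rand = np.random.randint(0, 2, n)    # Random binary labels.
    start = time.perf_counter()
    DecisionTreeClassifier(max_depth=10).fit(X_rand, y_rand)
    print(f'n = {n}: {time.perf_counter() - start:.3f} seconds')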
Tree depth¶
Decision trees are trained by recursively picking the best split until:
- all "leaf nodes" only contain training examples from a single class (good), or
- it is impossible to split leaf nodes any further (not good).
By default, there is no "maximum depth" for a decision tree. As such, without restriction, decision trees tend to be very deep.
dt_no_max = DecisionTreeClassifier()
dt_no_max.fit(X_train, y_train)
DecisionTreeClassifier()
A decision tree fit on our training data has a depth of around 20! (It is so deep that `tree.plot_tree` errors when trying to plot it.)
dt_no_max.tree_.max_depth
22
At first, this tree seems "better" than our tree of depth 2, since its training accuracy is much much higher:
dt_no_max.score(X_train, y_train)
0.9913194444444444
# Depth 2 tree.
dt.score(X_train, y_train)
0.765625
But recall, we truly care about test set performance, and this decision tree has worse accuracy on the test set than our depth 2 tree.
dt_no_max.score(X_test, y_test)
0.71875
# Depth 2 tree.
dt.score(X_test, y_test)
0.7760416666666666
Decision trees and overfitting¶
Decision trees have a tendency to overfit. Why is that?
Unlike linear classification techniques (like logistic regression or SVMs), decision trees are non-linear.
- They are also "non-parametric" – there are no $w^*$s to learn.
While being trained, decision trees ask enough questions to effectively memorize the correct response values in the training set. However, the relationships they learn are often overfit to the noise in the training set, and don't generalize well.
fig
A decision tree whose depth is not restricted will achieve 100% accuracy on any training set, as long as there are no "overlapping values" in the training set.
- Two values overlap when they have the same features $x$ but different response values $y$ (e.g. if two patients have the same glucose levels and BMI, but one has diabetes and one doesn't).
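As a quick sanity check, here's a sketch that counts how many feature combinations in our training set overlap in this sense; any conflicts found would help explain why dt_no_max's training accuracy above is 99.1% rather than a perfect 100%.
# Number of (Glucose, BMI) combinations that appear with more than one Outcome.
conflicts = (
    pd.concat([X_train, y_train], axis=1)
    .groupby(['Glucose', 'BMI'])['Outcome']
    .nunique()
)
(conflicts > 1).sum()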
One solution: Make the decision tree "less complex" by limiting the maximum depth.
Since `sklearn.tree`'s `plot_tree` can't visualize extremely large decision trees, let's create and visualize some smaller decision trees.
trees = {}
for d in [2, 4, 8]:
trees[d] = DecisionTreeClassifier(max_depth=d, random_state=1)
trees[d].fit(X_train, y_train)
plt.figure(figsize=(15, 5), dpi=100)
plot_tree(trees[d], feature_names=X_train.columns, class_names=['no db', 'yes db'],
filled=True, rounded=True, impurity=False)
plt.show()
As tree depth increases, complexity increases, and our trees are more prone to overfitting.
Question: What is the "right" maximum depth to choose?
Hyperparameters for decision trees¶
- `max_depth` is a hyperparameter for `DecisionTreeClassifier`.
- There are many more hyperparameters we can tweak; look at the documentation for examples.
    - `min_samples_split`: The minimum number of samples required to split an internal node.
    - `criterion`: The function to measure the quality of a split (`'gini'` or `'entropy'`).
To ensure that our model generalizes well to unseen data, we need an efficient technique for trying different combinations of hyperparameters!
Thinking about bias and variance¶
- Bigger `max_depth` = less bias, more variance.
- Bigger `min_samples_split` = more bias, less variance. (Why?)
Grid search¶
`GridSearchCV` takes in:
- an un-`fit` instance of an estimator, and
- a dictionary of hyperparameter values to try,
and performs $k$-fold cross-validation to find the combination of hyperparameters with the best average validation performance.
from sklearn.model_selection import GridSearchCV
The following dictionary contains the values we're considering for each hyperparameter. (We're using `GridSearchCV` with 3 hyperparameters, but we could use it with even just a single hyperparameter.)
hyperparameters = {
'max_depth': [2, 3, 4, 5, 7, 10, 13, 15, 18, None],
'min_samples_split': [2, 5, 10, 20, 50, 100, 200],
'criterion': ['gini', 'entropy']
}
Note that there are 140 combinations of hyperparameters we need to try. We need to find the best combination of hyperparameters, not the best value for each hyperparameter individually.
len(hyperparameters['max_depth']) * \
len(hyperparameters['min_samples_split']) * \
len(hyperparameters['criterion'])
140
`GridSearchCV` needs to be instantiated and `fit`.
searcher = GridSearchCV(DecisionTreeClassifier(), hyperparameters, cv=5)
%%time
searcher.fit(X_train, y_train)
CPU times: user 1.05 s, sys: 1.68 ms, total: 1.05 s
Wall time: 1.06 s
GridSearchCV(cv=5, estimator=DecisionTreeClassifier(), param_grid={'criterion': ['gini', 'entropy'], 'max_depth': [2, 3, 4, 5, 7, 10, 13, 15, 18, None], 'min_samples_split': [2, 5, 10, 20, 50, 100, 200]})
After being `fit`, the `best_params_` attribute provides us with the best combination of hyperparameters to use.
searcher.best_params_
{'criterion': 'gini', 'max_depth': 4, 'min_samples_split': 50}
All of the intermediate results – validation accuracies for each fold, mean validation accuracies, etc. – are stored in the `cv_results_` attribute:
searcher.cv_results_['mean_test_score'] # Array of length 140.
array([0.73, 0.73, 0.73, ..., 0.75, 0.74, 0.72])
# Rows correspond to folds, columns correspond to hyperparameter combinations.
pd.DataFrame(np.vstack([searcher.cv_results_[f'split{i}_test_score'] for i in range(5)]))
 | 0 | 1 | 2 | 3 | ... | 136 | 137 | 138 | 139 |
---|---|---|---|---|---|---|---|---|---|
0 | 0.71 | 0.71 | 0.71 | 0.71 | ... | 0.70 | 0.68 | 0.71 | 0.73 |
1 | 0.77 | 0.77 | 0.77 | 0.77 | ... | 0.82 | 0.83 | 0.77 | 0.76 |
2 | 0.74 | 0.74 | 0.74 | 0.74 | ... | 0.68 | 0.72 | 0.74 | 0.73 |
3 | 0.70 | 0.70 | 0.70 | 0.70 | ... | 0.77 | 0.79 | 0.76 | 0.70 |
4 | 0.72 | 0.72 | 0.72 | 0.72 | ... | 0.70 | 0.71 | 0.72 | 0.70 |
5 rows × 140 columns
Note that the above DataFrame tells us that 5 * 140 = 700 models were trained in total!
Now that we've found the best combination of hyperparameters, we should fit a decision tree instance using those hyperparameters on our entire training set.
searcher.best_params_
{'criterion': 'gini', 'max_depth': 4, 'min_samples_split': 50}
final_tree = DecisionTreeClassifier(**searcher.best_params_)
final_tree
DecisionTreeClassifier(max_depth=4, min_samples_split=50)
final_tree.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=4, min_samples_split=50)
# Training accuracy.
final_tree.score(X_train, y_train)
0.7881944444444444
# Testing accuracy.
final_tree.score(X_test, y_test)
0.765625
Remember, `searcher` itself is a model object (we had to `fit` it). After performing $k$-fold cross-validation, behind the scenes, `searcher` is trained on the entire training set using the optimal combination of hyperparameters.
In other words, `searcher` makes the same predictions that `final_tree` does!
searcher.score(X_train, y_train)
0.7881944444444444
searcher.score(X_test, y_test)
0.765625
Choosing possible hyperparameter values¶
A full grid search can take a long time.
- In our previous example, we tried 140 combinations of hyperparameters.
- Since we performed 5-fold cross-validation, we trained 700 decision trees under the hood.
Question: How do we pick the possible hyperparameter values to try?
Answer: Trial and error.
- If the "best" choice of a hyperparameter was at an extreme, try increasing the range.
- For instance, if you try
max_depth
s from 32 to 128, and 32 was the best, try includingmax_depths
under 32.
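For instance, since max_depth=4 and min_samples_split=50 both fell in the interior of our grid, a hypothetical follow-up search might zoom in around them instead:
# A finer (hypothetical) grid centered on the best values found above.
finer_grid = {
    'max_depth': [3, 4, 5, 6],
    'min_samples_split': [30, 40, 50, 60, 70],
    'criterion': ['gini'],
}
finer_searcher = GridSearchCV(DecisionTreeClassifier(), finer_grid, cv=5)
finer_searcher.fit(X_train, y_train)
finer_searcher.best_params_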
Key takeaways¶
- Decision trees are trained by finding the best questions to ask using the features in the training data. A good question is one that isolates classes as much as possible.
- Decision trees have a tendency to overfit to training data. One way to mitigate this is by restricting the maximum depth of the tree.
- To efficiently find hyperparameters through cross-validation, use `GridSearchCV`.
    - Specify which values to try for each hyperparameter, and `GridSearchCV` will try all unique combinations of hyperparameters and return the combination with the best average validation performance.
    - `GridSearchCV` is not the only solution – see `RandomizedSearchCV` if you're curious.
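If you are curious, here's a minimal sketch of the `RandomizedSearchCV` alternative: with `n_iter=20`, it samples 20 random combinations from the same grid instead of trying all 140.
from sklearn.model_selection import RandomizedSearchCV
rand_searcher = RandomizedSearchCV(
    DecisionTreeClassifier(), hyperparameters, n_iter=20, cv=5, random_state=1
)
rand_searcher.fit(X_train, y_train)
rand_searcher.best_params_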
- Specify which values to try for each hyperparameter, and