Lecture 24 – Decision Trees, Grid Search, Multicollinearity

DSC 80, Winter 2023

Announcements

Agenda

Cross-validation

Recap

$k$-fold cross-validation

Instead of relying on a single validation set, we can create $k$ validation sets, where $k$ is some positive integer (5 in the example below).

Since each data point is used for training $k-1$ times and validation once, the (averaged) validation performance should be a good metric of a model's ability to generalize to unseen data.

$k$-fold cross-validation (or simply "cross-validation") is the technique we will use for finding hyperparameters.

$k$-fold cross-validation

First, shuffle the dataset randomly and split it into $k$ disjoint groups. Then, for each group:

  1. Use that group as the validation set and train the model on the remaining $k-1$ groups.

  2. Compute the model's error on the held-out group.

Finally, average the $k$ validation errors to get a single estimate of the model's ability to generalize.

As a reminder, here's what "sample 1" looks like.

$k$-fold cross-validation in sklearn

Soon, we'll look at how to implement this procedure without needing to for-loop over values of d.
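For reference, here's a rough sketch of that for-loop, with synthetic data standing in for the lecture's "sample 1" and degrees 1 through 25 assumed (25 degrees times 5 folds would match the 125 models mentioned below):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data standing in for the lecture's "sample 1".
rng = np.random.default_rng(23)
sample_1 = pd.DataFrame({'x': rng.uniform(-2, 2, size=100)})
sample_1['y'] = sample_1['x'] ** 3 + rng.normal(scale=0.5, size=100)

errs = {}
for d in range(1, 26):  # 25 degrees * 5 folds = 125 models in total
    pl = make_pipeline(PolynomialFeatures(d), LinearRegression())
    # Five negative RMSEs per degree, one per fold of the training data.
    scores = cross_val_score(pl, sample_1[['x']], sample_1['y'],
                             cv=5, scoring='neg_root_mean_squared_error')
    errs[d] = -scores.mean()  # average validation RMSE for degree d

best_degree = min(errs, key=errs.get)  # degree with the lowest average validation RMSE
```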

$k$-fold cross-validation in sklearn

Note that for each choice of degree (our hyperparameter), we have five RMSEs, one for each "fold" of the data. This means that in total, 125 models were trained/fit to data!

We should choose the degree with the lowest average validation RMSE.

Note that if we didn't perform $k$-fold cross-validation, but instead just used a single validation set, we may have ended up with a different result:

Note: You may notice that the RMSEs in Folds 1 and 5 are significantly higher than in other folds. Can you think of reasons why, and how we might fix this?

Another example: Tips

We can also use $k$-fold cross-validation to determine which subset of features to use in a linear model that predicts tips (though, as you'll see, the code is not pretty).

As we should always do, we'll perform a train-test split on tips and will only use the training data for cross-validation.
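A hedged sketch of this process, using seaborn's copy of the tips dataset; the candidate feature subsets below are illustrative and may differ from the ones compared in lecture:

```python
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

tips = sns.load_dataset('tips')
X_train, X_test, y_train, y_test = train_test_split(
    tips[['total_bill', 'size']], tips['tip'], random_state=1
)

# Candidate feature subsets (illustrative only).
subsets = [['total_bill'], ['size'], ['total_bill', 'size']]
for cols in subsets:
    scores = cross_val_score(LinearRegression(), X_train[cols], y_train,
                             cv=5, scoring='neg_root_mean_squared_error')
    print(cols, -scores.mean())  # average validation RMSE for this subset
```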

Even though the third model has the lowest average validation RMSE, its average validation RMSE is very close to that of the other, simpler models, and as a result we'd likely use the simplest model in practice.

Summary: Generalization

  1. Split the data into two sets: training and test.

  2. Use only the training data when designing, training, and tuning the model.

    • Use $k$-fold cross-validation to choose hyperparameters and estimate the model's ability to generalize.
    • Do not ❌ look at the test data in this step!
  3. Commit to your final model and train it using the entire training set.

  4. Evaluate the model using the test data. If the performance (e.g. RMSE) is not acceptable, return to step 2.

  5. Finally, train on all available data and ship the model to production! 🛳

🚨 This is the process you should always use! 🚨

Discussion Question 🤔

Example: Decision trees 🌲

Decision trees can be used for both regression and classification. We will start by discussing their use in classification.

Example: Predicting diabetes

Exploring the dataset

Class 0 (orange) is "no diabetes" and class 1 (blue) is "diabetes".

Building a decision tree

Let's build a decision tree and interpret the results. But first, a train-test split:

The relevant class is DecisionTreeClassifier, from sklearn.tree.

Note that we fit it the same way we fit earlier estimators.

_You may wonder what max_depth=2 does – more on this soon!_
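A minimal sketch of these steps, assuming the data is in a DataFrame named diabetes whose 'Outcome' column is 1 for "diabetes" and 0 for "no diabetes" (the variable and column names are assumptions based on the standard Pima diabetes dataset, not necessarily the exact ones from lecture):

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumed feature columns; the lecture may use a different subset.
X = diabetes[['Glucose', 'BMI']]
y = diabetes['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

dt = DecisionTreeClassifier(max_depth=2)
dt.fit(X_train, y_train)  # fit works the same way as with earlier estimators
```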

Visualizing decision trees

Our fit decision tree is like a "flowchart", made up of a series of questions.

As before, orange is "no diabetes" and blue is "diabetes".
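One way to draw the fitted tree is sklearn's plot_tree (a sketch; the exact plotting options used in lecture may differ):

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Draw the fitted tree as a flowchart of yes/no questions.
plt.figure(figsize=(10, 5))
plot_tree(dt, feature_names=list(X_train.columns),
          class_names=['no diabetes', 'diabetes'], filled=True, rounded=True)
plt.show()
```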

Evaluating classifiers

The most common evaluation metric in classification is accuracy:

$$\text{accuracy} = \frac{\text{# data points classified correctly}}{\text{# data points}}$$

The score method of a classifier computes accuracy by default (just like the score method of a regressor computes $R^2$ by default). We want our classifiers to have high accuracy.
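For instance, with the depth-2 tree fit above (a sketch continuing from the earlier code):

```python
# Accuracy on the training and test sets.
dt.score(X_train, y_train)
dt.score(X_test, y_test)

# score is equivalent to computing accuracy "by hand":
(dt.predict(X_test) == y_test).mean()
```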

Some questions...

Training a decision tree

When we ask a question, we are effectively splitting a node into two children – the "yes" child and the "no" child.

Suppose the distribution within a node looks like this (colors represent classes):

🟠🟠🟠🔵🔵🔵🔵🔵🔵🔵

Question A splits the node like this:

Question B splits the node like this:

Which question is "better"?

Question B, because there is "less uncertainty" in the resulting nodes after splitting by Question B than there is after splitting by Question A. There are two common techniques for quantifying "uncertainty": Gini impurity and entropy.

Not the focus of our course, but read more!
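For instance, here's how both measures could be computed for the ten-point node above (3 orange, 7 blue); this is a supplementary sketch, not code from the lecture:

```python
import numpy as np

# Class proportions in the node: 3 orange, 7 blue.
p = np.array([3, 7]) / 10

gini = 1 - np.sum(p ** 2)          # 1 - (0.3**2 + 0.7**2) = 0.42
entropy = -np.sum(p * np.log2(p))  # about 0.881 bits

# Both measures are 0 for a "pure" node and largest when the two classes are 50/50.
```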

Tree depth

Decision trees are trained by recursively picking the best split until all leaves are "pure" (contain data points from only one class), or until some other stopping criterion (such as a maximum depth) is reached.

By default, there is no "maximum depth" for a decision tree. As such, without restriction, decision trees tend to be very deep.

A decision tree fit on our training data has a depth of around 20! (It is so deep that tree.plot_tree errors when trying to plot it.)

At first, this tree seems "better" than our tree of depth 2, since its training accuracy is much much higher:

But recall, we truly care about test set performance, and this decision tree has worse accuracy on the test set than our depth 2 tree.
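A sketch of this comparison, continuing from the diabetes code above:

```python
# With no maximum depth, the tree keeps splitting until its leaves are (nearly) pure.
dt_deep = DecisionTreeClassifier()
dt_deep.fit(X_train, y_train)

dt_deep.get_depth()              # very deep (around 20 on the lecture's data)
dt_deep.score(X_train, y_train)  # near-perfect training accuracy...
dt_deep.score(X_test, y_test)    # ...but lower test accuracy than the depth-2 tree
```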

Decision trees and overfitting

Since sklearn.tree's plot_tree can't visualize extremely large decision trees, let's create and visualize some smaller decision trees.

As tree depth increases, complexity increases, and our trees are more prone to overfitting.

Question: What is the "right" maximum depth to choose?

Hyperparameters for decision trees

GridSearchCV takes in:

  • an un-fit instance of an estimator, and
  • a dictionary of hyperparameter values to try,

and performs $k$-fold cross-validation to find the combination of hyperparameters with the best average validation performance.

The following dictionary contains the values we're considering for each hyperparameter. (We're using GridSearchCV with 3 hyperparameters, but we could use it with even just a single hyperparameter.)
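One illustrative grid with $7 \times 10 \times 2 = 140$ combinations; the specific candidate values are assumptions and may differ from the lecture's:

```python
# An illustrative grid: 7 * 10 * 2 = 140 combinations in total.
# The specific candidate values are assumptions, not the lecture's exact grid.
hyperparameters = {
    'max_depth': [2, 3, 4, 5, 7, 10, None],
    'min_samples_split': [2, 5, 10, 20, 50, 100, 150, 200, 250, 300],
    'criterion': ['gini', 'entropy'],
}
```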

Note that there are 140 combinations of hyperparameters we need to try. We need to find the best combination of hyperparameters, not the best value for each hyperparameter individually.

GridSearchCV needs to be instantiated and fit.

After being fit, the best_params_ attribute provides us with the best combination of hyperparameters to use.
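A sketch, using the grid above and the diabetes training data from earlier:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# 5-fold cross-validation over every combination in `hyperparameters`.
searcher = GridSearchCV(DecisionTreeClassifier(), hyperparameters, cv=5)
searcher.fit(X_train, y_train)

searcher.best_params_  # the combination with the highest average validation accuracy
```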

All of the intermediate results – validation accuracies for each fold, mean validation accuracies, etc. – are stored in the cv_results_ attribute:
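For example, wrapping it in a DataFrame (continuing the sketch above) gives one row per hyperparameter combination:

```python
import pandas as pd

# cv_results_ is a dictionary; a DataFrame makes it easier to read.
pd.DataFrame(searcher.cv_results_)
```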

Note that the above DataFrame tells us that 5 * 140 = 700 models were trained in total!

Now that we've found the best combination of hyperparameters, we should fit a decision tree instance using those hyperparameters on our entire training set.

Remember, searcher itself is a model object (we had to fit it). After performing $k$-fold cross-validation, behind the scenes, searcher is trained on the entire training set using the optimal combination of hyperparameters.

In other words, searcher makes the same predictions that final_tree does!
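A sketch of both options, continuing from the searcher above (GridSearchCV's default refit=True is what triggers that final re-training):

```python
# Option 1: fit a fresh tree on the full training set with the best hyperparameters.
final_tree = DecisionTreeClassifier(**searcher.best_params_)
final_tree.fit(X_train, y_train)

# Option 2: use searcher directly, since (with refit=True, the default)
# it was already re-fit on the full training set after cross-validation.
(searcher.predict(X_test) == final_tree.predict(X_test)).all()
```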

Choosing possible hyperparameter values

Key takeaways

Multicollinearity

Heights and weights

We have a dataset containing the weights and heights of 25,000 18-year-olds, taken from here.

Motivating example

Suppose we fit a simple linear regression model that uses height in inches to predict weight in pounds.

$$\text{predicted weight (pounds)} = w_0 + w_1 \cdot \text{height (inches)}$$

$w_0^*$ and $w_1^*$ are shown below, along with the model's testing RMSE.
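A hedged sketch of this first model, assuming the data is in a DataFrame named heights_weights with columns 'Height (Inches)' and 'Weight (Pounds)' (the variable name and exact column names are assumptions):

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = heights_weights[['Height (Inches)']]
y = heights_weights['Weight (Pounds)']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

lr_one = LinearRegression()
lr_one.fit(X_train, y_train)
lr_one.intercept_, lr_one.coef_  # w0* and w1*
```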

Now, suppose we fit another regression model, that uses height in inches AND height in centimeters to predict weight.

$$\text{predicted weight (pounds)} = w_0 + w_1 \cdot \text{height (inches)} + w_2 \cdot \text{height (cm)}$$

What are $w_0^*$, $w_1^*$, $w_2^*$, and the model's testing RMSE?

Observation: The intercept is the same as before (roughly -81.17), as is the testing RMSE. However, the coefficients on 'Height (Inches)' and 'Height (cm)' are massive in size!
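A sketch of how the second model could be fit, continuing from above; the redundant 'Height (cm)' column is just 2.54 times the height in inches:

```python
# Add a perfectly redundant feature: height in cm = 2.54 * height in inches.
X_train_both = X_train.assign(**{'Height (cm)': X_train['Height (Inches)'] * 2.54})

lr_both = LinearRegression()
lr_both.fit(X_train_both, y_train)
lr_both.intercept_, lr_both.coef_  # same intercept, but huge, unstable coefficients
```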

What's going on?

Redundant features

Let's use simpler numbers for illustration. Suppose in the first model, $w_0^* = -80$ and $w_1^* = 3$.

$$\text{predicted weight (pounds)} = -80 + 3 \cdot \text{height (inches)}$$

In the second model, we have:

$$\begin{align*}\text{predicted weight (pounds)} &= w_0^* + w_1^* \cdot \text{height (inches)} + w_2^* \cdot \text{height (cm)} \\ &= w_0^* + w_1^* \cdot \text{height (inches)} + w_2^* \cdot \big( 2.54 \cdot \text{height (inches)} \big) \\ &= w_0^* + \left(w_1^* + 2.54 \cdot w_2^* \right) \cdot \text{height (inches)} \end{align*}$$

In the first model, we already found the "best" intercept ($-80$) and slope ($3$) in a linear model that uses height in inches to predict weight.

So, as long as $w_1^* + 2.54 \cdot w_2^* = 3$ in the second model, the second model's training predictions will be the same as the first, and hence they will also minimize RMSE.

Infinitely many parameter choices

Issue: There are an infinite number of $w_1^*$ and $w_2^*$ that satisfy $w_1^* + 2.54 \cdot w_2^* = 3$!

$$\text{predicted weight} = -80 - 10 \cdot \text{height (inches)} + \frac{13}{2.54} \cdot \text{height (cm)}$$
$$\text{predicted weight} = -80 + 10 \cdot \text{height (inches)} - \frac{7}{2.54} \cdot \text{height (cm)}$$
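As a quick check, all of these parameter choices make the same prediction, e.g. for a 70-inch-tall (177.8 cm) person (a supplementary arithmetic check using the simplified numbers above):

```python
h_in = 70
h_cm = 2.54 * h_in

# All three parameterizations predict (up to floating-point rounding) 130 pounds.
print(-80 + 3 * h_in)
print(-80 - 10 * h_in + (13 / 2.54) * h_cm)
print(-80 + 10 * h_in - (7 / 2.54) * h_cm)
```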

Multicollinearity

Key takeaways

Summary, next time

Summary

See the individual sections for more specific "key takeaways".

Next time