Lecture 25 – Decision Trees, Grid Search, and Multicollinearity

DSC 80, Spring 2022

Announcements

Agenda

Recap: Generalization

  1. Split the data into two sets: training and test.

  2. Use only the training data when designing, training, and tuning the model.

    • Use cross-validation to choose hyperparameters and estimate the model's ability to generalize.
    • Do not ❌ look at the test data in this step!
  3. Commit to your final model and train it using the entire training set.

  4. Assess the model's performance using the test data. If the performance (e.g. RMSE) is not acceptable, return to step 2.

  5. Finally, train on all available data and ship the model to production! 🛳

🚨 This is the process you should always use! 🚨
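As a concrete sketch of these five steps (X and y below are placeholders for your features and response, and the decision tree is just a stand-in for whatever model you're tuning):

```python
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Step 1: set aside a test set that we won't touch until the end.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Step 2: use cross-validation on the training set only to design and tune the model.
candidate = DecisionTreeClassifier(max_depth=2)
cv_scores = cross_val_score(candidate, X_train, y_train, cv=5)

# Step 3: commit to a final model and train it on the entire training set.
final_model = DecisionTreeClassifier(max_depth=2)
final_model.fit(X_train, y_train)

# Step 4: evaluate once on the test set. If this is unacceptable, return to step 2.
final_model.score(X_test, y_test)

# Step 5: train on all available data before shipping.
final_model.fit(X, y)
```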

Discussion Question 🤔

We won't answer this question in class, but it's a good exam-prep question!

Example: Decision trees 🌲 and grid searching

Decision trees can be used for both regression and classification. We will start by discussing their use in classification.

Example: Predicting diabetes

For illustration, we'll use 'Glucose' and 'BMI' to predict whether or not a patient has diabetes (the response variable is in the 'Outcome' column).

Building a decision tree

Let's build a decision tree and interpret the results. But first, a train-test split:
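A sketch of that split, assuming the data lives in a DataFrame named diabetes (the DataFrame name is an assumption; the column names come from the description above):

```python
from sklearn.model_selection import train_test_split

X = diabetes[['Glucose', 'BMI']]
y = diabetes['Outcome']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
```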

The relevant class is DecisionTreeClassifier, from sklearn.tree.

Note that we fit it the same way we fit earlier estimators.
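For instance, continuing the sketch above:

```python
from sklearn.tree import DecisionTreeClassifier

# Fit a shallow tree on the training data only.
dt = DecisionTreeClassifier(max_depth=2)
dt.fit(X_train, y_train)
```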

_You may wonder what max_depth=2 does – more on this soon!_

Visualizing decision trees

Our fit decision tree is like a "flowchart", made up of a series of questions.

Class 0 (orange) is "no diabetes"; Class 1 (blue) is "diabetes".
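One way to draw that flowchart (a sketch using sklearn.tree.plot_tree and the fitted dt from above):

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(10, 5))
plot_tree(dt, feature_names=['Glucose', 'BMI'],
          class_names=['no diabetes', 'diabetes'],
          filled=True, rounded=True)
plt.show()
```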

Evaluating classifiers

The most common evaluation metric in classification is accuracy:

$$\text{accuracy} = \frac{\text{# data points classified correctly}}{\text{# data points}}$$

The score method of a classifier computes accuracy by default.
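Continuing the decision tree sketch from above:

```python
from sklearn.metrics import accuracy_score

dt.score(X_train, y_train)                    # training accuracy
dt.score(X_test, y_test)                      # testing accuracy

# Equivalent to the second call above.
accuracy_score(y_test, dt.predict(X_test))
```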

Some questions...

Training a decision tree

When we ask a question, we are effectively splitting a node into two children – the "yes" child and the "no" child.

Suppose the distribution within a node looks like this (colors represent classes):

🔵🔵🔵🔵🔵🔵🔵🔴🔴🔴

Question A splits the node like this:

Question B splits the node like this:

Which question is "better"?

Question B, because there is "less uncertainty" in the resulting nodes after splitting by Question B than there is after splitting by Question A. There are two common techniques for quantifying "uncertainty" in a node:

  • Gini impurity (the default in sklearn's DecisionTreeClassifier).
  • Entropy (from information theory).

Not the focus of our course, but read more!
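As a quick worked example (not from the original notebook), here is the uncertainty of the 7-blue/3-red node above under both measures:

```python
import numpy as np

p = np.array([7 / 10, 3 / 10])        # class proportions in the node

gini = 1 - np.sum(p ** 2)             # Gini impurity: 1 - (0.7**2 + 0.3**2) = 0.42
entropy = -np.sum(p * np.log2(p))     # entropy: roughly 0.881 bits
```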

Tree depth

Decision trees are trained by recursively picking the best split until:

  • all leaf nodes are "pure," i.e. they contain training examples from only one class, or
  • it is impossible to split a leaf node any further.

By default, there is no "maximum depth" for a decision tree. As such, without restriction, decision trees tend to be very deep.

A decision tree fit on our training data has a depth of around 20! (It is so deep that tree.plot_tree errors when trying to plot it.)

At first, this tree seems "better" than our tree of depth 2, since its training accuracy is much much higher:

But recall, we truly care about test set performance, and this decision tree has worse accuracy on the test set than our depth 2 tree.
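A sketch of that comparison, reusing the training and test sets from before:

```python
from sklearn.tree import DecisionTreeClassifier

# With no max_depth, the tree keeps splitting until it can't split any further.
dt_unrestricted = DecisionTreeClassifier()
dt_unrestricted.fit(X_train, y_train)

dt_unrestricted.get_depth()                  # far deeper than 2
dt_unrestricted.score(X_train, y_train)      # very high training accuracy...
dt_unrestricted.score(X_test, y_test)        # ...but worse testing accuracy than the depth-2 tree
```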

Decision trees and overfitting

Since sklearn.tree's plot_tree can't visualize extremely large decision trees, let's create and visualize some smaller decision trees.

As tree depth increases, complexity increases.

Question: What is the right maximum depth to choose?
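One way to explore that question before formally tuning anything is to sweep over a few depths by hand (a sketch, reusing the split from before):

```python
from sklearn.tree import DecisionTreeClassifier

for depth in [1, 2, 3, 5, 10, None]:
    dt_d = DecisionTreeClassifier(max_depth=depth)
    dt_d.fit(X_train, y_train)
    print(depth,
          round(dt_d.score(X_train, y_train), 3),   # training accuracy keeps climbing...
          round(dt_d.score(X_test, y_test), 3))     # ...while testing accuracy eventually drops
```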

Hyperparameters for decision trees

GridSearchCV takes in:

  • an un-fit instance of an estimator, and
  • a dictionary of hyperparameter values to try,

and performs $k$-fold cross-validation to find the combination of hyperparameters with the best average validation performance.

The following dictionary contains the values we're considering for each hyperparameter. (We're using GridSearchCV with 3 hyperparameters, but we could use it with even just a single hyperparameter.)
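The exact values from lecture aren't reproduced here; the grid below is a hypothetical one whose sizes ($10 \cdot 7 \cdot 2$) match the 140 combinations mentioned next:

```python
hyperparameters = {
    'max_depth': [2, 3, 4, 5, 7, 10, 13, 15, 18, None],    # 10 values
    'min_samples_split': [2, 5, 10, 20, 50, 100, 200],     # 7 values
    'criterion': ['gini', 'entropy'],                       # 2 values
}
```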

Note that there are 140 combinations of hyperparameters we need to try. We need to find the best combination of hyperparameters, not the best value for each hyperparameter individually.

GridSearchCV needs to be instantiated and fit.

After being fit, the best_params_ attribute provides us with the best combination of hyperparameters to use.
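A sketch of both steps (the variable names are assumptions):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

searcher = GridSearchCV(
    DecisionTreeClassifier(),   # an un-fit instance of the estimator
    hyperparameters,            # the dictionary of values to try
    cv=5,                       # 5-fold cross-validation
)
searcher.fit(X_train, y_train)

searcher.best_params_
```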

All of the intermediate results – validation accuracies for each fold, mean validation accuracies, etc. – are stored in the cv_results_ attribute:
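For instance, wrapping it in a DataFrame makes it easier to inspect:

```python
import pandas as pd

pd.DataFrame(searcher.cv_results_)
```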

Note that the above DataFrame tells us that $5 \cdot 140 = 700$ models were trained in total!

Question: How is the following line of code making predictions?

Which model's testing accuracy is shown below?
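Neither the code nor its output survives in this copy; a plausible reconstruction (an assumption) is below. Both calls go through the best estimator, which GridSearchCV refits on the entire training set by default (refit=True), so that refit model's testing accuracy is what's reported.

```python
searcher.predict(X_test)          # predictions from the refit best estimator
searcher.score(X_test, y_test)    # that same model's testing accuracy
```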

Key takeaways

Multicollinearity

Motivating example

Suppose we fit a simple linear regression model that uses height in inches to predict weight in pounds.

$$\text{predicted weight} = w_0 + w_1 \cdot \text{height (inches)}$$

$w_0^*$ and $w_1^*$ are shown below, along with the model's testing RMSE.
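The output isn't reproduced here; a sketch of the fit is below. The 'Height(Inches)' and 'Height(cm)' column names are referenced later in this section, while the people DataFrame and the 'Weight(Pounds)' column are assumptions:

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# 'people' is a hypothetical DataFrame of heights and weights.
X_tr, X_te, y_tr, y_te = train_test_split(
    people[['Height(Inches)', 'Height(cm)']], people['Weight(Pounds)'], random_state=1
)

# Model 1: height in inches only.
lr_one = LinearRegression()
lr_one.fit(X_tr[['Height(Inches)']], y_tr)

lr_one.intercept_, lr_one.coef_     # w0* and w1*

# Testing RMSE of the one-feature model.
mean_squared_error(y_te, lr_one.predict(X_te[['Height(Inches)']]), squared=False)
```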

Now, suppose we fit another regression model that uses height in inches AND height in centimeters to predict weight.

$$\text{predicted weight} = w_0 + w_1 \cdot \text{height (inches)} + w_2 \cdot \text{height (cm)}$$

What are $w_0^*$, $w_1^*$, $w_2^*$, and the model's testing RMSE?
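Continuing the sketch above:

```python
# Model 2: height in inches AND height in centimeters (a redundant copy of the same information).
lr_both = LinearRegression()
lr_both.fit(X_tr, y_tr)

lr_both.intercept_, lr_both.coef_   # roughly the same intercept, but enormous coefficients

# Testing RMSE: essentially unchanged from the one-feature model.
mean_squared_error(y_te, lr_both.predict(X_te), squared=False)
```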

Observation: The intercept is the same as before (roughly -81.17), as is the testing RMSE. However, the coefficients on 'Height(Inches)' and 'Height(cm)' are massive in size!

What's going on?

Redundant features

Let's use simpler numbers for illustration. Suppose in the first model, $w_0^* = -80$ and $w_1^* = 3$.

$$\text{predicted weight} = -80 + 3 \cdot \text{height (inches)}$$

In the second model, we have:

$$\begin{align*}\text{predicted weight} &= w_0^* + w_1^* \cdot \text{height (inches)} + w_2^* \cdot \text{height (cm)} \\ &= w_0^* + w_1^* \cdot \text{height (inches)} + w_2^* \cdot \big( 2.54 \cdot \text{height (inches)} \big) \\ &= w_0^* + \left(w_1^* + 2.54 \cdot w_2^* \right) \cdot \text{height (inches)} \end{align*}$$

In the first model, we already found the "best" intercept and slope in a linear model that uses height in inches to predict weight ($-80$ and $3$).

So, as long as $w_1^* + 2.54 \cdot w_2^* = 3$ in the second model, the second model's training predictions will be the same as the first, and hence they will also minimize RMSE.

Infinitely many parameter choices

Issue: There are an infinite number of $w_1^*$ and $w_2^*$ that satisfy $w_1^* + 2.54 \cdot w_2^* = 3$!

$$\text{predicted weight} = -80 - 10 \cdot \text{height (inches)} + \frac{13}{2.54} \cdot \text{height (cm)}$$
$$\text{predicted weight} = -80 + 10 \cdot \text{height (inches)} - \frac{7}{2.54} \cdot \text{height (cm)}$$
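A quick numeric check of that claim (a standalone sketch, not from the original notebook):

```python
import numpy as np

height_inches = np.array([60, 65, 70, 75])
height_cm = 2.54 * height_inches

pred_1 = -80 + 3 * height_inches
pred_2 = -80 - 10 * height_inches + (13 / 2.54) * height_cm
pred_3 = -80 + 10 * height_inches - (7 / 2.54) * height_cm

np.allclose(pred_1, pred_2) and np.allclose(pred_2, pred_3)   # True
```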

Multicollinearity

One-hot encoding and multicollinearity

When we one-hot encode categorical features, we create several redundant columns.

Aside: You can use pd.get_dummies in EDA, but don't use it for modeling (instead, use OneHotEncoder, which works with Pipelines).

Remember that under the hood, LinearRegression() creates a design matrix that has a column of all ones (for the intercept term). Let's add that column above for demonstration.
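A sketch of that demonstration, assuming the categorical features come from the familiar tips dataset (consistent with the column counts at the end of this section):

```python
import pandas as pd
import seaborn as sns

tips = sns.load_dataset('tips')
tips_features = tips[['sex', 'smoker', 'day', 'time']]

# One-hot encode every categorical column, then prepend a column of all ones,
# mimicking the intercept column in LinearRegression's design matrix.
design = pd.get_dummies(tips_features)
design.insert(0, 'intercept', 1)
design
```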

Now, many of the above columns can be written as linear combinations of other columns!

Note that if we get rid of the four redundant columns above, the rank of our design matrix – that is, the number of linearly independent columns it has – does not change (and so the "predictive power" of our features doesn't change either).

However, without the redundant columns, there is only a single unique set of optimal parameters $w^*$, and the multicollinearity is no more.

Aside: Most one-hot encoding techniques (including OneHotEncoder) have a built-in drop argument, which allows you to specify that you'd like to drop one column per categorical feature.
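A sketch, applied to the same tips_features as above:

```python
from sklearn.preprocessing import OneHotEncoder

# Drop the first category of each feature to remove the redundant columns.
ohe = OneHotEncoder(drop='first')
encoded = ohe.fit_transform(tips_features)
encoded.shape   # (number of rows, 6)
```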

The above array only has $(2-1) + (2-1) + (4-1) + (2-1) = 6$ columns, rather than $2 + 2 + 4 + 2 = 10$, since we dropped 1 per categorical column in tips_features.

Key takeaways

Summary, next time

Summary