Lecture 25 – Decision Trees, Grid Search, and Multicollinearity

DSC 80, Spring 2022

Announcements

Agenda

Recap: Generalization

  1. Split the data into two sets: training and test.

  2. Use only the training data when designing, training, and tuning the model.

    • Use cross-validation to choose hyperparameters and estimate the model's ability to generalize.
    • Do not ❌ look at the test data in this step!
  3. Commit to your final model and train it using the entire training set.

  4. Evaluate your final model on the test data. If its performance (e.g. RMSE) is not acceptable, return to step 2.

  5. Finally, train on all available data and ship the model to production! 🛳

🚨 This is the process you should always use! 🚨
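As a minimal sketch of this workflow, assuming a feature matrix `X` and response `y` have already been loaded (the variable names and the candidate depths below are placeholders, and a decision tree is used only as a stand-in estimator):

```python
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Step 1: set aside a test set that we won't touch until the very end.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Step 2: use cross-validation on the *training set only* to compare hyperparameters.
for depth in [2, 5]:
    scores = cross_val_score(DecisionTreeClassifier(max_depth=depth), X_train, y_train, cv=5)
    print(depth, scores.mean())

# Step 3: commit to the chosen hyperparameters and train on the entire training set.
final_model = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)

# Step 4: evaluate on the test set, exactly once.
final_model.score(X_test, y_test)

# Step 5: retrain on all available data before shipping.
shipped_model = DecisionTreeClassifier(max_depth=2).fit(X, y)
```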

Discussion Question 🤔

We won't answer this question in class, but it's a good exam-prep question!

Example: Decision trees 🌲 and grid searching

Decision trees can be used for both regression and classification. We will start by discussing their use in classification.

Example: Predicting diabetes

For illustration, we'll use 'Glucose' and 'BMI' to predict whether or not a patient has diabetes (the response variable is in the 'Outcome' column).

Building a decision tree

Let's build a decision tree and interpret the results. But first, a train-test split:
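A sketch of the split, assuming the dataset has been read into a DataFrame named `diabetes` (the variable name is an assumption):

```python
from sklearn.model_selection import train_test_split

# Features: 'Glucose' and 'BMI'; response: 'Outcome' (1 = diabetes, 0 = no diabetes).
X = diabetes[['Glucose', 'BMI']]
y = diabetes['Outcome']

# Hold out part of the data for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
```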

The relevant class is DecisionTreeClassifier, from sklearn.tree.

Note that we fit it the same way we fit earlier estimators.

_You may wonder what max_depth=2 does – more on this soon!_
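For instance, a tree of depth at most 2 might be fit like this (a sketch following the usual instantiate-then-fit pattern):

```python
from sklearn.tree import DecisionTreeClassifier

# max_depth=2 limits the tree to at most two levels of questions.
dt = DecisionTreeClassifier(max_depth=2)
dt.fit(X_train, y_train)

# Like other sklearn estimators, the fit tree can now make predictions.
dt.predict(X_test.head())
```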

Visualizing decision trees

Our fit decision tree is like a "flowchart", made up of a series of questions.

Class 0 (orange) is "no diabetes"; Class 1 (blue) is "diabetes".
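One way to draw the flowchart is with plot_tree from sklearn.tree (the figure size and labels below are illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Each internal node shows the question asked; each leaf shows the predicted class.
plt.figure(figsize=(10, 5))
plot_tree(dt, feature_names=['Glucose', 'BMI'],
          class_names=['no diabetes', 'diabetes'], filled=True)
plt.show()
```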

Evaluating classifiers

The most common evaluation metric in classification is accuracy:

$$\text{accuracy} = \frac{\text{\# of data points classified correctly}}{\text{\# of data points}}$$

The score method of a classifier computes accuracy by default.
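For example, using the tree fit above:

```python
# score returns accuracy for classifiers: the proportion of correct predictions.
dt.score(X_train, y_train)
dt.score(X_test, y_test)

# Equivalently, test accuracy computed by hand:
(dt.predict(X_test) == y_test).mean()
```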

Some questions...

Training a decision tree

When we ask a question, we are effectively splitting a node into two children – the "yes" child and the "no" child.

Suppose the distribution within a node looks like this (colors represent classes):

🔵🔵🔵🔵🔵🔵🔵🔴🔴🔴

Question A splits the node like this:

Question B splits the node like this:

Which question is "better"?

Question B, because there is "less uncertainty" in the resulting nodes after splitting by Question B than after splitting by Question A. There are two common techniques for quantifying this "uncertainty": Gini impurity and entropy.

Not the focus of our course, but read more!
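As a concrete illustration, here is a minimal sketch of both measures applied to the node above (7 blue points, 3 red points):

```python
import numpy as np

# Class proportions in the node: 7 blue, 3 red.
p = np.array([7 / 10, 3 / 10])

# Gini impurity: 0 for a pure node, larger when classes are mixed.
gini = 1 - np.sum(p ** 2)          # 0.42

# Entropy (in bits): 0 for a pure node, 1 for a 50/50 split.
entropy = -np.sum(p * np.log2(p))  # ≈ 0.88

gini, entropy
```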

Tree depth

Decision trees are trained by recursively picking the best split until the leaf nodes are pure (contain points from only one class), or until some other stopping criterion (e.g. a maximum depth) is reached.

By default, a decision tree has no "maximum depth", and without that restriction, trained trees tend to be very deep.

A decision tree fit on our training data has a depth of around 20! (It is so deep that tree.plot_tree errors when trying to plot it.)

At first glance, this tree seems "better" than our depth-2 tree, since its training accuracy is much higher:

But recall that we ultimately care about test set performance, and this deep tree has worse accuracy on the test set than our depth-2 tree.
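A sketch of how this comparison might be made (the exact numbers will vary with the data and the split):

```python
# With no max_depth, the tree keeps splitting until its leaves are (nearly) pure.
dt_deep = DecisionTreeClassifier().fit(X_train, y_train)

dt_deep.get_depth()              # depth of the fit tree

dt_deep.score(X_train, y_train)  # very high training accuracy
dt_deep.score(X_test, y_test)    # noticeably lower test accuracy
```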

Decision trees and overfitting

Since sklearn.tree's plot_tree can't visualize extremely large decision trees, let's create and visualize some smaller decision trees.
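One way to do this is to loop over a few small depths (the depths chosen below are arbitrary):

```python
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Fit and draw trees of increasing (but still small) depth.
for depth in [1, 2, 3]:
    small_tree = DecisionTreeClassifier(max_depth=depth).fit(X_train, y_train)
    plt.figure(figsize=(8, 4))
    plot_tree(small_tree, feature_names=['Glucose', 'BMI'],
              class_names=['no diabetes', 'diabetes'], filled=True)
    plt.title(f'max_depth = {depth}')
    plt.show()
```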