Lecture 24 – Cross-Validation

DSC 80, Spring 2022



Train-test split

Avoiding overfitting

Train-test split 🚆

sklearn.model_selection.train_test_split implements a train-test split for us! 🙏🏼

If X is an array/DataFrame of features and y is an array/Series of responses,

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

randomly splits the features and responses into training and test sets, such that the test set contains 25% of the full dataset.

Let's perform a train/test split on our tips dataset.

Before proceeding, let's check the sizes of X_train and X_test.
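A sketch of this split, using a small synthetic stand-in for the tips dataset (the `X`, `y` names and the 25% test fraction follow the slide; the data itself is made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the tips dataset (the real one has more columns).
rng = np.random.default_rng(42)
tips = pd.DataFrame({'total_bill': rng.uniform(5, 50, size=100)})
tips['tip'] = 0.15 * tips['total_bill'] + rng.normal(0, 1, size=100)

X = tips[['total_bill']]
y = tips['tip']

# test_size=0.25 puts 25% of the rows in the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
print(X_train.shape, X_test.shape)  # (75, 1) (25, 1)
```

Note that `random_state` makes the (otherwise random) split reproducible.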

Example prediction pipeline


  1. Fit a model on the training set.
  2. Evaluate the model on the test set.

Here, we'll use a stand-alone LinearRegression model without a Pipeline, but this process would work the same if we were using a Pipeline.

Let's check our model's performance on the training set first.

And the test set:
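A sketch of both steps on toy data (the `rmse_train`/`rmse_test` names match the slide; the data here is synthetic, not the real tips dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Toy data standing in for the tips features and responses.
rng = np.random.default_rng(42)
X = rng.uniform(5, 50, size=(100, 1))
y = 0.15 * X[:, 0] + rng.normal(0, 1, size=100)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# Step 1: fit on the training set only.
lr = LinearRegression()
lr.fit(X_train, y_train)

# Step 2: evaluate on both sets. RMSE = sqrt(mean squared error); lower is better.
rmse_train = np.sqrt(mean_squared_error(y_train, lr.predict(X_train)))
rmse_test = np.sqrt(mean_squared_error(y_test, lr.predict(X_test)))
print(rmse_train, rmse_test)
```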

Since rmse_train and rmse_test are similar, it doesn't seem like our model is overfitting to the training data. If rmse_test were much larger than rmse_train, that would be evidence that our model is unable to generalize well.


Example: Polynomial regression

Recall that last class, we looked at an example of polynomial regression.

When building these models:

Parameters vs. hyperparameters

Training error vs. test error

Training error vs. test error

First, we perform a train-test split.

Now, we'll implement the logic from the previous slide.

Let's look at both the training RMSEs and test RMSEs we computed.
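The logic might look like the following sketch, with a synthetic cubic dataset standing in for the one from lecture (the degree range and data are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic nonlinear data: a cubic trend plus noise.
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=150)
y = x**3 - 2 * x + rng.normal(0, 2, size=150)
X = x.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

train_errs, test_errs = [], []
for degree in range(1, 11):
    # Pipeline: create polynomial features of this degree, then fit a linear model.
    pl = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    pl.fit(X_train, y_train)
    train_errs.append(np.sqrt(mean_squared_error(y_train, pl.predict(X_train))))
    test_errs.append(np.sqrt(mean_squared_error(y_test, pl.predict(X_test))))

# Choose the degree with the lowest test RMSE.
best_degree = np.argmin(test_errs) + 1
```

Training RMSE keeps shrinking as the degree grows, but test RMSE eventually turns back up, which is exactly the pattern discussed below.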


Here, we'd choose a degree of 3, since that degree has the lowest test error.

Training error vs. test error

The pattern we saw in the previous example is true more generally.

We pick the hyperparameter(s) at the "valley" of test error.

Note that training error tends to underestimate test error, but it doesn't have to – i.e., it is possible for test error to be lower than training error (say, if the test set is "easier" to predict than the training set).

Conducting train-test splits

But wait...


A single validation set

  1. Split the data into three sets: training, validation, and test.
  2. For each hyperparameter choice, train the model only on the training set, and evaluate the model's performance on the validation set.
  3. Find the hyperparameter with the best validation performance.
  4. Retrain the final model on the training and validation sets, and report its performance on the test set.

Issue: This strategy is too dependent on the validation set, which may be small and/or not a representative sample of the data.

$k$-fold cross-validation

Instead of relying on a single validation set, we can create $k$ validation sets, where $k$ is some positive integer (5 in the following example).

Since each data point is used for training $k-1$ times and validation once, the (averaged) validation performance should be a good metric of a model's ability to generalize to unseen data.

$k$-fold cross-validation

First, shuffle the dataset randomly and split it into $k$ disjoint groups. Then:

Creating folds in sklearn

sklearn has a KFold class that splits data into training and validation folds.

Let's use a simple dataset for illustration.

Let's instantiate a KFold object with $k=3$.

Finally, let's use kfold to split data:
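A sketch of the split, using the six values from this slide (the `random_state` is an assumption, added so the shuffle is reproducible):

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.array([10, 20, 30, 40, 50, 60])

# n_splits=3 gives three folds; shuffle=True randomizes fold membership.
kfold = KFold(n_splits=3, shuffle=True, random_state=1)

# Each iteration yields index arrays for one (training, validation) split.
for train_idx, val_idx in kfold.split(data):
    print('train:', data[train_idx], 'validation:', data[val_idx])
```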

Note that each value in data is used for validation exactly once and for training exactly twice. Also note that because we set shuffle=True the groups are not simply [10, 20], [30, 40], and [50, 60].

"Manual" $k$-fold cross-validation in sklearn

Note that for each choice of degree (our hyperparameter), we have five RMSEs, one for each "fold" of the data.

We should choose the degree with the lowest average validation RMSE.
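One way the "manual" loop might look, again on a synthetic cubic dataset standing in for the one from lecture (the degree range and data are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic nonlinear data: a cubic trend plus noise.
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=150)
y = x**3 - 2 * x + rng.normal(0, 2, size=150)
X = x.reshape(-1, 1)

kfold = KFold(n_splits=5, shuffle=True, random_state=1)
avg_val_rmses = {}
for degree in range(1, 6):
    fold_rmses = []
    # Five validation RMSEs per degree, one per fold.
    for train_idx, val_idx in kfold.split(X):
        pl = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        pl.fit(X[train_idx], y[train_idx])
        pred = pl.predict(X[val_idx])
        fold_rmses.append(np.sqrt(mean_squared_error(y[val_idx], pred)))
    # Average the validation RMSEs across folds.
    avg_val_rmses[degree] = np.mean(fold_rmses)

best_degree = min(avg_val_rmses, key=avg_val_rmses.get)
```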

Note that if we had used a single train-test split instead of $k$-fold cross-validation, we might have picked a different degree:

"Semi-automatic" $k$-fold cross-validation in sklearn

The cross_val_score function in sklearn implements a few of the previous steps in one.

cross_val_score(estimator, data, target, cv)

Specifically, it takes in:

    • an estimator to fit (e.g. a LinearRegression instance or a Pipeline),
    • the data (features) and target (responses), and
    • cv, the number of folds $k$,

and performs $k$-fold cross-validation, returning the values of the scoring metric on each fold.
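A sketch of one call, on the same kind of synthetic cubic data as before. Note that sklearn scorers follow a "higher is better" convention, so to get RMSEs we pass `scoring='neg_root_mean_squared_error'` and negate the result:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic nonlinear data: a cubic trend plus noise.
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=150)
y = x**3 - 2 * x + rng.normal(0, 2, size=150)
X = x.reshape(-1, 1)

pl = make_pipeline(PolynomialFeatures(3), LinearRegression())

# cross_val_score fits pl on each training fold and scores each validation fold.
errs = cross_val_score(pl, X, y, cv=5, scoring='neg_root_mean_squared_error')
rmses = -errs  # one validation RMSE per fold
print(rmses.mean())
```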

That was considerably easier! Next class, we'll look at how to streamline this procedure even more (no loop necessary).

Note: You may notice that the RMSEs in the above table, particularly in Folds 1 and 5, are much higher than they were in the manual method. Can you think of reasons why, and how we might fix this? (Hint: Go back to the "manual" method and switch shuffle to False. What do you notice?)

Summary: Generalization

  1. Split the data into two sets: training and test.

  2. Use only the training data when designing, training, and tuning the model.

    • Use cross-validation to choose hyperparameters and estimate the model's ability to generalize.
    • Do not ❌ look at the test data in this step!
  3. Commit to your final model and train it using the entire training set.

  4. Evaluate the final model using the test data. If the performance (e.g. RMSE) is not acceptable, return to step 2.

  5. Finally, train on all available data and ship the model to production! 🛳

🚨 This is the process you should always use! 🚨

Discussion Question 🤔

Example: Decision trees 🌲

Decision trees can be used for both regression and classification. We will start by discussing their use in classification.

Example: Predicting diabetes

For illustration, we'll use 'Glucose' and 'BMI' to predict whether or not a patient has diabetes (the response variable is in the 'Outcome' column).

Building a decision tree

Let's build a decision tree and interpret the results. But first, a train-test split:

The relevant class is DecisionTreeClassifier, from sklearn.tree.

Note that we fit it the same way we fit earlier estimators.
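A sketch of the fit, with a small synthetic stand-in for the diabetes dataset (the 'Glucose', 'BMI', and 'Outcome' column names follow the slides; the rows and the rule generating 'Outcome' are made up):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the diabetes dataset.
rng = np.random.default_rng(0)
n = 300
diabetes = pd.DataFrame({
    'Glucose': rng.normal(120, 30, size=n),
    'BMI': rng.normal(32, 7, size=n),
})
diabetes['Outcome'] = ((diabetes['Glucose'] > 130) & (diabetes['BMI'] > 30)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    diabetes[['Glucose', 'BMI']], diabetes['Outcome'], random_state=1
)

# max_depth=2 limits the tree to two levels of questions.
dt = DecisionTreeClassifier(max_depth=2)
dt.fit(X_train, y_train)
```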

_You may wonder what max_depth=2 does – more on this soon!_

Visualizing decision trees

Our fit decision tree is like a "flowchart", made up of a series of questions.

Class 0 (orange) is "no diabetes"; Class 1 (blue) is "diabetes".
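One way to draw the flowchart is sklearn's `plot_tree` function; a sketch on toy data (the feature and class names below mirror the slides, but the data is synthetic):

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Toy data standing in for the Glucose/BMI features.
rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

dt = DecisionTreeClassifier(max_depth=2)
dt.fit(X, y)

fig, ax = plt.subplots(figsize=(8, 5))
# Each node shows its splitting question, impurity, sample counts, and class.
artists = plot_tree(dt, feature_names=['Glucose', 'BMI'],
                    class_names=['no diabetes', 'diabetes'],
                    filled=True, ax=ax)
```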

Evaluating classifiers

The most common evaluation metric in classification is accuracy:

$$\text{accuracy} = \frac{\text{# data points classified correctly}}{\text{# data points}}$$

The score method of a classifier computes accuracy by default.
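A sketch on toy data showing that `score` returns the fraction of correctly classified points:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy two-feature classification data.
rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

dt = DecisionTreeClassifier(max_depth=2)
dt.fit(X_train, y_train)

# For a classifier, score computes accuracy on the given data by default.
test_acc = dt.score(X_test, y_test)
print(test_acc)
```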

Some questions...

The answers will come next class!

Summary, next time