Lecture 23 – Pipelines and Evaluation

DSC 80, Spring 2022

For fun: read this article from CBS 8 San Diego 💰🚒🚔.

Last year, the city paid \$52.93 million in overtime to the fire department. The San Diego Police Department received the second-highest amount of any department, \$36.6 million.



Models in sklearn

Example: Predicting 'tip' from 'total_bill' and 'size'

First, we instantiate and fit. By calling fit, we are saying "minimize mean squared error and find $w^*$".

After fitting, the predict method is available. Note that the argument to predict can be any 2D array with two columns.
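A sketch of this instantiate/fit/predict workflow is below. The DataFrame here is a tiny, hand-made stand-in for the full tips dataset (the values are illustrative, not the lecture's).

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Tiny stand-in for the tips dataset.
tips = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01, 23.68, 24.59],
    'size':       [2, 3, 3, 2, 4],
    'tip':        [1.01, 1.66, 3.50, 3.31, 3.61],
})

lr = LinearRegression()                            # instantiate
lr.fit(tips[['total_bill', 'size']], tips['tip'])  # minimize MSE, find w*

# predict accepts any 2D array (or DataFrame) with two columns.
preds = lr.predict(pd.DataFrame({'total_bill': [15, 100], 'size': [2, 8]}))
```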

We can access the intercepts and slopes individually. This model is of the form

$$\text{predicted tip} = w_0^* + w_1^* \cdot \text{total bill} + w_2^* \cdot \text{table size}$$

so we should expect three parameters total.
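As a sketch, on the same stand-in data as above (so the actual parameter values here are not the lecture's): the intercept $w_0^*$ lives in lr.intercept_, and the two slopes $w_1^*, w_2^*$ live in lr.coef_.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Tiny stand-in for the tips dataset.
tips = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01, 23.68, 24.59],
    'size':       [2, 3, 3, 2, 4],
    'tip':        [1.01, 1.66, 3.50, 3.31, 3.61],
})

lr = LinearRegression().fit(tips[['total_bill', 'size']], tips['tip'])

w0 = lr.intercept_   # w0*: the intercept
w1, w2 = lr.coef_    # w1*, w2*: slopes for total bill and table size
```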

If we want to compute the RMSE of our model, we need to find its predictions on every row in the training data (tips).
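A sketch of that computation, again on stand-in data: predict on every training row, then take the square root of the mean squared error.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Tiny stand-in for the tips dataset.
tips = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01, 23.68, 24.59],
    'size':       [2, 3, 3, 2, 4],
    'tip':        [1.01, 1.66, 3.50, 3.31, 3.61],
})
X, y = tips[['total_bill', 'size']], tips['tip']
lr = LinearRegression().fit(X, y)

all_preds = lr.predict(X)                     # predictions on every row
rmse = np.sqrt(mean_squared_error(y, all_preds))
```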

It turns out that fit LinearRegression objects also have a score method:

That doesn't look like the RMSE... what is it? 🤔

Aside: $R^2$

$$R^2 = \frac{\text{var}(\text{predicted $y$ values})}{\text{var}(\text{actual $y$ values})}$$

$$R^2 = \left[ \text{correlation}(\text{predicted $y$ values}, \text{actual $y$ values}) \right]^2$$

Calculating $R^2$

Recall, all_preds contains the predicted 'tip' for every data point in tips.

Method 1: $R^2 = \frac{\text{var}(\text{predicted $y$ values})}{\text{var}(\text{actual $y$ values})}$

Method 2: $R^2 = \left[ \text{correlation}(\text{predicted $y$ values}, \text{actual $y$ values}) \right]^2$

Note: By correlation here, we are referring to $r$.

Method 3: lr.score

All three methods provide the same result!
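A sketch verifying that the three computations agree, on the same stand-in data used above:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Tiny stand-in for the tips dataset.
tips = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01, 23.68, 24.59],
    'size':       [2, 3, 3, 2, 4],
    'tip':        [1.01, 1.66, 3.50, 3.31, 3.61],
})
X = tips[['total_bill', 'size']]
y = tips['tip'].to_numpy()
lr = LinearRegression().fit(X, y)
all_preds = lr.predict(X)

r2_var = np.var(all_preds) / np.var(y)            # Method 1: ratio of variances
r2_corr = np.corrcoef(all_preds, y)[0, 1] ** 2    # Method 2: correlation squared
r2_score = lr.score(X, y)                         # Method 3: lr.score
```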

LinearRegression summary

| Property | Example | Description |
| --- | --- | --- |
| Initialize model parameters | lr = LinearRegression() | Create (empty) linear regression model |
| Fit the model to the data | lr.fit(data, responses) | Determines regression coefficients |
| Use model for prediction | lr.predict(newdata) | Use regression line to make predictions |
| Evaluate the model | lr.score(data, responses) | Calculate the $R^2$ of the LR model |
| Access model attributes | lr.coef_ | Access the regression coefficients |

Note: Once fit, estimators like LinearRegression behave just like transformers, with predict playing the role of transform.


So far, we've used transformers for feature engineering and models for prediction. We can combine these steps into a single Pipeline.

Pipelines in sklearn

Creating a Pipeline

Let's build a Pipeline that performs our feature transformations and then fits a linear regression model to predict 'tip'.

Now that pl is instantiated, we fit it the same way we would fit the individual steps.

Now, to make predictions using raw data, all we need to do is use pl.predict:

pl performs both feature transformation and prediction with just a single call to predict!

We can access individual "steps" of a Pipeline through the named_steps attribute:
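A sketch of the whole workflow described above; StandardScaler is an assumed choice of transformation step, and the data is a tiny stand-in for tips.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Tiny stand-in for the tips dataset.
tips = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01, 23.68, 24.59],
    'size':       [2, 3, 3, 2, 4],
    'tip':        [1.01, 1.66, 3.50, 3.31, 3.61],
})
X, y = tips[['total_bill', 'size']], tips['tip']

pl = Pipeline([
    ('scaler', StandardScaler()),     # feature transformation step
    ('lin-reg', LinearRegression()),  # prediction step
])

pl.fit(X, y)           # fits the transformer, then the model
preds = pl.predict(X)  # transforms, then predicts, in one call

coefs = pl.named_steps['lin-reg'].coef_  # access an individual step
```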

More sophisticated Pipelines

Let's perform different transformations on the quantitative and categorical features of tips (so, we will not transform 'tip').

Now, let's create a Pipeline using preproc as a transformer, and fit it:

Prediction is as easy as calling predict:

pl also has a score method, the same way a fit LinearRegression instance does:

Recall, we can access the individual "steps" in pl using the named_steps attribute:

Note: ColumnTransformer has a remainder argument that you can use to specify what to do with columns that aren't being transformed ('drop' or 'passthrough').
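A sketch of this more sophisticated setup. Standardizing the quantitative columns and one-hot encoding a categorical column are assumed choices, and the data is a tiny stand-in for tips with one categorical column.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LinearRegression

# Tiny stand-in for the tips dataset, including a categorical column.
tips = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01, 23.68, 24.59, 25.29],
    'size':       [2, 3, 3, 2, 4, 4],
    'sex':        ['Female', 'Male', 'Male', 'Male', 'Female', 'Male'],
    'tip':        [1.01, 1.66, 3.50, 3.31, 3.61, 4.71],
})
X, y = tips.drop(columns=['tip']), tips['tip']

# Different transformations for quantitative and categorical features.
preproc = ColumnTransformer(
    [
        ('quant', StandardScaler(), ['total_bill', 'size']),
        ('cat', OneHotEncoder(), ['sex']),
    ],
    remainder='drop',  # drop any columns not listed above
)

pl = Pipeline([('preprocessor', preproc), ('lin-reg', LinearRegression())])
pl.fit(X, y)

r2 = pl.score(X, y)  # R^2, just like a fit LinearRegression
```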

Model evaluation 🧪


Evaluating the quality of a model

Example: Overfitting and underfitting

Let's collect two samples $\{(x_i, y_i)\}$ from the same data generating process.

For now, let's just look at Sample 1. The relationship between $x$ and $y$ is roughly cubic; that is, $y \approx x^3$ (remember, in reality, you won't get to see the DGP).

Let's fit three polynomial models on Sample 1:

The PolynomialFeatures transformer will be helpful here.

Below, we look at our three models' predictions on Sample 1 (which they were trained on).

The degree 15 polynomial has the lowest RMSE on Sample 1.
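A sketch of the experiment: fit degree 1, 3, and 15 polynomials to a synthetic sample whose DGP is roughly cubic. This sample is made up to mirror the lecture's setup; the lecture's actual samples differ.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(23)
x = np.sort(rng.uniform(-2, 2, 100)).reshape(-1, 1)
y = x.ravel() ** 3 + rng.normal(0, 3, size=100)  # y ≈ x^3, plus noise

rmses = {}
for d in [1, 3, 15]:
    pl = Pipeline([
        ('poly', PolynomialFeatures(degree=d)),  # expand x into 1, x, ..., x^d
        ('lin-reg', LinearRegression()),
    ])
    pl.fit(x, y)
    rmses[d] = np.sqrt(mean_squared_error(y, pl.predict(x)))

# On the training sample, higher degree means lower RMSE — but as Sample 2
# shows, that is not evidence of a better model.
```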

How do things look in Sample 2?

What if we fit a degree 1, degree 3, and degree 15 polynomial on Sample 2 as well?

Key idea: The degree 15 polynomial seems to vary more than the degree 3 and 1 polynomials do.

Bias and variance

The training data we have access to is a sample from the DGP. We are concerned with our model's performance across different datasets from the same DGP.

Suppose we fit a model $H$ (e.g. a degree 3 polynomial) on several different datasets from a DGP. There are three sources of error that arise:

- Bias: on average across datasets, how far the model's predictions are from the true values.
- Model variance: how much the model's predictions vary from training dataset to training dataset.
- Observation variance: the irreducible noise in the observations themselves.

Avoiding overfitting

Train-test split 🚆

sklearn.model_selection.train_test_split implements a train-test split for us! 🙏🏼

If X is an array/DataFrame of features and y is an array/Series of responses,

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

randomly splits the features and responses into training and test sets, such that the test set contains 25% of the rows of the full dataset.

Let's perform a train/test split on our tips dataset.

Before proceeding, let's check the sizes of X_train and X_test.
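A sketch of the split and the size check, on a small stand-in for tips (with 8 rows, test_size=0.25 puts 2 rows in the test set).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny stand-in for the tips dataset.
tips = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26.88],
    'size':       [2, 3, 3, 2, 4, 4, 2, 4],
    'tip':        [1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.00, 3.12],
})
X = tips[['total_bill', 'size']]
y = tips['tip']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=23
)

X_train.shape, X_test.shape  # rows split 6 / 2 here
```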

Example prediction pipeline


Here, we'll use a stand-alone LinearRegression model without a Pipeline, but this process would work the same if we were using a Pipeline.

Let's check our model's performance on the training set first.

And the test set:

Since rmse_train and rmse_test are similar, it doesn't seem like our model is overfitting to the training data. If rmse_test were much larger than rmse_train, it would be evidence that our model is unable to generalize well.
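The comparison above can be sketched end-to-end on synthetic linear data (made up here; the lecture uses the tips dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(23)
X = rng.uniform(3, 50, size=(200, 2))                            # two features
y = 0.15 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 1, size=200)  # linear DGP

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=23
)

lr = LinearRegression().fit(X_train, y_train)  # fit on the training set only
rmse_train = np.sqrt(mean_squared_error(y_train, lr.predict(X_train)))
rmse_test = np.sqrt(mean_squared_error(y_test, lr.predict(X_test)))
# Similar rmse_train and rmse_test → no strong sign of overfitting.
```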

Summary, next time