In [1]:
import pandas as pd
import numpy as np
import os

import plotly.express as px
import plotly.graph_objects as go
pd.options.plotting.backend = 'plotly'
TEMPLATE = 'seaborn'

from sklearn.linear_model import LinearRegression

import util

Lecture 22 – Pipelines, Generalization¶

DSC 80, Spring 2023¶

Agenda¶

  • The modeling process.
  • Transformers in sklearn.
  • Pipelines.
  • Generalization.

The modeling process¶

The modeling process¶

  1. Create (engineer) features to best reflect the "meaning" behind data.
  2. Choose a model that is appropriate to capture the relationships between features ($X$) and the target/response ($y$).
  3. Select a loss function and fit the model (i.e., determine $w^*$).
  4. Evaluate the model (e.g. using RMSE or $R^2$).

We can perform all of the above directly in sklearn!

preprocessing and linear_model¶

For the feature engineering step of the modeling pipeline, we will use sklearn's preprocessing module.

For the model creation step of the modeling pipeline, we will use sklearn's linear_model module, as we've already seen. linear_model.LinearRegression is an example of an estimator class.

Transformers in sklearn¶

Transformer classes¶

  • Transformers take in "raw" data and output "processed" data. They are used for creating features.
  • The input to a transformer should be a multi-dimensional numpy array.
    • Inputs can be DataFrames, but sklearn only looks at the values (i.e. it calls to_numpy() on input DataFrames).
  • The output of a transformer is a numpy array (never a DataFrame or Series).
  • Transformers, like most of sklearn's tools, are classes, not functions, meaning you need to instantiate them and call their methods. (A minimal sketch of this pattern follows this list.)
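For instance, here is a minimal sketch of the instantiate-fit-transform pattern, using StandardScaler (covered in more detail below) on a toy array; this isn't part of the tips case study:

# A sketch of the general transformer workflow.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0]])  # Input must be 2D.
scaler = StandardScaler()            # 1. Instantiate the transformer (a class, not a function).
scaler.fit(X)                        # 2. Fit it on data (here, learn the column's mean and SD).
scaler.transform(X)                  # 3. Transform; the output is a numpy array, never a DataFrame.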

Case study: Restaurant tips 🧑‍🍳¶

We'll continue working with our trusty tips dataset.

In [2]:
tips = px.data.tips()
tips.head()
Out[2]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

Example transformer: Binarizer¶

The Binarizer transformer allows us to map a quantitative sequence to a sequence of 1s and 0s, depending on whether values are above or below a threshold.

Property Example Description
Initialize with parameters binar = Binarizer(threshold=thresh) Set x=1 if x > thresh, else 0
Transform data in a dataset feat = binar.transform(data) Binarize all columns in data

First, we need to import the relevant class from sklearn.preprocessing. (Tip: import just the relevant classes you need from sklearn.)

In [3]:
from sklearn.preprocessing import Binarizer

Let's try binarizing 'total_bill'. We'll say a "large" bill is one that is strictly greater than $20.

In [4]:
tips['total_bill'].head()
Out[4]:
0    16.99
1    10.34
2    21.01
3    23.68
4    24.59
Name: total_bill, dtype: float64

First, we initialize a Binarizer object with the threshold we want.

In [5]:
bi = Binarizer(threshold=20)

Then, we call bi's transform method and pass it the data we'd like to transform. Note that its input and output are both 2D.

In [6]:
transformed_bills = bi.transform(tips[['total_bill']]) # Must give transform a 2D array/DataFrame.
transformed_bills[:5]
/Users/larry/opt/anaconda3/envs/dsc80/lib/python3.8/site-packages/sklearn/base.py:434: UserWarning: X has feature names, but Binarizer was fitted without feature names
  warnings.warn(
Out[6]:
array([[0.],
       [0.],
       [1.],
       [1.],
       [1.]])

Example transformer: StandardScaler¶

  • StandardScaler standardizes data using the mean and standard deviation of the data.
$$z(x_i) = \frac{x_i - \text{mean of } x}{\text{SD of } x}$$
  • Unlike Binarizer, StandardScaler requires some knowledge (mean and SD) of the dataset before transforming.
  • As such, we need to fit a StandardScaler transformer before we can use the transform method.
  • Typical usage: fit the transformer on a sample, then use that fitted transformer to transform future data.

Example transformer: StandardScaler¶

It only makes sense to standardize the already-quantitative features of tips, so let's select just those.

In [7]:
tips_quant = tips[['total_bill', 'size']]
tips_quant.head()
Out[7]:
total_bill size
0 16.99 2
1 10.34 3
2 21.01 3
3 23.68 2
4 24.59 4

Let's initialize a StandardScaler object.

In [8]:
from sklearn.preprocessing import StandardScaler
In [9]:
stdscaler = StandardScaler()

Note that the following does not work! The error message is very helpful.

In [10]:
stdscaler.transform(tips_quant)
---------------------------------------------------------------------------
NotFittedError                            Traceback (most recent call last)
Input In [10], in <cell line: 1>()
----> 1 stdscaler.transform(tips_quant)

File ~/opt/anaconda3/envs/dsc80/lib/python3.8/site-packages/sklearn/preprocessing/_data.py:970, in StandardScaler.transform(self, X, copy)
    955 def transform(self, X, copy=None):
    956     """Perform standardization by centering and scaling.
    957 
    958     Parameters
   (...)
    968         Transformed array.
    969     """
--> 970     check_is_fitted(self)
    972     copy = copy if copy is not None else self.copy
    973     X = self._validate_data(
    974         X,
    975         reset=False,
   (...)
    980         force_all_finite="allow-nan",
    981     )

File ~/opt/anaconda3/envs/dsc80/lib/python3.8/site-packages/sklearn/utils/validation.py:1208, in check_is_fitted(estimator, attributes, msg, all_or_any)
   1203     fitted = [
   1204         v for v in vars(estimator) if v.endswith("_") and not v.startswith("__")
   1205     ]
   1207 if not fitted:
-> 1208     raise NotFittedError(msg % {"name": type(estimator).__name__})

NotFittedError: This StandardScaler instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

Instead, we need to first call the fit method on stdscaler.

In [11]:
# This is like saying "determine the mean and SD of each column in tips_quant".
stdscaler.fit(tips_quant)
Out[11]:
StandardScaler()

Now, transform will work.

In [12]:
# First column is 'total_bill', second column is 'size'.
tips_quant_z = stdscaler.transform(tips_quant)
tips_quant_z[:5]
Out[12]:
array([[-0.31471131, -0.60019263],
       [-1.06323531,  0.45338292],
       [ 0.1377799 ,  0.45338292],
       [ 0.4383151 , -0.60019263],
       [ 0.5407447 ,  1.50695847]])

We can also access the mean and variance stdscaler computed for each column:

In [13]:
stdscaler.mean_
Out[13]:
array([19.78594262,  2.56967213])
In [14]:
stdscaler.var_
Out[14]:
array([78.92813149,  0.9008835 ])
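As a quick sanity check (a sketch, not part of the original demo), we can reproduce the transformed values by hand using the formula from above together with the fitted mean_ and var_ attributes:

# z = (x - mean) / SD, where SD = sqrt(var); this should match stdscaler.transform exactly.
manual_z = (tips_quant - stdscaler.mean_) / np.sqrt(stdscaler.var_)
np.allclose(manual_z, tips_quant_z)  # Expected to evaluate to True.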

Note that we can call transform on DataFrames other than tips_quant. We will do this often – fit a transformer on one dataset (training data) and use it to transform other datasets (test data).

In [15]:
stdscaler.transform(tips_quant.sample(5))
Out[15]:
array([[-1.02834171, -0.60019263],
       [ 0.0792487 , -0.60019263],
       [ 0.66681191,  0.45338292],
       [ 0.97410071, -0.60019263],
       [ 0.1377799 ,  0.45338292]])

StandardScaler summary¶

Property Example Description
Initialize with parameters stdscaler = StandardScaler() z-score the data (no parameters)
Fit the transformer stdscaler.fit(X) Compute the mean and SD of X
Transform data in a dataset feat = stdscaler.transform(X_new) z-score X_new with mean and SD of X

Example transformer: OneHotEncoder¶

Let's keep just the categorical columns in tips.

In [16]:
tips_cat = tips[['sex', 'smoker', 'day', 'time']]
tips_cat.head()
Out[16]:
sex smoker day time
0 Female No Sun Dinner
1 Male No Sun Dinner
2 Male No Sun Dinner
3 Male No Sun Dinner
4 Female No Sun Dinner

Like StdScaler, we will need to fit our OneHotEncoder transformer before it can transform anything.

In [17]:
from sklearn.preprocessing import OneHotEncoder
In [18]:
ohe = OneHotEncoder()
ohe.fit(tips_cat)
Out[18]:
OneHotEncoder()

We can look at the unique values (i.e. categories) in each column by using the categories_ attribute:

In [19]:
ohe.categories_
Out[19]:
[array(['Female', 'Male'], dtype=object),
 array(['No', 'Yes'], dtype=object),
 array(['Fri', 'Sat', 'Sun', 'Thur'], dtype=object),
 array(['Dinner', 'Lunch'], dtype=object)]
In [20]:
ohe.transform(tips_cat)
Out[20]:
<244x10 sparse matrix of type '<class 'numpy.float64'>'
	with 976 stored elements in Compressed Sparse Row format>

Since the resulting matrix is sparse – most of its elements are 0 – sklearn uses a more efficient representation than a regular numpy array. That's no issue, though:

In [21]:
ohe.transform(tips_cat).toarray()
Out[21]:
array([[1., 0., 1., ..., 0., 1., 0.],
       [0., 1., 1., ..., 0., 1., 0.],
       [0., 1., 1., ..., 0., 1., 0.],
       ...,
       [0., 1., 0., ..., 0., 1., 0.],
       [0., 1., 1., ..., 0., 1., 0.],
       [1., 0., 1., ..., 1., 1., 0.]])

Notice that the column names from tips_cat are no longer stored anywhere (remember, fit converts the input to a numpy array before proceeding).

We can use the get_feature_names method on ohe to access the names of the one-hot-encoded columns, though (in newer versions of sklearn, this method has been replaced by get_feature_names_out, as the warning below mentions):

In [22]:
ohe.get_feature_names() # x0, x1, x2, and x3 correspond to column names in tips_cat.
/Users/larry/opt/anaconda3/envs/dsc80/lib/python3.8/site-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.
  warnings.warn(msg, category=FutureWarning)
Out[22]:
array(['x0_Female', 'x0_Male', 'x1_No', 'x1_Yes', 'x2_Fri', 'x2_Sat',
       'x2_Sun', 'x2_Thur', 'x3_Dinner', 'x3_Lunch'], dtype=object)
In [23]:
pd.DataFrame(ohe.transform(tips_cat).toarray(), 
             columns=ohe.get_feature_names()) # If we need a DataFrame back, for some reason.
/Users/larry/opt/anaconda3/envs/dsc80/lib/python3.8/site-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.
  warnings.warn(msg, category=FutureWarning)
Out[23]:
x0_Female x0_Male x1_No x1_Yes x2_Fri x2_Sat x2_Sun x2_Thur x3_Dinner x3_Lunch
0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0
1 0.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0
2 0.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0
3 0.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0
4 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0
... ... ... ... ... ... ... ... ... ... ...
239 0.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0
240 1.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0
241 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0
242 0.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0
243 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0

244 rows × 10 columns

Pipelines¶


So far, we've used transformers for feature engineering and models for prediction. We can combine these steps into a single Pipeline.

Pipelines in sklearn¶

  • To instantiate a Pipeline, we must provide a list with zero or more transformers followed by a single model.
    • All "steps" must have fit methods, and all but the last must have transform methods.
    • Template: pl = Pipeline([feat_trans1, feat_trans2, ..., mdl]).

  • Once a Pipeline is instantiated, you can fit all steps (transformers and model) using a single call to the fit method.
pl.fit(X, y)
  • To make predictions using raw, untransformed data, use pl.predict.
  • The actual list we provide Pipeline with must be a list of tuples, where
    • The first element is a "name" (that we choose) for the step.
    • The second element is a transformer or estimator instance.

Our first Pipeline¶

Let's build a Pipeline that:

  • One hot encodes the categorical features in tips.
  • Fits a regression model on the one hot encoded data.
In [24]:
tips_cat = tips[['sex', 'smoker', 'day', 'time']]
tips_cat.head()
Out[24]:
sex smoker day time
0 Female No Sun Dinner
1 Male No Sun Dinner
2 Male No Sun Dinner
3 Male No Sun Dinner
4 Female No Sun Dinner
In [25]:
from sklearn.pipeline import Pipeline
In [26]:
pl = Pipeline([
    ('one-hot', OneHotEncoder()),
    ('lin-reg', LinearRegression())
])

Now that pl is instantiated, we fit it the same way we would fit the individual steps.

In [27]:
pl.fit(tips_cat, tips['tip'])
Out[27]:
Pipeline(steps=[('one-hot', OneHotEncoder()), ('lin-reg', LinearRegression())])

Now, to make predictions using raw data, all we need to do is use pl.predict:

In [28]:
pl.predict([['Female', 'Yes', 'Sat', 'Lunch']])
/Users/larry/opt/anaconda3/envs/dsc80/lib/python3.8/site-packages/sklearn/base.py:441: UserWarning: X does not have valid feature names, but OneHotEncoder was fitted with feature names
  warnings.warn(
Out[28]:
array([2.41792163])
In [29]:
pl.predict(tips_cat.iloc[:5])
Out[29]:
array([3.10415414, 3.27436302, 3.27436302, 3.27436302, 3.10415414])

pl performs both feature transformation and prediction with just a single call to predict!

We can access individual "steps" of a Pipeline through the named_steps attribute:

In [30]:
pl.named_steps
Out[30]:
{'one-hot': OneHotEncoder(), 'lin-reg': LinearRegression()}
In [31]:
pl.named_steps['one-hot'].transform(tips_cat).toarray()
Out[31]:
array([[1., 0., 1., ..., 0., 1., 0.],
       [0., 1., 1., ..., 0., 1., 0.],
       [0., 1., 1., ..., 0., 1., 0.],
       ...,
       [0., 1., 0., ..., 0., 1., 0.],
       [0., 1., 1., ..., 0., 1., 0.],
       [1., 0., 1., ..., 1., 1., 0.]])
In [32]:
pl.named_steps['lin-reg'].coef_
Out[32]:
array([-0.08510444,  0.08510444, -0.04216238,  0.04216238, -0.20256076,
       -0.12962763,  0.13756057,  0.19462781,  0.25168453, -0.25168453])

pl also has a score method (which, for regression, returns $R^2$), the same way a fit LinearRegression instance does:

In [33]:
pl.score(tips_cat, tips['tip'])
Out[33]:
0.027496790201475663
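As a quick check (a sketch; sklearn.metrics.r2_score isn't used elsewhere in this lecture), this value should match the $R^2$ of the Pipeline's predictions computed directly:

# For regressors, score returns R^2, so this should agree with pl.score above.
from sklearn.metrics import r2_score
r2_score(tips['tip'], pl.predict(tips_cat))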

More sophisticated Pipelines¶

  • In the previous example, we one hot encoded every input column. What if we want to perform different transformations on different columns?
  • Solution: Use a ColumnTransformer.
    • Instantiate a ColumnTransformer using a list of tuples, where:
      • The first element is a "name" we choose for the transformer.
      • The second element is a transformer instance (e.g. OneHotEncoder()).
      • The third element is a list of relevant column names.
  • ColumnTransformer is extremely useful, but it was only added to sklearn in 2018!

Planning our first ColumnTransformer¶

In [34]:
from sklearn.compose import ColumnTransformer

Let's perform different transformations on the quantitative and categorical features of tips (note that we are not transforming 'tip').

In [35]:
tips_features = tips.drop('tip', axis=1)
tips_features.head()
Out[35]:
total_bill sex smoker day time size
0 16.99 Female No Sun Dinner 2
1 10.34 Male No Sun Dinner 3
2 21.01 Male No Sun Dinner 3
3 23.68 Male No Sun Dinner 2
4 24.59 Female No Sun Dinner 4
  • We will leave the 'total_bill' column untouched.
  • To the 'size' column, we will apply the Binarizer transformer with a threshold of 2 (big tables vs. small tables).
  • To the categorical columns, we will apply the OneHotEncoder transformer.
  • In essence, we will create a transformer that reproduces the following DataFrame:
size x0_Female x0_Male x1_No x1_Yes x2_Fri x2_Sat x2_Sun x2_Thur x3_Dinner x3_Lunch total_bill
0 0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 16.99
1 1 0.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 10.34
2 1 0.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 21.01
3 0 0.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 23.68
4 1 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 24.59

Building a Pipeline using a ColumnTransformer¶

Let's start by creating our ColumnTransformer.

In [36]:
preproc = ColumnTransformer(
    transformers=[
        ('size', Binarizer(threshold=2), ['size']),
        ('categorical_cols', OneHotEncoder(), ['sex', 'smoker', 'day', 'time'])
    ],
    remainder='passthrough' # Specify what to do with all other columns ('total_bill' here) – drop or passthrough.
)

Now, let's create a Pipeline using preproc as a transformer, and fit it:

In [37]:
pl = Pipeline([
    ('preprocessor', preproc), 
    ('lin-reg', LinearRegression())
])
In [38]:
pl.fit(tips_features, tips['tip'])
Out[38]:
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('size',
                                                  Binarizer(threshold=2),
                                                  ['size']),
                                                 ('categorical_cols',
                                                  OneHotEncoder(),
                                                  ['sex', 'smoker', 'day',
                                                   'time'])])),
                ('lin-reg', LinearRegression())])

Prediction is as easy as calling predict:

In [39]:
tips_features.head()
Out[39]:
total_bill sex smoker day time size
0 16.99 Female No Sun Dinner 2
1 10.34 Male No Sun Dinner 3
2 21.01 Male No Sun Dinner 3
3 23.68 Male No Sun Dinner 2
4 24.59 Female No Sun Dinner 4
In [40]:
# Note that we fit the Pipeline using tips_features, not tips_features.head()!
pl.predict(tips_features.head())
Out[40]:
array([2.73813307, 2.32343202, 3.3700388 , 3.36798392, 3.74755924])

We can even call each transformer in pl['preprocessor'] individually to re-create the transformed DataFrame. (There's no practical reason to do this; it's just for illustration.)

In [41]:
dfs = []
# Each element of transformers_ is a (name, transformer, columns) tuple.
for name, transformer, cols in pl['preprocessor'].transformers_:
    if transformer == 'passthrough':
        # Passed-through columns are referred to by position, not by name.
        df = tips_features.iloc[:, cols]
    else:
        vals = transformer.transform(tips_features[cols])
        columns = cols
        if isinstance(transformer, OneHotEncoder):
            vals = vals.toarray()
            columns = transformer.get_feature_names()
        df = pd.DataFrame(vals, columns=columns)
    dfs.append(df)

pd.concat(dfs, axis=1)
/Users/larry/opt/anaconda3/envs/dsc80/lib/python3.8/site-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.
  warnings.warn(msg, category=FutureWarning)
Out[41]:
size x0_Female x0_Male x1_No x1_Yes x2_Fri x2_Sat x2_Sun x2_Thur x3_Dinner x3_Lunch total_bill
0 0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 16.99
1 1 0.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 10.34
2 1 0.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 21.01
3 0 0.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 23.68
4 1 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 24.59
... ... ... ... ... ... ... ... ... ... ... ... ...
239 1 0.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 29.03
240 0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 27.18
241 0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 22.67
242 0 0.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 17.82
243 0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 18.78

244 rows × 12 columns

Aside: FunctionTransformer¶

A transformer you'll often use as part of a ColumnTransformer is the FunctionTransformer, which enables you to use your own functions on entire columns. Think of it as the sklearn equivalent of apply.

In [42]:
from sklearn.preprocessing import FunctionTransformer
In [43]:
f = FunctionTransformer(np.sqrt)
f.transform([1, 2, 3])
Out[43]:
array([1.        , 1.41421356, 1.73205081])
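For example, here's a sketch (not something we'll rely on later) of how a FunctionTransformer could slot into a ColumnTransformer, log-transforming 'total_bill' while one hot encoding the categorical columns:

# A sketch: log-transform 'total_bill', one hot encode the categorical columns, drop everything else.
log_preproc = ColumnTransformer(
    transformers=[
        ('log_bill', FunctionTransformer(np.log), ['total_bill']),
        ('categorical_cols', OneHotEncoder(), ['sex', 'smoker', 'day', 'time'])
    ],
    remainder='drop'
)
log_preproc.fit_transform(tips)[:2]  # First two rows of the transformed output.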

Summary: Pipelines¶

  • Pipelines are powerful because they allow you to perform feature engineering and training/prediction all through a single object.
  • It's important to understand what each step of a Pipeline does. Neural networks work similarly to sklearn Pipelines, in that they follow a well-defined sequence of steps to make predictions.

Generalization¶

Motivation¶

  • You and Billy are studying for an upcoming exam. You both decide to test your understanding by taking a practice exam.
    • Your logic: If you do well on the practice exam, you should do well on the real exam.
  • You each take the practice exam once and look at the solutions afterwards.
  • Your strategy: Memorize the answers to all practice exam questions, e.g. "Question 1: A; Question 2: C; Question 3: A."
  • Billy's strategy: Learn high-level concepts from the solutions, e.g. "data are NMAR if the likelihood of missingness depends on the missing values themselves."
  • Who will do better on the practice exam? Who will probably do better on the real exam? 🧐

Evaluating the quality of a model¶

  • So far, we've computed the RMSE (and $R^2$) of our fit regression models on the data that we used to fit them, i.e. the training data.
  • We've said that Model A is better than Model B if Model A's RMSE is lower than Model B's RMSE.
    • Remember, our training data is a sample from the data generating process.
    • Just because a model fits the training data well doesn't mean it will generalize and work well on similar, unseen samples!

Example: Overfitting and underfitting¶

Let's collect two samples $\{(x_i, y_i)\}$ from the same data generating process.

In [44]:
np.random.seed(23) # For reproducibility.

def sample_dgp(n=100):
    x = np.linspace(-2, 3, n)
    y = x ** 3 + (np.random.normal(0, 3, size=n))
    return pd.DataFrame({'x': x, 'y': y})

sample_1 = sample_dgp()
sample_2 = sample_dgp()

For now, let's just look at Sample 1. The relationship between $x$ and $y$ is roughly cubic; that is, $y \approx x^3$ (remember, in reality, you won't get to see the DGP).

In [45]:
px.scatter(sample_1, x='x', y='y', title='Sample 1', template=TEMPLATE)

Polynomial regression¶

Let's fit three polynomial models on Sample 1:

  • Degree 1.
  • Degree 3.
  • Degree 25.

The PolynomialFeatures transformer will be helpful here.

In [46]:
from sklearn.preprocessing import PolynomialFeatures
In [47]:
# fit_transform fits the transformer and transforms the input, all in one step.
d3 = PolynomialFeatures(3)
d3.fit_transform(np.array([1, 2, 3, 4, -2]).reshape(-1, 1))
Out[47]:
array([[ 1.,  1.,  1.,  1.],
       [ 1.,  2.,  4.,  8.],
       [ 1.,  3.,  9., 27.],
       [ 1.,  4., 16., 64.],
       [ 1., -2.,  4., -8.]])
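For reference, each of the polynomial models below can be expressed as a Pipeline; here's a sketch of fitting the degree 3 model on Sample 1 (util.train_and_plot does something along these lines internally, though its details may differ):

# A sketch: a degree 3 polynomial model as a Pipeline.
deg3_pl = Pipeline([
    ('poly', PolynomialFeatures(3)),
    ('lin-reg', LinearRegression())
])
deg3_pl.fit(sample_1[['x']], sample_1['y'])
deg3_pl.predict(sample_1[['x']])[:5]  # Predictions for the first five x values in Sample 1.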

Below, we look at our three models' predictions on Sample 1 (which they were trained on).

In [48]:
# Look at the definition of train_and_plot in util.py if you're curious as to how the plotting works.
fig = util.train_and_plot(train_sample=sample_1, test_sample=sample_1, degs=[1, 3, 25])
fig.update_layout(title='Trained on Sample 1, Performance on Sample 1')

The degree 25 polynomial has the lowest RMSE on Sample 1.

How do the same fit polynomials look on Sample 2?

In [49]:
fig = util.train_and_plot(train_sample=sample_1, test_sample=sample_2, degs=[1, 3, 25])
fig.update_layout(title='Trained on Sample 1, Performance on Sample 2')
  • The degree 3 polynomial has the lowest RMSE on Sample 2.
  • Note that we didn't get to see Sample 2 when fitting our models!
  • As such, it seems that the degree 3 polynomial generalizes better to unseen data than the degree 25 polynomial does.

What if we fit a degree 1, degree 3, and degree 25 polynomial on Sample 2 as well?

In [50]:
util.plot_multiple_models(sample_1, sample_2, degs=[1, 3, 25])

Key idea: Degree 25 polynomials seem to vary more when trained on different samples than degree 3 and 1 polynomials do.

Bias and variance¶

The training data we have access to is a sample from the DGP. We are concerned with our model's ability to generalize and work well on different datasets drawn from the same DGP.

Suppose we fit a model $H$ (e.g. a degree 3 polynomial) on several different datasets from a DGP. There are three sources of error that arise:

  • ⭐️ Bias: The expected deviation between a predicted value and an actual value.
    • In other words, for a given $x_i$, how far is $H(x_i)$ from the true $y_i$, on average?
    • Low bias is good! ✅
    • High bias is a sign of underfitting, i.e. that our model is too basic to capture the relationship between our features and response.
  • ⭐️ Model variance ("variance"): The variance of a model's predictions.
    • In other words, for a given $x_i$, what is the variance of $H(x_i)$ across all datasets?
    • Low model variance is good! ✅
    • High model variance is a sign of overfitting, i.e. that our model is too complicated and is prone to fitting to the noise in our training data.
  • Observation variance: The variance due to the random noise in the process we are trying to model (e.g. measurement error). We can't control this without collecting more (or better) data! (The three sources combine as sketched below.)
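For squared error, these three sources combine into the classic decomposition (stated here without proof, as a sketch of what's to come):

$$\text{expected squared error} = \text{observation variance} + (\text{model bias})^2 + \text{model variance}$$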

In the classic bias-variance dartboard picture, suppose:

  • The red bulls-eye represents your true weight and height 🧍.
  • The dark blue darts represent predictions of your weight and height using different models that were fit on the same DGP.

We'd like our models to have both low bias and low variance, but in practice that's hard to achieve!

Summary, next time¶

Summary¶

  • Pipelines in sklearn combine one or more transformers with a single model (estimator), allowing us to perform feature engineering and prediction through a single object.
  • We want to build models that generalize well to unseen data.
    • Models that have high bias are too simple to represent complex relationships in data, and underfit.
    • Models that have high variance are overly complex for the relationships in the data, and vary a lot when fit on different datasets. Such models overfit to the training data.

Next time¶

How do we choose the right model complexity, so that our model has the right "balance" between bias and variance?