In [1]:
import pandas as pd
import numpy as np
import os

import plotly.express as px
import plotly.graph_objects as go
pd.options.plotting.backend = 'plotly'
TEMPLATE = 'seaborn'

import seaborn as sns

from sklearn.linear_model import LinearRegression

Lecture 21 – Feature Engineering¶

DSC 80, Winter 2023¶

📣 Announcements¶

  • Project 4 is due on Thursday, March 9th at 11:59PM.
  • Lab 8 (modeling) is due on Monday, March 6th at 11:59PM.
  • RSVP to the Senior Capstone Showcase on March 15th at hdsishowcase.com.
    • There is no live lecture for DSC 80 on the day of the showcase.

Agenda¶

  • Case study: Restaurant tips 🧑‍🍳.
    • Other methods for evaluating regression models.
  • Feature engineering.
    • One hot encoding.
    • Encoding categorical features, both nominal and ordinal.
    • Quantitative scaling.

Case study: Restaurant tips 🧑‍🍳¶

In [2]:
# The dataset is built into plotly (and seaborn)!
# We shuffle here so that the head of the DataFrame contains rows where smoker is Yes and smoker is No,
# purely for illustration purposes (it doesn't change any of the math).
np.random.seed(1)
tips = px.data.tips().sample(frac=1).reset_index(drop=True)
tips.head()
Out[2]:
total_bill tip sex smoker day time size
0 3.07 1.00 Female Yes Sat Dinner 1
1 18.78 3.00 Female No Thur Dinner 2
2 26.59 3.41 Male Yes Sat Dinner 3
3 14.26 2.50 Male No Thur Lunch 2
4 21.16 3.00 Male No Thur Lunch 2

Model #1: Constant¶

Let's suppose we choose squared loss, in which case the best constant prediction is $h^* = \text{mean}(y)$.

In [3]:
mean_tip = tips['tip'].mean()
mean_tip
Out[3]:
2.99827868852459
In [4]:
# Unfortunately, the code to visualize a scatter plot and a line
# in plotly is not all that concise.
fig = go.Figure()

fig.add_trace(go.Scatter(
    x=tips['total_bill'], 
    y=tips['tip'], 
    mode='markers',
    name='Original Data')
)

fig.add_trace(go.Scatter(
    x=[0, 60],
    y=[mean_tip, mean_tip],
    mode='lines',
    name='Constant Prediction (Mean)'
))

fig.update_layout(showlegend=True, title='Tip vs. Total Bill',
                  xaxis_title='Total Bill', yaxis_title='Tip',
                  template=TEMPLATE)
fig.update_xaxes(range=[0, 60])

Let's compute the RMSE of our constant model's predictions, and store it in a dictionary that we can refer to later on.

In [5]:
def rmse(actual, pred):
    return np.sqrt(np.mean((actual - pred) ** 2))
In [6]:
rmse_dict = {}
rmse_dict['constant tip amount'] = rmse(tips['tip'], mean_tip)
rmse_dict
Out[6]:
{'constant tip amount': 1.3807999538298952}
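
As a quick sanity check – a sketch, not part of the original analysis – no other constant prediction should beat the mean under squared loss:

# Hypothetical alternative constants; the mean should achieve the lowest RMSE.
for h in [2.5, mean_tip, 3.5]:
    print(round(h, 3), rmse(tips['tip'], h))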

Model #2: Simple linear regression using total bill¶

We can fit a simple linear model to predict tips as a function of total bill:

$$\text{predicted tip} = w_0 + w_1 \cdot \text{total bill}$$
In [7]:
model = LinearRegression()
model.fit(X=tips[['total_bill']], y=tips['tip'])
Out[7]:
LinearRegression()
In [8]:
fig.add_trace(go.Scatter(
    x=[0, 60],
    y=model.predict([[0], [60]]),
    mode='lines',
    name='Linear: Total Bill Only'
))
In [9]:
all_preds = model.predict(tips[['total_bill']])
rmse_dict['one feature: total bill'] = rmse(tips['tip'], all_preds)
rmse_dict
Out[9]:
{'constant tip amount': 1.3807999538298952,
 'one feature: total bill': 1.0178504025697377}

The RMSE of our simple linear model is lower than that of our constant model, meaning it does a better job of modeling the training data.

Model #3: Multiple linear regression using total bill and table size¶

Let's try using another feature – table size. Such a model would predict tips using:

$$\text{predicted tip} = w_0 + w_1 \cdot \text{total bill} + w_2 \cdot \text{table size}$$
In [10]:
model_two = LinearRegression()
model_two.fit(X=tips[['total_bill', 'size']], y=tips['tip'])
Out[10]:
LinearRegression()
In [11]:
model_two.predict([[25, 4]])
Out[11]:
array([3.75716934])

What does this model look like?

Plane of best fit ✈️¶

Here, we must draw a 3D scatter plot and plane, with one axis for total bill, one axis for table size, and one axis for tip. The code below does this.

In [12]:
XX, YY = np.mgrid[0:50:2, 0:8:1]
Z = model_two.intercept_ + model_two.coef_[0] * XX + model_two.coef_[1] * YY
plane = go.Surface(x=XX, y=YY, z=Z, colorscale='Oranges')

fig = go.Figure(data=[plane])
fig.add_trace(go.Scatter3d(x=tips['total_bill'], 
                           y=tips['size'], 
                           z=tips['tip'], mode='markers', marker = {'color': '#656DF1'}))

fig.update_layout(scene = dict(
    xaxis_title='Total Bill',
    yaxis_title='Table Size',
    zaxis_title='Tip'),
  title='Tip vs. Total Bill and Table Size',
    width=1000, height=800,
    template=TEMPLATE)

Comparing models, again¶

How does our two-feature linear model stack up to our single feature linear model and our constant model?

In [13]:
rmse_dict['two features'] = rmse(
    tips['tip'], model_two.predict(tips[['total_bill', 'size']])
)
In [14]:
rmse_dict
Out[14]:
{'constant tip amount': 1.3807999538298952,
 'one feature: total bill': 1.0178504025697377,
 'two features': 1.007256127114662}
  • The RMSE of our two-feature model is the lowest of the three models we've looked at so far, but only barely – adding table size to our linear model didn't gain us much.
  • It's also not clear whether table sizes are practically useful in predicting tips.

The .score method of a LinearRegression object¶

Model objects in sklearn that have already been fit have a score method.

In [15]:
model_two.score(tips[['total_bill', 'size']], tips['tip'])
Out[15]:
0.46786930879612565

That doesn't look like the RMSE... what is it? 🤔

Aside: $R^2$¶

  • $R^2$, or the coefficient of determination, is a measure of the quality of a linear fit.
  • There are a few equivalent ways of computing it, assuming your model is linear and has an intercept term:
$$R^2 = \frac{\text{var}(\text{predicted $y$ values})}{\text{var}(\text{actual $y$ values})}$$

$$R^2 = \left[ \text{correlation}(\text{predicted $y$ values}, \text{actual $y$ values}) \right]^2$$
  • Interpretation: $R^2$ is the proportion of variance in $y$ that the linear model explains.
  • In the simple linear regression case, it is the square of the correlation coefficient, $r$.
  • Key idea: $R^2$ ranges from 0 to 1. The closer it is to 1, the better the linear fit is.
    • $R^2$ has no units of measurement, unlike RMSE.

Calculating $R^2$¶

Below, we compute all_preds, which contains model_two's predicted 'tip' for every row in tips.

In [16]:
tips.head()
Out[16]:
total_bill tip sex smoker day time size
0 3.07 1.00 Female Yes Sat Dinner 1
1 18.78 3.00 Female No Thur Dinner 2
2 26.59 3.41 Male Yes Sat Dinner 3
3 14.26 2.50 Male No Thur Lunch 2
4 21.16 3.00 Male No Thur Lunch 2
In [17]:
all_preds = model_two.predict(tips[['total_bill', 'size']])
all_preds[:5]
Out[17]:
array([1.14617248, 2.7952968 , 3.71198575, 2.37623251, 3.01595454])

Method 1: $R^2 = \frac{\text{var}(\text{predicted $y$ values})}{\text{var}(\text{actual $y$ values})}$

In [18]:
np.var(all_preds) / np.var(tips['tip'])
Out[18]:
0.46786930879612576

Method 2: $R^2 = \left[ \text{correlation}(\text{predicted $y$ values}, \text{actual $y$ values}) \right]^2$

Note: By correlation here, we are referring to $r$, the same correlation coefficient you saw in DSC 10.

In [19]:
(np.corrcoef(all_preds, tips['tip'])) ** 2
Out[19]:
array([[1.        , 0.46786931],
       [0.46786931, 1.        ]])

Method 3: LinearRegression.score

In [20]:
model_two.score(tips[['total_bill', 'size']], tips['tip'])
Out[20]:
0.46786930879612565

All three methods provide the same result!

LinearRegression summary¶

Property                       Example                     Description
Initialize model parameters    lr = LinearRegression()     Create (empty) linear regression model
Fit the model to the data      lr.fit(X, y)                Determines regression coefficients
Use model for prediction       lr.predict(X_new)           Uses regression line to make predictions
Evaluate the model             lr.score(X, y)              Calculates the $R^2$ of the LR model
Access model attributes        lr.coef_, lr.intercept_     Accesses the regression coefficients and intercept
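
Putting the table together, here's a minimal end-to-end sketch using the tips data from above:

lr = LinearRegression()                                   # initialize an (empty) model
lr.fit(tips[['total_bill']], tips['tip'])                 # fit: determine the optimal intercept and coefficients
preds = lr.predict(tips[['total_bill']])                  # predict tips for a feature matrix
r_squared = lr.score(tips[['total_bill']], tips['tip'])   # evaluate: the R^2 of the fit
lr.intercept_, lr.coef_                                   # access the fitted parameters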

What's next?¶

In [21]:
tips.head()
Out[21]:
total_bill tip sex smoker day time size
0 3.07 1.00 Female Yes Sat Dinner 1
1 18.78 3.00 Female No Thur Dinner 2
2 26.59 3.41 Male Yes Sat Dinner 3
3 14.26 2.50 Male No Thur Lunch 2
4 21.16 3.00 Male No Thur Lunch 2
  • So far, in our journey to predict 'tip', we've only used the existing numerical features in our dataset, 'total_bill' and 'size'.
  • There's a lot of information in tips that we didn't use – 'sex', 'smoker', 'day', and 'time', for example. We can't use these features in their current form, because they're non-numeric.
  • How do we use categorical features in a regression model?

Feature engineering ⚙️¶

The goal of feature engineering¶

  • Feature engineering is the act of finding transformations that turn raw data into effective quantitative variables.
  • A feature function $\phi$ (phi, pronounced "fea") is a mapping from raw data to $d$-dimensional space, i.e. $\phi: \text{raw data} \rightarrow \mathbb{R}^d$.
    • If two observations $x_i$ and $x_j$ are "similar" in the raw data space, then $\phi(x_i)$ and $\phi(x_j)$ should also be "similar."
  • A "good" choice of features depends on many factors:
    • The kind of data (quantitative, ordinal, nominal).
    • The relationship(s) and association(s) being modeled.
    • The model type (e.g. linear models, decision tree models, neural networks).

One hot encoding¶

  • One hot encoding is a transformation that turns a categorical feature into several binary features.
  • Suppose column 'col' has $N$ unique values, $A_1$, $A_2$, ..., $A_N$. For each unique value $A_i$, we define the following feature function:
$$\phi_i(x) = \left\{\begin{array}{ll}1 & {\rm if\ } x = A_i \\ 0 & {\rm if\ } x\neq A_i \\ \end{array}\right. $$
  • Note that 1 means "yes" and 0 means "no".
  • One hot encoding is also called "dummy encoding", and $\phi(x)$ may also be referred to as an "indicator variable".

Example: One hot encoding 'smoker'¶

For each unique value of 'smoker' in our dataset, we create a separate binary column for just that value. (Remember, 'smoker' is 'Yes' when the table was in the smoking section of the restaurant and 'No' otherwise.)

In [22]:
tips.head()
Out[22]:
total_bill tip sex smoker day time size
0 3.07 1.00 Female Yes Sat Dinner 1
1 18.78 3.00 Female No Thur Dinner 2
2 26.59 3.41 Male Yes Sat Dinner 3
3 14.26 2.50 Male No Thur Lunch 2
4 21.16 3.00 Male No Thur Lunch 2
In [23]:
tips['smoker'].value_counts()
Out[23]:
No     151
Yes     93
Name: smoker, dtype: int64
In [24]:
(tips['smoker'] == 'Yes').astype(int).head()
Out[24]:
0    1
1    0
2    1
3    0
4    0
Name: smoker, dtype: int64
In [25]:
for val in tips['smoker'].unique():
    tips[f'smoker == {val}'] = (tips['smoker'] == val).astype(int)
In [26]:
tips.head()
Out[26]:
total_bill tip sex smoker day time size smoker == Yes smoker == No
0 3.07 1.00 Female Yes Sat Dinner 1 1 0
1 18.78 3.00 Female No Thur Dinner 2 0 1
2 26.59 3.41 Male Yes Sat Dinner 3 1 0
3 14.26 2.50 Male No Thur Lunch 2 0 1
4 21.16 3.00 Male No Thur Lunch 2 0 1

Model #4: Multiple linear regression using total bill, table size, and smoker status¶

Now that we've converted 'smoker' to a numerical variable, we can use it as input in a regression model. Here's the model we'll try to fit:

$$\text{predicted tip} = w_0 + w_1 \cdot \text{total bill} + w_2 \cdot \text{table size} + w_3 \cdot \text{smoker == Yes}$$

Subtlety: There's no need to use both 'smoker == No' and 'smoker == Yes'. If we know the value of one, we already know the value of the other. We can use either one.

In [27]:
model_three = LinearRegression()
model_three.fit(tips[['total_bill', 'size', 'smoker == Yes']], tips['tip'])
Out[27]:
LinearRegression()

The following cell gives us our $w^*$s:

In [28]:
model_three.intercept_, model_three.coef_
Out[28]:
(0.7090155167346053, array([ 0.09388839,  0.18033156, -0.08343255]))

Thus, our trained linear model to predict tips given total bills, table sizes, and smoker status (yes or no) is:

$$\text{predicted tip} = 0.709 + 0.094 \cdot \text{total bill} + 0.180 \cdot \text{table size} - 0.083 \cdot \text{smoker == Yes}$$

Visualizing Model #4¶

Our new fit model is:

$$\text{predicted tip} = 0.709 + 0.094 \cdot \text{total bill} + 0.180 \cdot \text{table size} - 0.083 \cdot \text{smoker == Yes}$$

To visualize our data and linear model, we'd need 4 dimensions:

  • One for total bill
  • One for table size
  • One for 'smoker == Yes'.
  • One for tip.

Humans can't visualize in 4D, but there's a workaround: since 'smoker == Yes' has only two possible values, 1 or 0, we can look at those two cases separately.

Case 1: 'smoker == Yes' is 1, meaning that the table was in the smoking section.

$$\begin{align*} \text{predicted tip} &= 0.709 + 0.094 \cdot \text{total bill} + 0.180 \cdot \text{table size} - 0.083 \cdot 1 \\ &= 0.626 + 0.094 \cdot \text{total bill} + 0.180 \cdot \text{table size} \end{align*}$$

Case 2: 'smoker == Yes' is 0, meaning that the table was not in the smoking section.

$$\begin{align*} \text{predicted tip} &= 0.709 + 0.094 \cdot \text{total bill} + 0.180 \cdot \text{table size} - 0.083 \cdot 0 \\ &= 0.709 + 0.094 \cdot \text{total bill} + 0.180 \cdot \text{table size} \end{align*}$$

Key idea: These are two parallel planes in 3D, with different $z$-intercepts!

Note that the two planes are very close to one another – you'll have to zoom in to see the difference.

In [29]:
XX, YY = np.mgrid[0:50:2, 0:8:1]
Z_0 = model_three.intercept_ + model_three.coef_[0] * XX + model_three.coef_[1] * YY + model_three.coef_[2] * 0
Z_1 = model_three.intercept_ + model_three.coef_[0] * XX + model_three.coef_[1] * YY + model_three.coef_[2] * 1
plane_0 = go.Surface(x=XX, y=YY, z=Z_0, colorscale='Greens')
plane_1 = go.Surface(x=XX, y=YY, z=Z_1, colorscale='Purples')

fig = go.Figure(data=[plane_0, plane_1])

tips_0 = tips[tips['smoker'] == 'No']
tips_1 = tips[tips['smoker'] == 'Yes']

fig.add_trace(go.Scatter3d(x=tips_0['total_bill'], 
                           y=tips_0['size'], 
                           z=tips_0['tip'], mode='markers', marker = {'color': 'green'}))

fig.add_trace(go.Scatter3d(x=tips_1['total_bill'], 
                           y=tips_1['size'], 
                           z=tips_1['tip'], mode='markers', marker = {'color': 'purple'}))

fig.update_layout(scene = dict(
    xaxis_title='Total Bill',
    yaxis_title='Table Size',
    zaxis_title='Tip'),
  title='Tip vs. Total Bill and Table Size (Green = Non-Smoking Section, Purple = Smoking Section)',
    width=1000, height=800,
    showlegend=False,
    template=TEMPLATE)

If we want to visualize in 2D, we need to pick a single feature to place on the $x$-axis.

In [30]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=tips['total_bill'], y=tips['tip'], 
                         mode='markers', name='Original Data'))
fig.add_trace(go.Scatter(x=tips['total_bill'], y=model_three.predict(tips[['total_bill', 'size', 'smoker == Yes']]), 
                         mode='markers', name='Predicted Tips using Total Bill, Table Size, and Smoker Status'))

fig.update_layout(showlegend=True, template=TEMPLATE, title='Tip vs. Total Bill',
                  xaxis_title='Total Bill', yaxis_title='Tip')

This is a linear model, so why don't its predictions look like a straight line?

Comparing Model #4 to earlier models¶

In [31]:
rmse_dict['three features'] = rmse(tips['tip'], 
                                   model_three.predict(tips[['total_bill', 'size', 'smoker == Yes']]))
rmse_dict
Out[31]:
{'constant tip amount': 1.3807999538298952,
 'one feature: total bill': 1.0178504025697377,
 'two features': 1.007256127114662,
 'three features': 1.0064899786822128}

Adding 'smoker == Yes' decreased the training RMSE of our model, but barely.

Reflection¶

In [32]:
tips.head()
Out[32]:
total_bill tip sex smoker day time size smoker == Yes smoker == No
0 3.07 1.00 Female Yes Sat Dinner 1 1 0
1 18.78 3.00 Female No Thur Dinner 2 0 1
2 26.59 3.41 Male Yes Sat Dinner 3 1 0
3 14.26 2.50 Male No Thur Lunch 2 0 1
4 21.16 3.00 Male No Thur Lunch 2 0 1
  • We've one hot encoded 'smoker', but it required a for-loop.
  • Is there an easy way to one hot encode all four categorical columns – 'sex', 'smoker', 'day', and 'time' – at once, without a for-loop?
  • Yes, using sklearn.preprocessing's OneHotEncoder. More on this in the next lecture!
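
In the meantime, pandas' pd.get_dummies gets us most of the way there in one line. A sketch of the pandas alternative (not the OneHotEncoder approach previewed above):

# One hot encode all four categorical columns at once.
pd.get_dummies(tips, columns=['sex', 'smoker', 'day', 'time']).head()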

Example: Predicting ratings ⭐️¶

UID    AGE    STATE    HAS_BOUGHT    REVIEW                        RATING
74     32     NY       True          "Meh."                        ✩✩
42     50     WA       True          "Worked out of the box..."    ✩✩✩✩
57     16     CA       NULL          "Hella tots lit yo..."        ✩
...    ...    ...      ...           ...                           ...
(int)  (int)  (str)    (bool)        (str)                         (str)
  • We want to build a multiple regression model that predicts 'RATING' using the above features.
  • Why can't we build a model right away? What must we do first?
  • Some issues: Missing values, emojis and strings instead of numbers, unrelated columns.

Uninformative features¶

  • 'UID' was likely used to join the user information (e.g., 'AGE' and 'STATE') with some reviews dataset.
  • Even though 'UID's are stored as numbers, the numerical value of a user's 'UID' won't help us predict their 'RATING'.
  • If we include the 'UID' feature, our model will find whatever patterns it can between 'UID's and 'RATING's in the training (observed) data.
    • This will lead to a lower training RMSE.
  • However, since there is truly no relationship between 'UID' and 'RATING', this will lead to worse model performance on unseen data (bad).
  • Transformation: drop 'UID'.
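
In pandas, this transformation is a single method call – a sketch, assuming the table above is stored in a hypothetical DataFrame named reviews:

reviews = reviews.drop(columns=['UID'])  # reviews is hypothetical; drops the uninformative column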

Dropping features¶

There are certain scenarios where manually dropping features might be helpful:

  1. When the features do not contain information associated with the prediction task.
  2. When the feature is not available at prediction time.
  • The goal of building a model to predict 'RATING's is so that we can predict 'RATING's for users who haven't actually made a 'RATING' yet.
  • As such, our model should only depend on features that we would know before the user makes their 'RATING'.
  • For instance, if users only enter 'REVIEW's after entering 'RATING's, we shouldn't use 'REVIEW's as a feature.

Encoding ordinal features¶

UID    AGE    STATE    HAS_BOUGHT    REVIEW                        RATING
74     32     NY       True          "Meh."                        ✩✩
42     50     WA       True          "Worked out of the box..."    ✩✩✩✩
57     16     CA       NULL          "Hella tots lit yo..."        ✩
...    ...    ...      ...           ...                           ...
(int)  (int)  (str)    (bool)        (str)                         (str)

How do we encode the 'RATING' column, an ordinal variable, as a quantitative variable?

  • Transformation: Replace "number of ✩" with "number".
    • This is an ordinal encoding, a transformation that maps ordinal values to integers in a way that preserves order.
    • Example: (freshman, sophomore, junior, senior) -> (0, 1, 2, 3).
    • Important: This transformation preserves "distances" between ratings.
In [33]:
order_values = ['✩', '✩✩', '✩✩✩', '✩✩✩✩', '✩✩✩✩✩']
ordinal_enc = {value: i + 1 for i, value in enumerate(order_values)}
ordinal_enc
Out[33]:
{'✩': 1, '✩✩': 2, '✩✩✩': 3, '✩✩✩✩': 4, '✩✩✩✩✩': 5}
In [34]:
ratings = pd.DataFrame().assign(RATING=['✩', '✩✩', '✩✩✩', '✩✩', '✩✩✩', '✩', '✩✩✩', '✩✩✩✩', '✩✩✩✩✩'])
ratings
Out[34]:
RATING
0 ✩
1 ✩✩
2 ✩✩✩
3 ✩✩
4 ✩✩✩
5 ✩
6 ✩✩✩
7 ✩✩✩✩
8 ✩✩✩✩✩
In [35]:
ratings.replace(ordinal_enc)
Out[35]:
RATING
0 1
1 2
2 3
3 2
4 3
5 1
6 3
7 4
8 5

Encoding nominal features¶

UID    AGE    STATE    HAS_BOUGHT    REVIEW                        RATING
74     32     NY       True          "Meh."                        ✩✩
42     50     WA       True          "Worked out of the box..."    ✩✩✩✩
57     16     CA       NULL          "Hella tots lit yo..."        ✩
...    ...    ...      ...           ...                           ...
(int)  (int)  (str)    (bool)        (str)                         (str)

How do we encode the 'STATE' column, a nominal variable, as a quantitative variable? In other words, how do we turn 'STATE's into meaningful numbers?

  • Question: Why can't we use an ordinal encoding, e.g. NY -> 0, WA -> 1?
  • Answer: There is no inherent ordering to states, e.g. WA is not inherently "more" of anything than NY.
  • We've already seen the correct strategy: one hot encoding.
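
As a sketch with a hypothetical 'STATE' column, pd.get_dummies carries out exactly this one hot encoding:

states = pd.DataFrame({'STATE': ['NY', 'WA', 'CA', 'NY']})  # hypothetical data
pd.get_dummies(states['STATE'])  # one binary column per unique state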

Example: Horsepower 🚗¶

The following dataset, built into the seaborn plotting library, contains various information about (older) cars.

In [36]:
mpg = sns.load_dataset('mpg').dropna()
mpg.head()
Out[36]:
mpg cylinders displacement horsepower weight acceleration model_year origin name
0 18.0 8 307.0 130.0 3504 12.0 70 usa chevrolet chevelle malibu
1 15.0 8 350.0 165.0 3693 11.5 70 usa buick skylark 320
2 18.0 8 318.0 150.0 3436 11.0 70 usa plymouth satellite
3 16.0 8 304.0 150.0 3433 12.0 70 usa amc rebel sst
4 17.0 8 302.0 140.0 3449 10.5 70 usa ford torino

We really do mean old:

In [37]:
mpg['model_year'].value_counts()
Out[37]:
73    40
78    36
76    34
75    30
82    30
70    29
79    29
72    28
77    28
81    28
71    27
80    27
74    26
Name: model_year, dtype: int64

Let's investigate the relationship between 'horsepower' and 'mpg'.

The relationship between 'horsepower' and 'mpg'¶

In [38]:
# Note: To create a simple scatter plot, all you need is
# px.scatter(mpg, x='horsepower', y='mpg').
# We've used the more complicated go.Scatter approach here so that we can add
# other lines on top.

fig = go.Figure()

fig.add_trace(go.Scatter(
    x=mpg['horsepower'], 
    y=mpg['mpg'], 
    mode='markers',
    name='Original Data')
)

fig.update_layout(showlegend=True, title='MPG vs. Horsepower',
                  xaxis_title='Horsepower', yaxis_title='MPG',
                  template=TEMPLATE)
  • It appears that there is a negative association between 'horsepower' and 'mpg', though it's not quite linear.
  • Let's try to fit a simple linear model that uses 'horsepower' to predict 'mpg' and see what happens.

Predicting 'mpg' using 'horsepower'¶

In [39]:
car_model = LinearRegression()
car_model.fit(mpg[['horsepower']], mpg['mpg'])
Out[39]:
LinearRegression()

What do our predictions look like?

In [40]:
fig.add_trace(go.Scatter(
    x=[25, 225],
    y=car_model.predict([[25], [225]]),
    mode='lines',
    name='Predicted MPG using Horsepower'
))

Our regression line doesn't quite capture the curvature in the relationship between 'horsepower' and 'mpg'.

Let's compute the $R^2$ of car_model on our training data, for reference:

In [41]:
car_model.score(mpg[['horsepower']], mpg['mpg'])
Out[41]:
0.6059482578894348

Transformations¶

The Tukey-Mosteller Bulge Diagram helps us pick which transformations to apply to data in order to linearize it.

The bottom-left quadrant appears to match the shape of the scatter plot between 'horsepower' and 'mpg' the best – let's try taking the log of 'horsepower' ($X$).

In [42]:
mpg['log hp'] = np.log(mpg['horsepower'])

What does our data look like now?

In [43]:
log_fig = go.Figure()

log_fig.add_trace(go.Scatter(
    x=mpg['log hp'], 
    y=mpg['mpg'], 
    mode='markers',
    name='Original Data')
)

log_fig.update_layout(showlegend=True, title='MPG vs. log(Horsepower)',
                  xaxis_title='log(Horsepower)', yaxis_title='MPG',
                  template=TEMPLATE)

Predicting 'mpg' using log('horsepower')¶

Let's fit another linear model.

In [44]:
car_model_log = LinearRegression()
car_model_log.fit(mpg[['log hp']], mpg['mpg'])
Out[44]:
LinearRegression()

What do our predictions look like now?

In [45]:
log_fig.add_trace(go.Scatter(
    x=[3.7, 5.5],
    y=car_model_log.predict([[3.7], [5.5]]),
    mode='lines',
    name='Predicted MPG using log(Horsepower)'
))

The fit looks a bit better! How about the $R^2$?

In [46]:
car_model_log.score(mpg[['log hp']], mpg['mpg'])
Out[46]:
0.6683347641192137

Also a bit better!

What do our predictions look like on the original, non-transformed scatter plot? Let's see:

In [47]:
fig.add_trace(
    go.Scatter(
        x=mpg['horsepower'], 
        y=car_model_log.intercept_ + car_model_log.coef_[0] * np.log(mpg['horsepower']),  
        mode='markers', name='Predicted MPG using log(Horsepower)', marker_color='red'
    )
)

Our predictions that used $\log(\text{Horsepower})$ as an input don't fall on a straight line. We shouldn't expect them to; the red dots come from:

$$\text{Predicted MPG} = 108.698 - 18.582 \cdot \log(\text{Horsepower})$$
In [48]:
car_model_log.intercept_, car_model_log.coef_
Out[48]:
(108.69970699574486, array([-18.58218476]))

Quantitative scaling¶

The feature transformations we've discussed so far have converted categorical variables into quantitative variables. Our log transformation, however, turned a quantitative variable into a new quantitative variable; this practice is called quantitative scaling. Common examples include the following, sketched in code after the list:

  • Standardization: $x_i \rightarrow \frac{x_i - \bar{x}}{\sigma_x}$.
  • Linearization via a non-linear transformation: e.g. $\text{log}$ and $\text{sqrt}$. See Lab 8 for more.
  • Discretization: Convert data into percentiles (or more generally, quantiles).
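
All three are sketched below on the 'horsepower' column loaded earlier (the choice of four quantiles is arbitrary, purely for illustration):

hp = mpg['horsepower']
standardized = (hp - hp.mean()) / hp.std()    # standardization (z-scores)
log_hp = np.log(hp)                           # linearization via a log transformation
quartiles = pd.qcut(hp, 4, labels=False)      # discretization into quartiles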

Summary, next time¶

Summary¶

  • The LinearRegression class in sklearn.linear_model provides an implementation of least squares linear regression that works with multiple features.
  • To transform a categorical nominal variable into a quantitative variable, use one hot encoding.
  • To transform a categorical ordinal variable into a quantitative variable, use an ordinal encoding.
  • Quantitative feature transformations allow us to use linear models to model non-linear data.

Next time¶

  • Performing one hot encoding and other feature engineering steps in sklearn directly.
  • Using sklearn Pipelines to engineer features and fit models all through a single object.