Lecture 21 – Feature Engineering

DSC 80, Winter 2023

📣 Announcements

Agenda

Case study: Restaurant tips 🧑‍🍳

Model #1: Constant

Let's suppose we choose squared loss, meaning that $h^* = \text{mean}(y)$.

Let's compute the RMSE of our constant model's predictions, and store it in a dictionary that we can refer to later on.
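A minimal sketch of this computation, using a tiny made-up stand-in for the tips data (the lecture uses seaborn's full tips dataset; these rows and values are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the tips dataset (made-up rows).
tips = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01, 23.68, 24.59],
    'size':       [2, 3, 3, 2, 4],
    'tip':        [1.01, 1.66, 3.50, 3.31, 3.61],
})

def rmse(actual, pred):
    return np.sqrt(np.mean((actual - pred) ** 2))

# With squared loss, the optimal constant prediction h* is the mean tip.
h_star = tips['tip'].mean()

# Store the RMSE so we can compare against later models.
rmse_dict = {}
rmse_dict['constant'] = rmse(tips['tip'], h_star)
```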

Model #2: Simple linear regression using total bill

We can fit a simple linear model to predict tips as a function of total bill:

$$\text{predicted tip} = w_0 + w_1 \cdot \text{total bill}$$
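A sketch of fitting this model with sklearn, again on a made-up stand-in for the tips data (coefficients on the real dataset will differ):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy stand-in for the tips dataset (made-up rows).
tips = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01, 23.68, 24.59],
    'tip':        [1.01, 1.66, 3.50, 3.31, 3.61],
})

model_one = LinearRegression()
# sklearn expects a 2D array of features, hence the double brackets.
model_one.fit(tips[['total_bill']], tips['tip'])

w0 = model_one.intercept_   # w0: the intercept
w1 = model_one.coef_[0]     # w1: the slope on total bill
```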

The RMSE of our simple linear model is lower than that of our constant model, which means it does a better job at modeling the training data than our constant model.

Model #3: Multiple linear regression using total bill and table size

Let's try using another feature – table size. Such a model would predict tips using:

$$\text{predicted tip} = w_0 + w_1 \cdot \text{total bill} + w_2 \cdot \text{table size}$$
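Fitting a two-feature model is a one-line change: pass both columns to fit. A sketch on the same made-up stand-in data:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy stand-in for the tips dataset (made-up rows).
tips = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01, 23.68, 24.59],
    'size':       [2, 3, 3, 2, 4],
    'tip':        [1.01, 1.66, 3.50, 3.31, 3.61],
})

model_two = LinearRegression()
model_two.fit(tips[['total_bill', 'size']], tips['tip'])

# One coefficient per feature, in column order.
w1, w2 = model_two.coef_
w0 = model_two.intercept_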
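Fitting a two-feature model is a one-line change: pass both columns to fit. A sketch on the same made-up stand-in data:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy stand-in for the tips dataset (made-up rows).
tips = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01, 23.68, 24.59],
    'size':       [2, 3, 3, 2, 4],
    'tip':        [1.01, 1.66, 3.50, 3.31, 3.61],
})

model_two = LinearRegression()
model_two.fit(tips[['total_bill', 'size']], tips['tip'])

# One coefficient per feature, in column order.
w1, w2 = model_two.coef_
w0 = model_two.intercept_
```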

What does this model look like?

Plane of best fit ✈️

To visualize this model, we need a 3D scatter plot with a plane: one axis for total bill, one for table size, and one for tip. The code below draws this.

Comparing models, again

How does our two-feature linear model stack up to our single feature linear model and our constant model?

The .score method of a LinearRegression object

Model objects in sklearn that have already been fit have a score method.
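For instance, on a fitted model, calling score with the training features and labels returns a single number (a sketch on made-up stand-in data):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy stand-in for the tips dataset (made-up rows).
tips = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01, 23.68, 24.59],
    'size':       [2, 3, 3, 2, 4],
    'tip':        [1.01, 1.66, 3.50, 3.31, 3.61],
})

X = tips[['total_bill', 'size']]
y = tips['tip']

model_two = LinearRegression().fit(X, y)

# .score returns a single number summarizing fit quality on (X, y).
result = model_two.score(X, y)
```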

That doesn't look like the RMSE... what is it? 🤔

Aside: $R^2$

$$R^2 = \frac{\text{var}(\text{predicted $y$ values})}{\text{var}(\text{actual $y$ values})}$$

$$R^2 = \left[ \text{correlation}(\text{predicted $y$ values}, \text{actual $y$ values}) \right]^2$$

Calculating $R^2$

all_preds contains model_two's predicted 'tip' for every row in tips.

Method 1: $R^2 = \frac{\text{var}(\text{predicted $y$ values})}{\text{var}(\text{actual $y$ values})}$

Method 2: $R^2 = \left[ \text{correlation}(\text{predicted $y$ values}, \text{actual $y$ values}) \right]^2$

Note: By correlation here, we are referring to $r$, the same correlation coefficient you saw in DSC 10.

Method 3: LinearRegression.score

All three methods provide the same result!
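A sketch verifying the equivalence on made-up stand-in data (for least-squares linear models with an intercept, all three computations agree):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy stand-in for the tips dataset (made-up rows).
tips = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01, 23.68, 24.59],
    'size':       [2, 3, 3, 2, 4],
    'tip':        [1.01, 1.66, 3.50, 3.31, 3.61],
})

X = tips[['total_bill', 'size']]
y = tips['tip']
model_two = LinearRegression().fit(X, y)
all_preds = model_two.predict(X)

# Method 1: ratio of variances.
r2_var = np.var(all_preds) / np.var(y)

# Method 2: squared correlation between predictions and actual values.
r2_corr = np.corrcoef(all_preds, y)[0, 1] ** 2

# Method 3: the score method.
r2_score = model_two.score(X, y)
```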

LinearRegression summary

Property Example Description
Initialize model parameters lr = LinearRegression() Create (empty) linear regression model
Fit the model to the data lr.fit(X, y) Determines regression coefficients
Use model for prediction lr.predict(X_new) Uses regression line to make predictions
Evaluate the model lr.score(X, y) Calculates the $R^2$ of the LR model
Access model attributes lr.coef_, lr.intercept_ Accesses the regression coefficients and intercept

What's next?

Feature engineering ⚙️

The goal of feature engineering

One hot encoding

$$\phi_i(x) = \left\{\begin{array}{ll}1 & {\rm if\ } x = A_i \\ 0 & {\rm if\ } x\neq A_i \\ \end{array}\right. $$

Example: One hot encoding 'smoker'

For each unique value of 'smoker' in our dataset, we create a new column that is 1 for rows with that value and 0 otherwise. (Remember, 'smoker' is 'Yes' when the table was in the smoking section of the restaurant and 'No' otherwise.)
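A sketch of two equivalent ways to do this, on made-up stand-in data: pd.get_dummies creates one indicator column per unique value, or we can build the single column we need with a boolean comparison.

```python
import pandas as pd

# Toy stand-in for the smoker column (made-up rows).
tips = pd.DataFrame({
    'smoker': ['No', 'Yes', 'No', 'Yes', 'No'],
})

# One indicator column per unique value of 'smoker'.
one_hot = pd.get_dummies(tips['smoker'])

# Or, directly: 1 when 'smoker' is 'Yes', 0 otherwise.
tips['smoker == Yes'] = (tips['smoker'] == 'Yes').astype(int)
```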

Model #4: Multiple linear regression using total bill, table size, and smoker status

Now that we've converted 'smoker' to a numerical variable, we can use it as input in a regression model. Here's the model we'll try to fit:

$$\text{predicted tip} = w_0 + w_1 \cdot \text{total bill} + w_2 \cdot \text{table size} + w_3 \cdot \text{smoker == Yes}$$

Subtlety: There's no need to use both 'smoker == No' and 'smoker == Yes'. If we know the value of one, we already know the value of the other. We can use either one.

The following cell gives us our $w^*$s:
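A sketch of that cell, on made-up stand-in data. (The coefficients below will not match the lecture's values of 0.709, 0.094, 0.180, and -0.083, which come from fitting on the full tips dataset.)

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy stand-in for the tips dataset (made-up rows).
tips = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01, 23.68, 24.59],
    'size':       [2, 3, 3, 2, 4],
    'smoker':     ['No', 'Yes', 'No', 'Yes', 'No'],
    'tip':        [1.01, 1.66, 3.50, 3.31, 3.61],
})
tips['smoker == Yes'] = (tips['smoker'] == 'Yes').astype(int)

model_four = LinearRegression()
model_four.fit(tips[['total_bill', 'size', 'smoker == Yes']], tips['tip'])

w0 = model_four.intercept_
w1, w2, w3 = model_four.coef_   # one coefficient per feature, in column order
```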

Thus, our trained linear model to predict tips given total bills, table sizes, and smoker status (yes or no) is:

$$\text{predicted tip} = 0.709 + 0.094 \cdot \text{total bill} + 0.180 \cdot \text{table size} - 0.083 \cdot \text{smoker == Yes}$$

Visualizing Model #4

Our new fit model is:

$$\text{predicted tip} = 0.709 + 0.094 \cdot \text{total bill} + 0.180 \cdot \text{table size} - 0.083 \cdot \text{smoker == Yes}$$

To visualize our data and linear model, we'd need 4 dimensions:

Humans can't visualize in 4D, but there's a workaround: 'smoker == Yes' has only two possible values, 1 or 0, so let's look at those two cases separately.

Case 1: 'smoker == Yes' is 1, meaning that the table was in the smoking section.

$$\begin{align*} \text{predicted tip} &= 0.709 + 0.094 \cdot \text{total bill} + 0.180 \cdot \text{table size} - 0.083 \cdot 1 \\ &= 0.626 + 0.094 \cdot \text{total bill} + 0.180 \cdot \text{table size} \end{align*}$$

Case 2: 'smoker == Yes' is 0, meaning that the table was not in the smoking section.

$$\begin{align*} \text{predicted tip} &= 0.709 + 0.094 \cdot \text{total bill} + 0.180 \cdot \text{table size} - 0.083 \cdot 0 \\ &= 0.709 + 0.094 \cdot \text{total bill} + 0.180 \cdot \text{table size} \end{align*}$$

Key idea: These are two parallel planes in 3D, with different $z$-intercepts!

Note that the two planes are very close to one another – you'll have to zoom in to see the difference.

If we want to visualize in 2D, we need to pick a single feature to place on the $x$-axis.

Even though this is a linear model, why don't its predictions fall on a straight line?

Comparing Model #4 to earlier models

Adding 'smoker == Yes' decreased the training RMSE of our model, but barely.

Reflection

Example: Predicting ratings ⭐️

UID    AGE    STATE   HAS_BOUGHT   REVIEW                       | RATING
74     32     NY      True         "Meh."                       | ✩✩
42     50     WA      True         "Worked out of the box..."   | ✩✩✩✩
57     16     CA      NULL         "Hella tots lit yo..."       |
...    ...    ...     ...          ...                          | ...
(int)  (int)  (str)   (bool)       (str)                        | (str)

Uninformative features

Dropping features

There are certain scenarios where manually dropping features might be helpful:

  1. When the features do not contain information associated with the prediction task.
  2. When the feature is not available at prediction time.

Encoding ordinal features

UID    AGE    STATE   HAS_BOUGHT   REVIEW                       | RATING
74     32     NY      True         "Meh."                       | ✩✩
42     50     WA      True         "Worked out of the box..."   | ✩✩✩✩
57     16     CA      NULL         "Hella tots lit yo..."       |
...    ...    ...     ...          ...                          | ...
(int)  (int)  (str)   (bool)       (str)                        | (str)

How do we encode the 'RATING' column, an ordinal variable, as a quantitative variable?
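Since an ordinal variable's categories have a natural order, we can map each category to its position in that order. A sketch on made-up stand-in rows (the star strings and mapping below are illustrative):

```python
import pandas as pd

# Toy stand-in for the ratings data (made-up rows).
df = pd.DataFrame({'RATING': ['✩✩', '✩✩✩✩', '✩', '✩✩✩✩✩']})

# Ordinal encoding: map each category to its position in the natural order.
order = {'✩': 1, '✩✩': 2, '✩✩✩': 3, '✩✩✩✩': 4, '✩✩✩✩✩': 5}
df['RATING'] = df['RATING'].map(order)
```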

Encoding nominal features

UID    AGE    STATE   HAS_BOUGHT   REVIEW                       | RATING
74     32     NY      True         "Meh."                       | ✩✩
42     50     WA      True         "Worked out of the box..."   | ✩✩✩✩
57     16     CA      NULL         "Hella tots lit yo..."       |
...    ...    ...     ...          ...                          | ...
(int)  (int)  (str)   (bool)       (str)                        | (str)

How do we encode the 'STATE' column, a nominal variable, as a quantitative variable? In other words, how do we turn 'STATE's into meaningful numbers?
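Nominal categories have no natural order, so assigning them positions would invent a meaningless ranking; instead, we one hot encode them. A sketch on made-up stand-in rows:

```python
import pandas as pd

# Toy stand-in for the STATE column (made-up rows).
df = pd.DataFrame({'STATE': ['NY', 'WA', 'CA', 'NY']})

# One hot encoding: one indicator column per unique state.
state_dummies = pd.get_dummies(df['STATE']).astype(int)
```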

Example: Horsepower 🚗

The following dataset, built into the seaborn plotting library, contains various information about (older) cars.

We really do mean old:

Let's investigate the relationship between 'horsepower' and 'mpg'.

The relationship between 'horsepower' and 'mpg'

Predicting 'mpg' using 'horsepower'

What do our predictions look like?

Our regression line doesn't quite capture the curvature in the relationship between 'horsepower' and 'mpg'.

Let's compute the $R^2$ of car_model on our training data, for reference:

Transformations

The Tukey-Mosteller Bulge Diagram helps us pick which transformations to apply to data in order to linearize it.

The bottom-left quadrant appears to match the shape of the scatter plot between 'horsepower' and 'mpg' the best – let's try taking the log of 'horsepower' ($X$).

What does our data look like now?

Predicting 'mpg' using log('horsepower')

Let's fit another linear model.
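A sketch of that fit on made-up stand-in rows (the lecture uses seaborn's mpg dataset, so the coefficients below are illustrative only): transform the feature with np.log, then fit as usual.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy stand-in for the mpg dataset (made-up rows).
cars = pd.DataFrame({
    'horsepower': [130, 165, 150, 95, 70, 46],
    'mpg':        [18.0, 15.0, 16.0, 24.0, 29.0, 43.0],
})

# Transform the feature, then fit a linear model on the transformed values.
X_log = np.log(cars[['horsepower']])
car_model_log = LinearRegression().fit(X_log, cars['mpg'])

# Predictions are linear in log(horsepower), not in horsepower itself.
preds = car_model_log.predict(X_log)
```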

What do our predictions look like now?

The fit looks a bit better! How about the $R^2$?

Also a bit better!

What do our predictions look like on the original, non-transformed scatter plot? Let's see:

Our predictions that used $\log(\text{Horsepower})$ as an input don't fall on a straight line. We shouldn't expect them to; the red dots come from:

$$\text{Predicted MPG} = 108.698 - 18.582 \cdot \log(\text{Horsepower})$$

Quantitative scaling

So far, the feature transformations we've discussed have involved converting categorical variables into quantitative ones. Our log transformation, however, turned a quantitative variable into a new quantitative variable; this practice is called quantitative scaling.

Summary, next time

Summary

Next time