In [1]:

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-white')
plt.rc('figure', dpi=100, figsize=(7, 5))
plt.rc('font', size=12)
```

- Lab 7 is due **today at 11:59PM**.
- Project 4 has been released!
    - The checkpoint is due **this Thursday at 11:59PM**.
    - The full project is due **Thursday, May 26th at 11:59PM**.
    - Start early!
- 📣 Come to the DSC **Town Hall**, tomorrow from 3-5PM in the SDSC Auditorium.

**Feature engineering** is the act of finding **transformations** that turn data into effective **quantitative variables**. A feature function $\phi$ (phi, pronounced "fee") is a mapping from raw data to $d$-dimensional space, i.e. $\phi: \text{raw data} \rightarrow \mathbb{R}^d$.

- If two observations $x_i$ and $x_j$ are "similar" in the raw data space, then $\phi(x_i)$ and $\phi(x_j)$ should also be "similar."

A "good" choice of features depends on many factors:

- The kind of data (quantitative, ordinal, nominal),
- The relationship(s) and association(s) being modeled,
- The model type (e.g. linear models, decision tree models, neural networks).
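As a minimal sketch (with hypothetical field names, not tied to any dataset below), a feature function can be written as a plain function mapping a raw record to a vector in $\mathbb{R}^d$:

```
import numpy as np

def phi(record):
    # Maps a raw record (a dict with hypothetical fields) to a vector in R^2:
    # one quantitative field kept as-is, one boolean encoded as 0/1.
    return np.array([record['AGE'], int(record['HAS_BOUGHT'])])

phi({'AGE': 32, 'HAS_BOUGHT': True})
```

Two records that are "similar" as raw dicts map to nearby vectors under this $\phi$.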

We want to build a multiple regression model that uses the features (`'UID'`, `'AGE'`, `'STATE'`, `'HAS_BOUGHT'`, and `'REVIEW'`) below to predict `'RATING'`.

- Why can't we build a model right away?
- What must we do so that we can build a model?

UID | AGE | STATE | HAS_BOUGHT | REVIEW | RATING
---|---|---|---|---|---
74 | 32 | NY | True | "Meh." | ✩✩
42 | 50 | WA | True | "Worked out of the box..." | ✩✩✩✩
57 | 16 | CA | NULL | "Hella tots lit yo..." | ✩
... | ... | ... | ... | ... | ...
(int) | (int) | (str) | (bool) | (str) | (str)

- Issues: Missing values, emojis and strings instead of numbers, unrelated columns.

- `'UID'` was likely used to join the user information (e.g., `'AGE'` and `'STATE'`) with some `reviews` dataset.
- Even though `'UID'`s are stored as **numbers**, the numerical value of a user's `'UID'` won't help us predict their `'RATING'`.
- If we include the `'UID'` feature, our model will find whatever patterns it can between `'UID'`s and `'RATING'`s in the training (observed) data.
    - This will lead to a lower training RMSE.
    - However, since there is truly no relationship between `'UID'` and `'RATING'`, this will lead to **worse** model performance on unseen data (bad).

**Transformation:** drop `'UID'`.

There are certain scenarios where manually dropping features might be helpful:

- When the features **do not contain information** associated with the prediction task.
- When the feature is **not available at prediction time**.

- The goal of building a model to predict `'RATING'`s is so that we can **predict** `'RATING'`s for users who haven't actually made a `'RATING'` yet.
- As such, our model should only depend on features that we would know before the user makes their `'RATING'`.
- For instance, if users only enter `'REVIEW'`s after entering `'RATING'`s, we shouldn't use `'REVIEW'`s as a feature.

UID | AGE | STATE | HAS_BOUGHT | REVIEW | RATING
---|---|---|---|---|---
74 | 32 | NY | True | "Meh." | ✩✩
42 | 50 | WA | True | "Worked out of the box..." | ✩✩✩✩
57 | 16 | CA | NULL | "Hella tots lit yo..." | ✩
... | ... | ... | ... | ... | ...
(int) | (int) | (str) | (bool) | (str) | (str)

How do we encode the `'RATING'` column, an ordinal variable, as a quantitative variable?

**Transformation:** Replace "number of ✩" with "number".

- This is an **ordinal encoding**, a transformation that maps ordinal values to the positive integers in a way that preserves order.
    - Example: (freshman, sophomore, junior, senior) -> (0, 1, 2, 3).
- **Important:** This transformation preserves "distances" between ratings.

In [2]:

```
order_values = ['✩', '✩✩', '✩✩✩', '✩✩✩✩', '✩✩✩✩✩']
ordinal_enc = {y:x + 1 for (x, y) in enumerate(order_values)}
ordinal_enc
```

Out[2]:

{'✩': 1, '✩✩': 2, '✩✩✩': 3, '✩✩✩✩': 4, '✩✩✩✩✩': 5}

In [3]:

```
ratings = pd.DataFrame().assign(RATING=['✩', '✩✩', '✩✩✩', '✩✩', '✩✩✩', '✩', '✩✩✩', '✩✩✩✩', '✩✩✩✩✩'])
ratings
```

Out[3]:

RATING | |
---|---|

0 | ✩ |

1 | ✩✩ |

2 | ✩✩✩ |

3 | ✩✩ |

4 | ✩✩✩ |

5 | ✩ |

6 | ✩✩✩ |

7 | ✩✩✩✩ |

8 | ✩✩✩✩✩ |

In [4]:

```
ratings.replace(ordinal_enc)
```

Out[4]:

RATING | |
---|---|

0 | 1 |

1 | 2 |

2 | 3 |

3 | 2 |

4 | 3 |

5 | 1 |

6 | 3 |

7 | 4 |

8 | 5 |

UID | AGE | STATE | HAS_BOUGHT | REVIEW | RATING
---|---|---|---|---|---
74 | 32 | NY | True | "Meh." | ✩✩
42 | 50 | WA | True | "Worked out of the box..." | ✩✩✩✩
57 | 16 | CA | NULL | "Hella tots lit yo..." | ✩
... | ... | ... | ... | ... | ...
(int) | (int) | (str) | (bool) | (str) | (str)

How do we encode the `'STATE'` column, a nominal variable, as a quantitative variable?

- In other words, how do we turn `'STATE'`s into meaningful numbers?

**Idea:** Ordinal encoding. AL -> 1, AK -> 2, ..., WY -> 50.

- ❌ An ordinal encoding is **not** appropriate, because `'STATE'` is not an ordinal variable – Wyoming is not inherently "more" of anything than Alabama.

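To see the problem concretely, here is a tiny sketch of the artifact such an encoding introduces (hypothetical codes, following the scheme above):

```
# Hypothetical ordinal codes for three states, per the scheme above.
codes = {'AL': 1, 'AK': 2, 'WY': 50}

# Under this encoding, AL appears far "closer" to AK than to WY --
# a purely numeric artifact with no real-world meaning for nominal data.
gap_to_ak = abs(codes['AL'] - codes['AK'])
gap_to_wy = abs(codes['AL'] - codes['WY'])
```

A model would treat these fabricated distances as meaningful, which is exactly why ordinal encoding is wrong here.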
**Another idea:** Use one binary variable per state, i.e. `'is_AL'`, `'is_AK'`, ..., `'is_WY'`.

- One-hot encoding is a transformation that turns a categorical feature into several binary features.
- Suppose column `'col'` has $N$ unique values, $A_1$, $A_2$, ..., $A_N$. For each unique value $A_i$, we define the following **feature function**:

$$\phi_i(x) = \begin{cases} 1 & \text{if } x = A_i \\ 0 & \text{otherwise} \end{cases}$$

- Note that 1 means "yes" and 0 means "no".
- One-hot encoding is also called "dummy encoding", and $\phi(x)$ may also be referred to as an "indicator variable".

One-hot encoding `'STATE'`

- For each unique value of `'STATE'` in our dataset, we must create a column for just that `'STATE'`.

- Observations:
    - In any given row, only one of the one-hot-encoded columns will contain a 1; the rest will contain a 0.
    - Most of the values in the one-hot-encoded columns are 0, i.e. these columns are **sparse**.

Let's perform the one-hot encoding ourselves.

In [5]:

```
states = pd.DataFrame().assign(STATE=['NY', 'WA', 'CA', 'NY', 'OR'])
states
```

Out[5]:

STATE | |
---|---|

0 | NY |

1 | WA |

2 | CA |

3 | NY |

4 | OR |

First, we need to access all **unique** values of `'STATE'`.

In [6]:

```
unique_states = states['STATE'].unique()
unique_states
```

Out[6]:

array(['NY', 'WA', 'CA', 'OR'], dtype=object)

How might we create one-hot-encoded columns manually?

In [7]:

```
states['STATE'] == unique_states[0]
```

Out[7]:

0     True
1    False
2    False
3     True
4    False
Name: STATE, dtype: bool

In [8]:

```
pd.Series(states['STATE'] == unique_states[1], dtype=int)
```

Out[8]:

0    0
1    1
2    0
3    0
4    0
Name: STATE, dtype: int64

In [9]:

```
def ohe_states(states_ser):
    return pd.Series(states_ser == unique_states, index=unique_states, dtype=int)
```

In [10]:

```
states['STATE'].apply(ohe_states)
```

Out[10]:

NY | WA | CA | OR | |
---|---|---|---|---|

0 | 1 | 0 | 0 | 0 |

1 | 0 | 1 | 0 | 0 |

2 | 0 | 0 | 1 | 0 |

3 | 1 | 0 | 0 | 0 |

4 | 0 | 0 | 0 | 1 |

Soon, we will learn how to "automatically" perform one-hot encoding.
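For reference, pandas already provides one such automatic tool, `pd.get_dummies` (shown here on the same `states` data; this is not necessarily the method we'll formally use going forward):

```
import pandas as pd

states = pd.DataFrame().assign(STATE=['NY', 'WA', 'CA', 'NY', 'OR'])

# One binary column per unique value of 'STATE', sorted alphabetically.
ohe = pd.get_dummies(states['STATE'])
ohe
```

Note that `pd.get_dummies` orders the resulting columns alphabetically, unlike our manual approach, which used the order of first appearance.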

The feature transformations we've discussed so far have involved converting **categorical** variables into **quantitative** variables. However, at times we'll need to transform **quantitative** variables into new **quantitative** variables.

- **Standardization**: $x_i \rightarrow \frac{x_i - \bar{x}}{\sigma_x}$.
- **Linearization via a non-linear transformation**: e.g. $\log$ and $\sqrt{\cdot}$. See Lab 8 for more.
- **Discretization**: Convert data into percentiles (or more generally, quantiles).

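A quick sketch of two of these transformations with pandas (toy data, not the dataset below):

```
import pandas as pd

x = pd.Series([10, 20, 30, 40, 50])

# Standardization: subtract the mean, divide by the standard deviation.
standardized = (x - x.mean()) / x.std(ddof=0)

# Discretization: map each value to its quartile (labeled 0 through 3).
quartiles = pd.qcut(x, 4, labels=False)
```

After standardization, the series has mean 0 and standard deviation 1 regardless of the original units.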
- **Data generating process**: The real-world phenomenon that we are interested in studying.
    - *Example:* Every year, city employees are hired and fired, earn salaries and benefits, etc.
    - Unless we work for the city, we can't observe this process directly.
- **Model**: A theory about the data generating process.
    - *Example:* If an employee is $X$ years older than average, then they will make \$100,000 in salary.
- **Fit model**: A model that is learned from a particular set of observations, i.e. training data.
    - *Example:* If an employee is 5 years older than average, they will make \$100,000 in salary.
    - How is this estimate determined? What makes it "good"?

- To make accurate **predictions** regarding unseen data drawn from the data generating process:
    - Given this dataset of past UCSD data science students' salaries, can we predict your future salary? (regression)
    - Given this dataset of emails, can we predict if this new email is spam or not? (binary classification)
    - Given this dataset of images, can we predict if this new image is of a dog, cat, or zebra? (multiclass classification)
- To make **inferences** about the structure of the data generating process (i.e. to understand complex phenomena):
    - Is there a linear relationship between the heights of children and the heights of their biological fathers?
    - The weights of smoking and non-smoking mothers' babies in my *sample* are different – how *confident* am I that this difference exists in the *population*?
- Of the two focuses of models, we will focus on **prediction**.
- In the above taxonomy, we will focus on **supervised learning**.

- The modeling techniques we are most familiar with (e.g. linear regression) require:
- Quantitative inputs.
- Strong relationships between inputs ($X$) and outputs ($Y$).

- Often, these properties don't exist in the raw data.

- That's where feature engineering comes into play.

In [11]:

```
tips = sns.load_dataset('tips')
tips
```

Out[11]:

total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|

0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |

1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |

2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |

3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |

4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |

... | ... | ... | ... | ... | ... | ... | ... |

239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 |

240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 |

241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 |

242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |

243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 |

244 rows × 7 columns

**Goal:** Given various information about a table, we want to predict the **tip** that a server will earn.

- Why might a server be interested in doing this?
    - To determine which tables are likely to tip the most (inference).
    - To understand the relationship between diners and tips (inference).
    - To predict earnings over the next month (prediction).

- The most natural feature to look at first is `'total_bill'`.
- As such, we should explore the relationship between `'total_bill'` and `'tip'`, as well as the distributions of both columns individually.

In [12]:

```
sns.lmplot(data=tips, x='total_bill', y='tip');
```

In [13]:

```
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))
sns.histplot(tips['total_bill'], kde=True, ax=ax1)
sns.histplot(tips['tip'], kde=True, ax=ax2);
```

`'total_bill'` | `'tip'`
---|---
Right skewed | Right skewed
Mean around \$20 | Mean around \$3
Mode around \$15 | Possibly bimodal?
No large bills | Large outliers?

Let's start simple. Suppose our model assumes every tip is given by a constant dollar amount:

$$\text{tip} = h^{\text{true}}$$

- **Model:** There is a single tip amount $h^{\text{true}}$ that all customers pay.
    - Correct? No!
    - Useful? Perhaps. An estimate of $h^{\text{true}}$, denoted by $h^*$, can allow us to predict future tips.
- The true parameter $h^{\text{true}}$ is determined by the universe (i.e. the data generating process).
    - We can't observe the parameter; we need to **estimate it from the data**.
    - Hence, our estimate depends on our dataset!

"...but some are useful."

> "Since all models are wrong the scientist cannot obtain a 'correct' one by excessive elaboration. On the contrary following William of Occam he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity."

> "Since all models are wrong the scientist must be alert to what is importantly wrong. It is inappropriate to be concerned about mice when there are tigers abroad."

-- George Box

There are several ways we *could* estimate $h^{\text{true}}$.

- We could use domain knowledge (e.g. everyone clicks the \$1 tip option when buying coffee).

From DSC 40A, we already know one way:

- **Choose a loss function**, which measures how "good" a single prediction is.
- **Minimize empirical risk**, to find the best estimate for the dataset that we have.

Depending on which loss function we choose, we will end up with a different $h^*$ (which is an estimate of $h^{\text{true}}$).

- If we choose **squared loss**, then our empirical risk is **mean squared error**:

$$R_{\text{sq}}(h) = \frac{1}{n} \sum_{i = 1}^n (y_i - h)^2$$

- If we choose **absolute loss**, then our empirical risk is **mean absolute error**:

$$R_{\text{abs}}(h) = \frac{1}{n} \sum_{i = 1}^n |y_i - h|$$

Let's suppose we choose squared loss, meaning that $h^* = \text{mean}(y)$.

In [14]:

```
mean_tip = tips['tip'].mean()
mean_tip
```

Out[14]:

2.9982786885245902

Recall that **minimizing MSE is the same as minimizing RMSE**; however, RMSE has the added benefit of being in the same units as our data. We will compute and keep track of the RMSEs of the different models we build (as we did last lecture).

In [15]:

```
def rmse(actual, pred):
    return np.sqrt(np.mean((actual - pred) ** 2))
```

In [16]:

```
rmse(tips['tip'], mean_tip)
```

Out[16]:

1.3807999538298958

In [17]:

```
rmse_dict = {}
rmse_dict['constant, tip'] = rmse(tips['tip'], mean_tip)
```

Since the mean minimizes RMSE for the constant model, it is **impossible** to change the `mean_tip` argument above to another number and yield a **lower** RMSE.

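We can sanity-check this claim numerically on a small synthetic array (not the `tips` data): perturbing the constant prediction in either direction never lowers the RMSE.

```
import numpy as np

def rmse(actual, pred):
    return np.sqrt(np.mean((actual - pred) ** 2))

y = np.array([1.0, 2.0, 3.0, 2.0, 5.0])
h_star = y.mean()

# RMSE of the mean vs. RMSEs of several other constant predictions.
best = rmse(y, h_star)
others = [rmse(y, h) for h in [h_star - 0.5, h_star + 0.5, 0, 10]]
```

Every value in `others` is at least as large as `best`.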
If we are going to make a constant prediction, a more natural constant to predict might be the tip **percentage**.

- We know this from domain knowledge: in the US (where this dataset was collected), it is customary to tip a percentage.

We can **derive** the `'pct_tip'` feature ourselves using existing information:

$$\texttt{pct\_tip} = \frac{\texttt{tip}}{\texttt{total\_bill}}$$

- This is an example of quantitative scaling.

In [18]:

```
tips = tips.assign(pct_tip=(tips['tip'] / tips['total_bill']))
sns.histplot(tips['pct_tip'], kde=True);
```

Our model is now:

$$\text{tip} = h^{\text{true}} \cdot \text{total bill}$$

- $h^{\text{true}}$ is the "true fixed tip percentage" that exists in the universe, which we can't observe.
- To come up with an estimate of $h^{\text{true}}$, we choose a loss function and minimize empirical risk on our observed dataset.
- Again, we'll choose squared loss, so our estimate $h^*$ will be the **mean tip percentage** in `tips`.

In [19]:

```
mean_pct_tip = tips['pct_tip'].mean()
mean_pct_tip
```

Out[19]:

0.16080258172250478

- Computing the RMSE of this model is a bit more nuanced.
- To fairly compare this model to the previous model, we must still predict `'tip'`, but above we have predicted `'pct_tip'`.
- **Key idea:** `'pct_tip'` is a **multiplier** that we apply to `'total_bill'` to get `'tip'`. That is:

$$\text{predicted tip} = h^* \cdot \text{total bill}$$

In [20]:

```
tips['total_bill'] * mean_pct_tip
```

Out[20]:

0      2.732036
1      1.662699
2      3.378462
3      3.807805
4      3.954135
         ...
239    4.668099
240    4.370614
241    3.645395
242    2.865502
243    3.019872
Name: total_bill, Length: 244, dtype: float64

In [21]:

```
rmse_dict['constant, pct_tip'] = rmse(tips['tip'], tips['total_bill'] * mean_pct_tip)
rmse_dict
```

Out[21]:

{'constant, tip': 1.3807999538298958, 'constant, pct_tip': 1.146820820140744}

In [22]:

```
mean_pct_tip
```

Out[22]:

0.16080258172250478

In [23]:

```
rmse_dict
```

Out[23]:

{'constant, tip': 1.3807999538298958, 'constant, pct_tip': 1.146820820140744}

- A constant prediction of 16.08\% yields a lower RMSE than a constant prediction of \$3.
- However, both RMSEs are over \$1, which is relatively high compared to the mean tip amount of \$3.
- How can we bring this RMSE down?

**Model:** Tips are made according to a linear function:

$$\text{tip} = w_0 + w_1 \cdot \text{total bill}$$

By choosing a loss function and minimizing empirical risk, we can find $w_0^*$ and $w_1^*$.

- This process is **fitting** our model to the data.
- $w_0^*$ and $w_1^*$ can be thought of as estimates of the true intercept and slope that exist in nature.
In order to use a linear model, the data should have a linear association.
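One quick diagnostic is the correlation coefficient $r$: values near $\pm 1$ suggest a strong linear association. A sketch on synthetic bill/tip data (synthetic so the snippet doesn't depend on seaborn's dataset download):

```
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
bills = rng.uniform(5, 50, size=100)
# Tips that are roughly 15% of the bill, plus noise.
tip_amounts = 0.15 * bills + rng.normal(0, 1, size=100)

df = pd.DataFrame({'total_bill': bills, 'tip': tip_amounts})

# Pearson correlation between the two columns.
r = df['total_bill'].corr(df['tip'])
```

Keep in mind that $r$ only measures *linear* association; a scatter plot (as below) is still worth drawing.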

In [24]:

```
sns.lmplot(data=tips, x='total_bill', y='tip');
```

Again, we will learn more about `sklearn` in the coming lectures.

In [25]:

```
from sklearn.linear_model import LinearRegression
```

In [26]:

```
lr = LinearRegression()
lr.fit(X=tips[['total_bill']], y=tips['tip'])
```

Out[26]:

LinearRegression()

In [27]:

```
lr.intercept_, lr.coef_
```

Out[27]:

(0.9202696135546735, array([0.10502452]))

Note that the above coefficients state that the "best way" (according to squared loss) to make tip predictions using a linear model is to assume people:

- Tip ~\$0.92 up front, and
- ~10.5\% of every dollar thereafter.

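As a quick check of what these coefficients mean, a hypothetical \$20 bill gets a predicted tip of about $0.92 + 0.105 \cdot 20 \approx \$3.02$:

```
# Coefficients from the fitted model above.
intercept, slope = 0.9202696135546735, 0.10502452

# Predicted tip for a hypothetical $20.00 bill.
pred = intercept + slope * 20.0
```

This matches what `lr.predict` would return for a bill of \$20.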
In [28]:

```
preds = lr.predict(X=tips[['total_bill']])
rmse_dict['linear model'] = rmse(tips['tip'], preds)
rmse_dict
```

Out[28]:

{'constant, tip': 1.3807999538298958, 'constant, pct_tip': 1.146820820140744, 'linear model': 1.0178504025697377}

- We built three models:
    - A constant model: $\text{predicted tip} = h^*$.
    - A linear model with no intercept: $\text{predicted tip} = w^* \cdot \text{total bill}$.
        - This was the model that involved tip percentage.
    - A linear model with an intercept: $\text{predicted tip} = w_0^* + w_1^* \cdot \text{total bill}$.
- As we added more features, our RMSEs decreased.
    - This was guaranteed to happen, since we were only looking at our training data.
- It is not clear that the final linear model is actually "better"; it doesn't seem to **reflect reality** better than the previous models.

There's a lot of information in `tips` that we didn't use – `'sex'`, `'day'`, and `'time'`, for example. How might we **encode** this information?

In [29]:

```
tips
```

Out[29]:

total_bill | tip | sex | smoker | day | time | size | pct_tip | |
---|---|---|---|---|---|---|---|---|

0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 0.059447 |

1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 0.160542 |

2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 0.166587 |

3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 0.139780 |

4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 0.146808 |

... | ... | ... | ... | ... | ... | ... | ... | ... |

239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 | 0.203927 |

240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 | 0.073584 |

241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 | 0.088222 |

242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 | 0.098204 |

243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 | 0.159744 |

244 rows × 8 columns

- To transform a categorical ordinal variable into a quantitative variable, use an **ordinal** encoding.
- To transform a categorical nominal variable into a quantitative variable, use a **one-hot** encoding.
- A model is an assumption about a data generating process.
- Models can be used for both inference and prediction.
- All models are wrong (because they are oversimplifications of reality), but even simple models can be useful in practice.

**Next time:** Finish the `tips` example. Start formally learning `sklearn`.