from dsc80_utils import *
# The dataset is built into plotly (and seaborn)!
# We shuffle here so that the head of the DataFrame contains rows where smoker is Yes and smoker is No,
# purely for illustration purposes (it doesn't change any of the math).
np.random.seed(1)
tips = px.data.tips().sample(frac=1).reset_index(drop=True)
Announcements 📣¶
- Lab 7 is due tomorrow.
- The Final Project is out!
- It will be worth two projects (because it used to be two separate projects).
- It will have two short checkpoints due this Friday and next Friday.
- You can request an extension on the checkpoints.
- You cannot request an extension on the final submission deadline, Wednesday, June 12 (the Wednesday of finals week).
Agenda 📆¶
- Review: Predicting tips.
- $R^2$.
- Feature engineering.
- Example: Predicting tips.
- One hot encoding.
- Example: Predicting ratings ⭐️.
- Dropping features.
- Ordinal encoding.
- Example: Horsepower 🚗.
- Quantitative scaling.
- Example: Predicting tips.
- Feature engineering in `sklearn`.
- Transformer classes.
- Creating `Pipeline`s.
Review: Predicting tips 🧑🍳¶
tips
|   | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 3.07 | 1.00 | Female | Yes | Sat | Dinner | 1 |
| 1 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 |
| 2 | 26.59 | 3.41 | Male | Yes | Sat | Dinner | 3 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 241 | 17.47 | 3.50 | Female | No | Thur | Lunch | 2 |
| 242 | 10.07 | 1.25 | Male | No | Sat | Dinner | 2 |
| 243 | 16.93 | 3.07 | Female | No | Sat | Dinner | 3 |

244 rows × 7 columns
Linear models¶
Last time, we fit three linear models to predict restaurant tips:
- Constant model: $\text{predicted tip} = h$.
- Simple linear regression: $\text{predicted tip} = w_0 + w_1 \cdot \text{total bill}$.
- Multiple linear regression: $\text{predicted tip} = w_0 + w_1 \cdot \text{total bill} + w_2 \cdot \text{table size}$.
In the constant model case, we know that the optimal model parameter, when using squared loss, is $h^* = \text{mean tip}$.
mean_tip = tips['tip'].mean()
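As a quick sanity check (the grid of candidate constants below is arbitrary and just for illustration), the mean should give a smaller mean squared error than other constant predictions:
# Sanity check: among a handful of candidate constant predictions,
# the mean tip minimizes mean squared error.
candidates = np.append(np.arange(2.0, 4.0, 0.25), mean_tip)
mses = pd.Series([np.mean((tips['tip'] - h) ** 2) for h in candidates], index=candidates)
mses.idxmin(), mean_tip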
In the other two cases, we used the `LinearRegression` class from `sklearn` to help us find optimal model parameters.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X=tips[['total_bill']], y=tips['tip'])
model_two = LinearRegression()
model_two.fit(X=tips[['total_bill', 'size']], y=tips['tip'])
LinearRegression()
Root mean squared error¶
To compare the performance of different models, we used the root mean squared error (RMSE).
$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i = 1}^n \big( y_i - H(x_i) \big)^2}$$

def rmse(actual, pred):
    return np.sqrt(np.mean((actual - pred) ** 2))
rmse_dict = {}
rmse_dict['constant tip amount'] = rmse(tips['tip'], mean_tip)
all_preds = model.predict(tips[['total_bill']])
rmse_dict['one feature: total bill'] = rmse(tips['tip'], all_preds)
rmse_dict['two features'] = rmse(
tips['tip'], model_two.predict(tips[['total_bill', 'size']])
)
pd.DataFrame({'rmse': rmse_dict.values()}, index=rmse_dict.keys())
|   | rmse |
|---|---|
| constant tip amount | 1.38 |
| one feature: total bill | 1.02 |
| two features | 1.01 |
The `.score` method of a `LinearRegression` object¶
Model objects in `sklearn` that have already been fit have a `score` method.
model_two.score(tips[['total_bill', 'size']], tips['tip'])
0.46786930879612565
That doesn't look like the RMSE... what is it? 🤔
Aside: $R^2$¶
- $R^2$, or the coefficient of determination, is a measure of the quality of a linear fit.
- There are a few equivalent ways of computing it, assuming your model is linear and has an intercept term:
$$R^2 = \frac{\text{var}(\text{predicted $y$ values})}{\text{var}(\text{actual $y$ values})} = \left[ \text{correlation}(\text{predicted $y$ values}, \text{actual $y$ values}) \right]^2$$
- Interpretation: $R^2$ is the proportion of variance in $y$ that the linear model explains.
- In the simple linear regression case, it is the square of the correlation coefficient, $r$.
- Key idea: $R^2$ ranges from 0 to 1. The closer it is to 1, the better the linear fit is.
- $R^2$ has no units of measurement, unlike RMSE.
Calculating $R^2$¶
Let's calculate the $R^2$ for `model_two`'s predictions in three different ways.
pred = tips.assign(predicted=model_two.predict(tips[['total_bill', 'size']]))
pred
|   | total_bill | tip | sex | smoker | day | time | size | predicted |
|---|---|---|---|---|---|---|---|---|
| 0 | 3.07 | 1.00 | Female | Yes | Sat | Dinner | 1 | 1.15 |
| 1 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 | 2.80 |
| 2 | 26.59 | 3.41 | Male | Yes | Sat | Dinner | 3 | 3.71 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 241 | 17.47 | 3.50 | Female | No | Thur | Lunch | 2 | 2.67 |
| 242 | 10.07 | 1.25 | Male | No | Sat | Dinner | 2 | 1.99 |
| 243 | 16.93 | 3.07 | Female | No | Sat | Dinner | 3 | 2.82 |

244 rows × 8 columns
Method 1: $R^2 = \frac{\text{var}(\text{predicted $y$ values})}{\text{var}(\text{actual $y$ values})}$
np.var(pred['predicted']) / np.var(pred['tip'])
0.4678693087961255
Method 2: $R^2 = \left[ \text{correlation}(\text{predicted $y$ values}, \text{actual $y$ values}) \right]^2$
Note: By correlation here, we are referring to $r$, the same correlation coefficient you saw in DSC 10.
pred.corr().loc['predicted', 'tip'] ** 2
0.46786930879612554
Method 3: LinearRegression.score
model_two.score(tips[['total_bill', 'size']], tips['tip'])
0.46786930879612565
All three methods provide the same result!
Relationship between $R^2$ and RMSE¶
For linear models with an intercept term,
$$R^2 = 1 - \frac{\text{RMSE}^2}{\text{var}(\text{actual $y$ values})}$$

1 - rmse(pred['tip'], pred['predicted']) ** 2 / np.var(pred['tip'])
0.4678693087961261
What's next?¶
tips.head()
|   | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 3.07 | 1.00 | Female | Yes | Sat | Dinner | 1 |
| 1 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 |
| 2 | 26.59 | 3.41 | Male | Yes | Sat | Dinner | 3 |
| 3 | 14.26 | 2.50 | Male | No | Thur | Lunch | 2 |
| 4 | 21.16 | 3.00 | Male | No | Thur | Lunch | 2 |
So far, in our journey to predict `'tip'`, we've only used the existing numerical features in our dataset, `'total_bill'` and `'size'`.

There's a lot of information in `tips` that we didn't use – `'sex'`, `'smoker'`, `'day'`, and `'time'`, for example. We can't use these features in their current form, because they're non-numeric.

How do we use categorical features in a regression model?
Feature engineering ⚙️¶
The goal of feature engineering¶
- Feature engineering is the act of finding transformations that turn raw data into effective quantitative variables.
- A feature function $\phi$ (phi, pronounced "fee") is a mapping from raw data to $d$-dimensional space, i.e. $\phi: \text{raw data} \rightarrow \mathbb{R}^d$.
- If two observations $x_i$ and $x_j$ are "similar" in the raw data space, then $\phi(x_i)$ and $\phi(x_j)$ should also be "similar."
- A "good" choice of features depends on many factors:
- The kind of data, i.e. quantitative, ordinal, or nominal.
- The relationship(s) being modeled.
- The model type, e.g. linear models, decision tree models, neural networks.
- To introduce different feature functions, we'll look at several different example datasets.
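To make the idea of a feature function concrete, here's a minimal sketch of one possible $\phi$ for the tips data. The function name `phi` and the specific choice of features are ours for illustration; they're not part of the dataset or of any library.
# A minimal sketch of a feature function for the tips data (illustrative only).
# It maps one raw row to a vector in R^3: total bill, table size, and a binary
# indicator for whether the meal was dinner.
def phi(row):
    return np.array([
        row['total_bill'],
        row['size'],
        1 if row['time'] == 'Dinner' else 0,
    ])
phi(tips.iloc[0])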
One hot encoding¶
- One hot encoding is a transformation that turns a categorical feature into several binary features.
- Suppose a column has $N$ unique values, $A_1$, $A_2$, ..., $A_N$. For each unique value $A_i$, we define the following feature function:
$$\phi_i(x) = \begin{cases} 1 & \text{if } x = A_i \\ 0 & \text{otherwise} \end{cases}$$
- Note that 1 means "yes" and 0 means "no".
- One hot encoding is also called "dummy encoding", and $\phi(x)$ may also be referred to as an "indicator variable".
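As an aside, pandas can produce these binary columns directly; here's a quick sketch using `pd.get_dummies`, just to show the idea – in the example below we'll build the columns manually instead.
# One hot encoding 'smoker' with pandas (illustration only; dtype=int asks
# for 0/1 columns rather than booleans).
pd.get_dummies(tips['smoker'], dtype=int).head()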
Example: One hot encoding `'smoker'`¶
For each unique value of `'smoker'` in our dataset, we create a separate binary column for just that value. (Remember, `'smoker'` is `'Yes'` when the table was in the smoking section of the restaurant and `'No'` otherwise.)
tips.head()
|   | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 3.07 | 1.00 | Female | Yes | Sat | Dinner | 1 |
| 1 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 |
| 2 | 26.59 | 3.41 | Male | Yes | Sat | Dinner | 3 |
| 3 | 14.26 | 2.50 | Male | No | Thur | Lunch | 2 |
| 4 | 21.16 | 3.00 | Male | No | Thur | Lunch | 2 |
tips['smoker'].value_counts()
No     151
Yes     93
Name: smoker, dtype: int64
(tips['smoker'] == 'Yes').astype(int).head()
0    1
1    0
2    1
3    0
4    0
Name: smoker, dtype: int64
# Create one binary column per unique value of 'smoker'.
for val in tips['smoker'].unique():
    tips[f'smoker == {val}'] = (tips['smoker'] == val).astype(int)
tips.head()
|   | total_bill | tip | sex | smoker | ... | time | size | smoker == Yes | smoker == No |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.07 | 1.00 | Female | Yes | ... | Dinner | 1 | 1 | 0 |
| 1 | 18.78 | 3.00 | Female | No | ... | Dinner | 2 | 0 | 1 |
| 2 | 26.59 | 3.41 | Male | Yes | ... | Dinner | 3 | 1 | 0 |
| 3 | 14.26 | 2.50 | Male | No | ... | Lunch | 2 | 0 | 1 |
| 4 | 21.16 | 3.00 | Male | No | ... | Lunch | 2 | 0 | 1 |

5 rows × 9 columns
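As another aside, `sklearn` has a transformer that performs the same encoding; we'll meet transformer classes properly later in the agenda. Here's a minimal sketch, assuming a recent scikit-learn (≥ 1.0) so that `OneHotEncoder` and `get_feature_names_out` are available.
# Minimal sketch: one hot encoding 'smoker' with sklearn's OneHotEncoder.
# This carries the same information as the columns we just created by hand.
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
encoded = ohe.fit_transform(tips[['smoker']])  # sparse matrix, one column per category
ohe.get_feature_names_out(), encoded.toarray()[:5]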
Model #4: Multiple linear regression using total bill, table size, and smoker status¶
Now that we've converted `'smoker'` to a numerical variable, we can use it as an input in a regression model. Here's the model we'll try to fit:
$$\text{predicted tip} = w_0 + w_1 \cdot \text{total bill} + w_2 \cdot \text{table size} + w_3 \cdot \text{smoker == Yes}$$
Subtlety: There's no need to use both `'smoker == No'` and `'smoker == Yes'`. If we know the value of one, we already know the value of the other, so we can use either one.
model_three = LinearRegression()
model_three.fit(tips[['total_bill', 'size', 'smoker == Yes']], tips['tip'])
LinearRegression()
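To back up the subtlety above, here's a quick check (the name `model_three_alt` is ours, purely for illustration): fitting the same model with `'smoker == No'` instead of `'smoker == Yes'` gives different coefficients but identical predictions.
# Using the other indicator column gives the same predictions -- only the
# intercept and the sign of the smoker coefficient change.
model_three_alt = LinearRegression()
model_three_alt.fit(tips[['total_bill', 'size', 'smoker == No']], tips['tip'])
np.allclose(
    model_three.predict(tips[['total_bill', 'size', 'smoker == Yes']]),
    model_three_alt.predict(tips[['total_bill', 'size', 'smoker == No']])
)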
The following cell gives us our $w^*$s:
model_three.intercept_, model_three.coef_
(0.7090155167346053, array([ 0.09, 0.18, -0.08]))
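As a sanity check (illustration only), plugging one row's features into these coefficients should reproduce `model_three.predict` for that row:
# Prediction "by hand" for the first row vs. model_three.predict.
first_row = tips[['total_bill', 'size', 'smoker == Yes']].iloc[[0]]
by_hand = model_three.intercept_ + (model_three.coef_ * first_row.to_numpy()).sum()
by_hand, model_three.predict(first_row)[0]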
Thus, our trained linear model to predict tips given total bills, table sizes, and smoker status (yes or no) is:
$$\text{predicted tip} = 0.709 + 0.094 \cdot \text{total bill} + 0.180 \cdot \text{table size} - 0.083 \cdot \text{smoker == Yes}$$
Visualizing Model #4¶
Our new fit model is:
$$\text{predicted tip} = 0.709 + 0.094 \cdot \text{total bill} + 0.180 \cdot \text{table size} - 0.083 \cdot \text{smoker == Yes}$$
To visualize our data and linear model, we'd need 4 dimensions:
- One for total bill.
- One for table size.
- One for `'smoker == Yes'`.
- One for tip.
Humans can't visualize in 4D, but there may be a solution. We know that `'smoker == Yes'` only has two possible values, 1 or 0, so let's look at those cases separately.

Case 1: `'smoker == Yes'` is 1, meaning that the table was in the smoking section.
$$\text{predicted tip} = (0.709 - 0.083) + 0.094 \cdot \text{total bill} + 0.180 \cdot \text{table size}$$

Case 2: `'smoker == Yes'` is 0, meaning that the table was not in the smoking section.
$$\text{predicted tip} = 0.709 + 0.094 \cdot \text{total bill} + 0.180 \cdot \text{table size}$$
Key idea: These are two parallel planes in 3D, with different $z$-intercepts!
Note that the two planes are very close to one another – you'll have to zoom in to see the difference.
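A quick way to read those two intercepts off of the fitted model (illustration only):
# The two planes share the total bill and table size slopes;
# only the intercept changes with smoker status.
intercept_non_smoking = model_three.intercept_                     # 'smoker == Yes' = 0
intercept_smoking = model_three.intercept_ + model_three.coef_[2]  # 'smoker == Yes' = 1
intercept_non_smoking, intercept_smoking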
# pio.renderers.default = 'plotly_mimetype+notebook' # If it doesn't render, try uncommenting this.
XX, YY = np.mgrid[0:50:2, 0:8:1]
Z_0 = model_three.intercept_ + model_three.coef_[0] * XX + model_three.coef_[1] * YY + model_three.coef_[2] * 0
Z_1 = model_three.intercept_ + model_three.coef_[0] * XX + model_three.coef_[1] * YY + model_three.coef_[2] * 1
plane_0 = go.Surface(x=XX, y=YY, z=Z_0, colorscale='Greens')
plane_1 = go.Surface(x=XX, y=YY, z=Z_1, colorscale='Purples')
fig = go.Figure(data=[plane_0, plane_1])
tips_0 = tips[tips['smoker'] == 'No']
tips_1 = tips[tips['smoker'] == 'Yes']
fig.add_trace(go.Scatter3d(x=tips_0['total_bill'],
y=tips_0['size'],
z=tips_0['tip'], mode='markers', marker = {'color': 'green'}))
fig.add_trace(go.Scatter3d(x=tips_1['total_bill'],
y=tips_1['size'],
z=tips_1['tip'], mode='markers', marker = {'color': 'purple'}))
fig.update_layout(scene = dict(
xaxis_title='Total Bill',
yaxis_title='Table Size',
zaxis_title='Tip'),
title='Tip vs. Total Bill and Table Size (Green = Non-Smoking Section, Purple = Smoking Section)',
width=1000, height=800,
showlegend=False)