import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-white')  # note: newer matplotlib versions call this style 'seaborn-v0_8-white'
plt.rc('figure', dpi=100, figsize=(7, 5))
plt.rc('font', size=12)
Feature engineering is the act of finding transformations that turn raw data into effective quantitative variables.
A feature function $\phi$ (phi, pronounced "fee" or "fy") is a mapping from raw data to $d$-dimensional space, i.e. $\phi: \text{raw data} \rightarrow \mathbb{R}^d$.
A "good" choice of features depends on many factors.
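As a toy illustration (a hypothetical record, not from any dataset in this lecture), a feature function might map a raw record containing an age, a purchase flag, and a free-text review to a vector in $\mathbb{R}^3$:
# Hypothetical feature function: maps one raw record (a dict) to a vector in R^3.
def phi(record):
    return np.array([
        record['age'],                          # already quantitative
        1.0 if record['has_bought'] else 0.0,   # boolean -> 0/1
        float(len(record['review'])),           # crude text feature: review length
    ])
phi({'age': 32, 'has_bought': True, 'review': 'Meh.'})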
We want to build a multiple regression model that uses the features 'UID', 'AGE', 'STATE', 'HAS_BOUGHT', and 'REVIEW' below to predict 'RATING'.
Why can't we build a model right away?
UID | AGE | STATE | HAS_BOUGHT | REVIEW | RATING
---|---|---|---|---|---
74 | 32 | NY | True | "Meh." | ✩✩
42 | 50 | WA | True | "Worked out of the box..." | ✩✩✩✩
57 | 16 | CA | NULL | "Hella tots lit yo..." | ✩
... | ... | ... | ... | ... | ...
(int) | (int) | (str) | (bool) | (str) | (str)
- 'UID' was likely used to join the user information (e.g., 'AGE' and 'STATE') with some reviews dataset.
- Even though 'UID's are stored as numbers, the numerical value of a user's 'UID' won't help us predict their 'RATING'.
- If we include the 'UID' feature, our model will find whatever patterns it can between 'UID's and 'RATING's in the training (observed) data.
- Since there is no meaningful relationship between 'UID' and 'RATING', this will lead to worse model performance on unseen data (bad).
- So, we should drop 'UID'.
There are certain scenarios where manually dropping features might be helpful:
- When a feature won't be available at prediction time, we should drop it.
  - The reason we're predicting 'RATING's is so that we can predict 'RATING's for users who haven't actually made a 'RATING' yet.
  - So, we can only use features that are available before a user makes their 'RATING'.
  - If users only write 'REVIEW's after entering 'RATING's, we shouldn't use 'REVIEW's as a feature.
How do we encode the 'RATING' column, an ordinal variable, as a quantitative variable?
order_values = ['✩', '✩✩', '✩✩✩', '✩✩✩✩', '✩✩✩✩✩']
# Map each star rating to its position in the ordering (1 through 5).
ordinal_enc = {y: x + 1 for (x, y) in enumerate(order_values)}
ordinal_enc
{'✩': 1, '✩✩': 2, '✩✩✩': 3, '✩✩✩✩': 4, '✩✩✩✩✩': 5}
ratings = pd.DataFrame().assign(RATING=['✩', '✩✩', '✩✩✩', '✩✩', '✩✩✩', '✩', '✩✩✩', '✩✩✩✩', '✩✩✩✩✩'])
ratings
 | RATING |
---|---|
0 | ✩ |
1 | ✩✩ |
2 | ✩✩✩ |
3 | ✩✩ |
4 | ✩✩✩ |
5 | ✩ |
6 | ✩✩✩ |
7 | ✩✩✩✩ |
8 | ✩✩✩✩✩ |
ratings.replace(ordinal_enc)
 | RATING |
---|---|
0 | 1 |
1 | 2 |
2 | 3 |
3 | 2 |
4 | 3 |
5 | 1 |
6 | 3 |
7 | 4 |
8 | 5 |
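As a small aside, Series.map gives the same result when encoding a single column (a sketch reusing the ordinal_enc dictionary defined above):
# Equivalent encoding of just the 'RATING' column, using Series.map.
ratings['RATING'].map(ordinal_enc)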
How do we encode the 'STATE' column, a nominal variable, as a quantitative variable?
- How do we turn 'STATE's into meaningful numbers?
- 'STATE' is not an ordinal variable - Wyoming is not inherently "more" of anything than Alabama.
- One idea: create a binary column for each state - 'is_AL', 'is_AK', ..., 'is_WY'. This is called one-hot encoding.
One-hot encoding: suppose column 'col' has $N$ unique values, $A_1$, $A_2$, ..., $A_N$. For each unique value $A_i$, we define the following feature function:
$$\phi_i(x) = \begin{cases} 1 & \text{if } x = A_i \\ 0 & \text{otherwise} \end{cases}$$
To one-hot encode 'STATE', for each unique 'STATE' in our dataset, we must create a column for just that 'STATE'. Let's perform the one-hot encoding ourselves.
states = pd.DataFrame().assign(STATE=['NY', 'WA', 'CA', 'NY', 'OR'])
states
 | STATE |
---|---|
0 | NY |
1 | WA |
2 | CA |
3 | NY |
4 | OR |
First, we need to access all unique values of 'STATE'.
unique_states = states['STATE'].unique()
unique_states
array(['NY', 'WA', 'CA', 'OR'], dtype=object)
How might we create one-hot-encoded columns manually?
states['STATE'] == unique_states[0]
0     True
1    False
2    False
3     True
4    False
Name: STATE, dtype: bool
pd.Series(states['STATE'] == unique_states[1], dtype=int)
0    0
1    1
2    0
3    0
4    0
Name: STATE, dtype: int64
def ohe_states(states_ser):
    # Compare one state value against every unique state; returns a Series of 0s and 1s indexed by state.
    return pd.Series(states_ser == unique_states, index=unique_states, dtype=int)
states['STATE'].apply(ohe_states)
 | NY | WA | CA | OR |
---|---|---|---|---|
0 | 1 | 0 | 0 | 0 |
1 | 0 | 1 | 0 | 0 |
2 | 0 | 0 | 1 | 0 |
3 | 1 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 1 |
Soon, we will learn how to "automatically" perform one-hot encoding.
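As a preview, here is a sketch of one common pandas-only approach, pd.get_dummies, applied to the states DataFrame defined above (the "automatic" approach referenced here is presumably a later sklearn tool):
# Preview: build the same kind of 0/1 columns in one call.
# (Column order may differ from the manual version above.)
pd.get_dummies(states['STATE'], dtype=int)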
The feature transformations we've discussed so far have involved converting categorical variables into quantitative variables. However, at times we'll need to transform quantitative variables into new quantitative variables.
tips = sns.load_dataset('tips')
tips
 | total_bill | tip | sex | smoker | day | time | size |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
... | ... | ... | ... | ... | ... | ... | ... |
239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 |
240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 |
241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 |
242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |
243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 |
244 rows × 7 columns
For now, let's focus on 'total_bill'. Let's look at the relationship between 'total_bill' and 'tip', as well as the distributions of both columns individually.
sns.lmplot(data=tips, x='total_bill', y='tip');
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))
sns.histplot(tips['total_bill'], kde=True, ax=ax1)
sns.histplot(tips['tip'], kde=True, ax=ax2);
'total_bill' | 'tip'
---|---
Right skewed | Right skewed |
Mean around \$20 | Mean around \$3 |
Mode around \$15 | Possibly bimodal? |
No large bills | Large outliers? |
"...but some are useful."
"Since all models are wrong the scientist cannot obtain a "correct" one by excessive elaboration. On the contrary following William of Occam he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity."
"Since all models are wrong the scientist must be alert to what is importantly wrong. It is inappropriate to be concerned about mice when there are tigers abroad."
-- George Box
There are several ways we could estimate $h^{\text{true}}$.
From DSC 40A, we already know one way: pick a loss function, then find the constant prediction $h^*$ that minimizes empirical risk (average loss) over the observed data.
Let's suppose we choose squared loss, meaning that $h^* = \text{mean}(y)$.
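To spell out that step, with squared loss the empirical risk of a constant prediction $h$ is minimized at the mean of the observed $y$ values:
$$h^* = \underset{h}{\operatorname{argmin}} \ \frac{1}{n} \sum_{i=1}^n (y_i - h)^2 = \text{mean}(y)$$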
mean_tip = tips['tip'].mean()
mean_tip
2.9982786885245902
Recall that minimizing MSE is equivalent to minimizing RMSE; however, RMSE has the added benefit that it is in the same units as our data. We will compute and keep track of the RMSEs of the different models we build (as we did last lecture).
def rmse(actual, pred):
    # Root mean squared error between observed values and predictions.
    return np.sqrt(np.mean((actual - pred) ** 2))
rmse(tips['tip'], mean_tip)
1.3807999538298958
rmse_dict = {}
rmse_dict['constant, tip'] = rmse(tips['tip'], mean_tip)
Since the mean minimizes RMSE for the constant model, it is impossible to change the mean_tip argument above to another number and yield a lower RMSE.
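As a quick sanity check (an added aside; the offsets of ±0.5 are arbitrary), shifting the constant away from the mean can only increase the RMSE:
# Any constant other than the mean yields a larger RMSE than rmse_dict['constant, tip'].
rmse(tips['tip'], mean_tip + 0.5), rmse(tips['tip'], mean_tip - 0.5)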
If we are going to make a constant prediction, a more natural constant to predict might be the tip percentage.
We can derive the 'pct_tip' feature ourselves using existing information: $$\texttt{pct\_tip} = \frac{\texttt{tip}}{\texttt{total\_bill}}$$
tips = tips.assign(pct_tip=(tips['tip'] / tips['total_bill']))
sns.histplot(tips['pct_tip'], kde=True);
tips
mean_pct_tip = tips['pct_tip'].mean()
mean_pct_tip
0.16080258172250478
Our goal is to predict 'tip', but above we have predicted 'pct_tip'. 'pct_tip' is a multiplier that we apply to 'total_bill' to get 'tip'. That is:
$$\text{predicted tip} = \texttt{mean\_pct\_tip} \cdot \texttt{total\_bill}$$
tips['total_bill'] * mean_pct_tip
0      2.732036
1      1.662699
2      3.378462
3      3.807805
4      3.954135
         ...
239    4.668099
240    4.370614
241    3.645395
242    2.865502
243    3.019872
Name: total_bill, Length: 244, dtype: float64
rmse_dict['constant, pct_tip'] = rmse(tips['tip'], tips['total_bill'] * mean_pct_tip)
rmse_dict
{'constant, tip': 1.3807999538298958, 'constant, pct_tip': 1.146820820140744}
To model 'tip' as a linear function of 'total_bill', we use the simple linear regression model $H(\texttt{total\_bill}) = w_0 + w_1 \cdot \texttt{total\_bill}$. By choosing a loss function and minimizing empirical risk, we can find the optimal parameters $w_0^*$ and $w_1^*$.
In order to use a linear model, the data should have a linear association.
sns.lmplot(data=tips, x='total_bill', y='tip');
Again, we will learn more about sklearn in the coming lectures.
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X=tips[['total_bill']], y=tips['tip'])
LinearRegression()
lr.intercept_, lr.coef_
(0.9202696135546735, array([0.10502452]))
Note that the above coefficients state that the "best way" (according to squared loss) to make tip predictions using a linear model is to assume people tip a constant base amount plus a fixed fraction of the total bill.
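Using the fitted intercept and slope above, the model's predictions are approximately:
$$\text{predicted tip} \approx 0.92 + 0.105 \cdot \texttt{total\_bill}$$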
preds = lr.predict(X=tips[['total_bill']])
rmse_dict['linear model'] = rmse(tips['tip'], preds)
rmse_dict
{'constant, tip': 1.3807999538298958, 'constant, pct_tip': 1.146820820140744, 'linear model': 1.0178504025697377}
There's a lot of information in tips that we didn't use – 'sex', 'day', and 'time', for example. How might we encode this information?
tips
 | total_bill | tip | sex | smoker | day | time | size | pct_tip |
---|---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 0.059447 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 0.160542 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 0.166587 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 0.139780 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 0.146808 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 | 0.203927 |
240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 | 0.073584 |
241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 | 0.088222 |
242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 | 0.098204 |
243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 | 0.159744 |
244 rows × 8 columns
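One possible answer to the question above (a sketch only, not necessarily the method the lecture will take; the names features and lr_multi are made up here) is to one-hot encode the categorical columns with pd.get_dummies and fit a multiple linear regression on the result:
# Sketch: one-hot encode the categorical columns, keep the quantitative ones,
# then fit a multiple linear regression on the expanded feature matrix.
features = pd.get_dummies(
    tips[['total_bill', 'size', 'sex', 'smoker', 'day', 'time']],
    columns=['sex', 'smoker', 'day', 'time'],
    dtype=int,
)
lr_multi = LinearRegression()
lr_multi.fit(X=features, y=tips['tip'])
rmse(tips['tip'], lr_multi.predict(features))
In practice, passing drop_first=True to pd.get_dummies avoids redundant columns.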
Next time: continue with the tips example, and start formally learning sklearn.