Lecture 20 – Features

DSC 80, Spring 2022

Announcements

Agenda

Recap: TF-IDF

Term frequency-inverse document frequency

The term frequency-inverse document frequency (TF-IDF) of word $t$ in document $d$ is the product:

$$ \begin{align*}\text{tfidf}(t, d) &= \text{tf}(t, d) \cdot \text{idf}(t) \\ &= \frac{\text{number of occurrences of $t$ in $d$}}{\text{total number of words in $d$}} \cdot \log \left(\frac{\text{total number of documents}}{\text{number of documents in which $t$ appears}} \right) \end{align*} $$
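For concreteness, here's a minimal sketch that computes this quantity by hand on a tiny made-up corpus (the corpus and all names below are for illustration only):

```python
import numpy as np

# Toy corpus: each document is a list of words.
docs = [
    ['the', 'dog', 'chased', 'the', 'cat'],
    ['the', 'cat', 'slept'],
    ['dogs', 'and', 'cats', 'are', 'pets'],
]

def tfidf(t, d, docs):
    # tf: proportion of the words in document d that are t.
    tf = d.count(t) / len(d)
    # idf: log of (total documents / documents containing t).
    n_containing = sum(1 for doc in docs if t in doc)
    idf = np.log(len(docs) / n_containing)
    return tf * idf

# TF-IDF of 'cat' in the first document.
print(tfidf('cat', docs[0], docs))
```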

Example: State of the Union addresses 🎤

Recall that last class, we computed the TF-IDF of every word in every SOTU speech, and used TF-IDFs to summarize the speeches.

Aside: What if we remove the $\log$ from $\text{idf}(t)$?

Let's try it and see what happens.
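A minimal sketch of the comparison (again with a made-up corpus):

```python
import numpy as np

docs = [
    ['the', 'sky', 'is', 'blue'],
    ['the', 'sun', 'is', 'bright'],
    ['the', 'moon', 'rose'],
]

def idf(t, docs, use_log=True):
    # Ratio of total documents to documents containing t.
    ratio = len(docs) / sum(1 for d in docs if t in d)
    return np.log(ratio) if use_log else ratio

# 'the' appears in every document: idf is log(1) = 0 with the log, 1 without.
print(idf('the', docs), idf('the', docs, use_log=False))

# 'moon' appears in 1 of 3 documents: log(3) ≈ 1.1 with the log, 3 without.
print(idf('moon', docs), idf('moon', docs, use_log=False))
```

With the log, a word that appears in every document gets an IDF of $\log(1) = 0$ and is ignored entirely; without it, that word keeps an IDF of 1, so its (typically large) term frequency carries straight through to the TF-IDF.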

The role of $\log$ in $\text{idf}(t)$

$$ \begin{align*}\text{tfidf}(t, d) &= \text{tf}(t, d) \cdot \text{idf}(t) \\ &= \frac{\text{number of occurrences of $t$ in $d$}}{\text{total number of words in $d$}} \cdot \log \left(\frac{\text{total number of documents}}{\text{number of documents in which $t$ appears}} \right) \end{align*} $$

Features

Reflection

So far this quarter, we've learned how to:

Features

Note: TF-IDF is a feature we've created that summarizes documents!

Example: San Diego employee salaries

What features are already present in the salaries dataset? What new features can we create?

What makes a good feature?

Example: Predicting child heights 📏

Galton's heights dataset

Exploratory data analysis

The following scatter matrix contains a scatter plot of all pairs of quantitative attributes, and a histogram for each quantitative attribute on its own.
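One way to produce such a plot is with pandas (a sketch; we assume the data lives in a DataFrame named galton, and the column name 'childHeight' is a placeholder):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Scatter plots of all pairs of quantitative attributes;
# the diagonal shows a histogram of each attribute on its own.
pd.plotting.scatter_matrix(galton[['father', 'mother', 'childHeight']], figsize=(8, 8))
plt.show()
```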

Is a linear model suitable for prediction? If so, on which attributes?

Attempt #1: Predict child's height using father's height

We will assume that the relationship between father's height and child's height is linear. That is,

$$\text{predicted child's height} = w_0^* + w_1^* \cdot \text{father's height}$$

where $w_0^*$ and $w_1^*$ are carefully chosen parameters.

seaborn's lmplot function can automatically plot the "line of best fit" on a scatter plot.
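For instance (using the same hypothetical names as above):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Scatter plot of father vs. child heights, with the fitted line overlaid.
sns.lmplot(data=galton, x='father', y='childHeight')
plt.show()
```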

Recap: Simple linear regression

For any father's height $x_i$, their predicted child's height is given by

$$H(x_i) = w_0 + w_1x_i$$

We choose the parameters $w_0$ and $w_1$ to minimize mean squared error:

$$\begin{align*}\text{MSE} &= \frac{1}{n} \sum_{i = 1}^n \big( y_i - H(x_i) \big)^2 \\ &= \frac{1}{n} \sum_{i = 1}^n \big( y_i - w_0 - w_1x_i \big)^2\end{align*}$$

The minimizing parameters are

$$w_1^* = r \cdot \frac{\sigma_y}{\sigma_x} \qquad w_0^* = \bar{y} - w_1^* \bar{x}$$

where $r$ is the correlation coefficient between the $x$- and $y$-values, and $\sigma$ and the bars denote their standard deviations and means.

Finding the regression line programmatically

There are several packages that can perform linear regression; scipy.stats is one of them.

The lm object has several attributes, most notably, slope and intercept.

pred_child works on scalar values:

But it also works on arrays/Series:
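Putting those pieces together (a sketch, with the hypothetical names from above):

```python
from scipy.stats import linregress

# Fit the simple linear model; the result has .slope and .intercept attributes.
lm = linregress(x=galton['father'], y=galton['childHeight'])

def pred_child(father_height):
    """Predicted child's height, given a father's height."""
    return lm.intercept + lm.slope * father_height

pred_child(72)                 # a scalar prediction
pred_child(galton['father'])   # a Series of predictions
```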

Recall that a lower MSE means a better fit on the training data. Let's compute the MSE of this simple linear model; it will be useful later.
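Continuing the sketch above:

```python
import numpy as np

# MSE of the one-feature model on the training data (units: inches squared).
mse_father = np.mean((galton['childHeight'] - pred_child(galton['father'])) ** 2)
mse_father
```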

Aside: MSE vs. RMSE

An issue with mean squared error is that its units are the square of the units of the $y$-values.

$$\text{MSE} = \frac{1}{n} \sum_{i = 1}^n \big( y_i - H(x_i) \big)^2$$

For instance, the number below is 11.892 "inches squared".

To correct the units of mean squared error, we can take the square root. The result, root mean squared error (RMSE), is also a measure of how well a model fits the training data.

$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i = 1}^n \big( y_i - H(x_i) \big)^2}$$

Important: The line that minimizes MSE is the same line that minimizes RMSE and SSE (sum of squared errors).

Let's create a dictionary to keep track of the RMSEs of the various models we create.
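For instance (the key names are placeholders):

```python
# Map from a description of each model to its training RMSE.
rmse_dict = {}
rmse_dict['one feature: father'] = np.sqrt(mse_father)
```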

Visualizing our single-feature predictions

Attempt #2: Predict child's height using father's and mother's heights

$$\text{predicted child's height} = w_0^* + w_1^* \cdot \text{father's height} + w_2^* \cdot \text{mother's height}$$

Multiple regression in sklearn

We'll cover sklearn in more detail in the coming lectures.

A typical pattern in sklearn is instantiate, fit, and predict.

After calling fit on lr, we can access the intercept and coefficients of the plane of best fit (i.e. these are $w_0^*$, $w_1^*$, and $w_2^*$).

However, we don't actually need to access these directly. Fitted LinearRegression objects have a predict method, which we can use directly:
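A sketch of the instantiate-fit-predict pattern, continuing with the hypothetical galton DataFrame:

```python
from sklearn.linear_model import LinearRegression
import numpy as np

# Instantiate.
lr = LinearRegression()

# Fit: X is a 2D DataFrame/array of features, y is the target.
lr.fit(X=galton[['father', 'mother']], y=galton['childHeight'])

# w0*, and [w1*, w2*].
lr.intercept_, lr.coef_

# Predict without touching the parameters directly.
predictions = lr.predict(galton[['father', 'mother']])
rmse_dict['two features: father and mother'] = np.sqrt(
    np.mean((galton['childHeight'] - predictions) ** 2)
)
```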

How well does this model perform?

It seems like this two-feature model has a lower RMSE than the original single-feature model (which we'd expect), but it's only slightly lower.

Visualizing our two-feature predictions

Here, we must draw a 3D scatter plot and plane, with one axis for father's height, one axis for mother's height, and one axis for child's height. The code below does this.
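A sketch of one way to do this, using plotly and the fitted lr and hypothetical names from the earlier sketches:

```python
import numpy as np
import plotly.graph_objects as go

# Grid of (father, mother) heights over which to draw the fitted plane.
XX, YY = np.meshgrid(np.linspace(60, 80, 10), np.linspace(55, 75, 10))
ZZ = lr.intercept_ + lr.coef_[0] * XX + lr.coef_[1] * YY

fig = go.Figure([
    # The data, as a 3D scatter plot.
    go.Scatter3d(x=galton['father'], y=galton['mother'], z=galton['childHeight'],
                 mode='markers', marker=dict(size=2)),
    # The plane of best fit.
    go.Surface(x=XX, y=YY, z=ZZ, showscale=False),
])
fig.show()
```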

If we want to visualize in 2D, we must pick a single feature to display on the $x$-axis.

Attempt #3: Adding gender as a feature

Observation: It appears that the two lines have similar slopes, but different intercepts.

Attempt #3: Adding gender as a feature

There's an issue: gender is a categorical feature, but in order to use it as a feature in a regression model, it must be quantitative.

Solution: Create a column named 'gender=female' that is 1 when the child is female and 0 otherwise.
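A sketch of one way to create it (assuming a hypothetical 'gender' column whose values include the string 'female'):

```python
# 1 for female children, 0 otherwise.
galton['gender=female'] = (galton['gender'] == 'female').astype(int)
```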

Now, we can use 'gender=female' as a feature, just as we used 'father' and 'mother' as features.

$$\text{predicted child's height} \\ = w_0^* + w_1^* \cdot \text{father's height} + w_2^* \cdot \text{mother's height} + w_3^* \cdot \text{gender=female}$$
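Continuing the sketch above, we can fit this three-feature model with the same instantiate-fit-predict pattern:

```python
features = ['father', 'mother', 'gender=female']
lr_three = LinearRegression()
lr_three.fit(X=galton[features], y=galton['childHeight'])

predictions = lr_three.predict(galton[features])
rmse_dict['three features'] = np.sqrt(np.mean((galton['childHeight'] - predictions) ** 2))
```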

The RMSE of our new three-feature model is significantly lower than the RMSEs of the earlier models. This indicates that 'gender=female' is very useful in predicting children's heights.

Visualizing our three-feature predictions

To visualize our data and linear model, we'd need 4 dimensions: one each for father's height, mother's height, 'gender=female', and child's height.

Humans can't visualize in 4D, but there may be a solution.

Above, we are given the values of $w_0^*$, $w_1^*$, $w_2^*$, and $w_3^*$. This means our linear model is of the form:

$$\text{predicted child's height} \\ = 21.736 + 0.393 \cdot \text{father's height} + 0.318 \cdot \text{mother's height} - 5.215 \cdot \text{gender=female}$$

But remember, 'gender=female' is either 1 or 0. Let's look at those two cases separately.

If the child is female ('gender=female' $= 1$), the intercept becomes $21.736 - 5.215 = 16.521$:

$$\text{predicted child's height} = 16.521 + 0.393 \cdot \text{father's height} + 0.318 \cdot \text{mother's height}$$

If the child is male ('gender=female' $= 0$):

$$\text{predicted child's height} = 21.736 + 0.393 \cdot \text{father's height} + 0.318 \cdot \text{mother's height}$$

If we want to visualize in 2D, we must pick a single feature to display on the $x$-axis.

Summary, next time

Summary

Next time: feature engineering