Lecture 20 – Features

DSC 80, Spring 2022



Recap: TF-IDF

Term frequency-inverse document frequency

The term frequency-inverse document frequency (TF-IDF) of word $t$ in document $d$ is the product:

$$ \begin{align*}\text{tfidf}(t, d) &= \text{tf}(t, d) \cdot \text{idf}(t) \\ &= \frac{\text{number of occurrences of $t$ in $d$}}{\text{total number of words in $d$}} \cdot \log \left(\frac{\text{total number of documents}}{\text{number of documents in which $t$ appears}} \right) \end{align*} $$
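
As a minimal sketch of the formula above, the following computes TF-IDF by hand on a tiny made-up corpus. The documents, the function names `tf`, `idf`, and `tfidf`, whitespace tokenization, and the use of the natural logarithm are all assumptions made for illustration; they are not taken from the SOTU example.

```python
import math

# Hypothetical toy corpus: each document is a plain string.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird sang",
]

def tf(term, doc):
    """Proportion of the words in `doc` that are equal to `term`."""
    words = doc.split()
    return words.count(term) / len(words)

def idf(term, docs):
    """Log of (number of documents / number of documents containing `term`).

    Assumes `term` appears in at least one document.
    """
    n_containing = sum(term in doc.split() for doc in docs)
    return math.log(len(docs) / n_containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "cat" is 1 of 6 words in document 0 and appears in 2 of 3 documents.
cat_score = tfidf("cat", docs[0], docs)
# "bird" is 1 of 3 words in document 2 and appears in 1 of 3 documents.
bird_score = tfidf("bird", docs[2], docs)
```

Here `bird_score` is larger than `cat_score`: "bird" makes up a bigger share of its document and appears in fewer documents overall.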

Example: State of the Union addresses 🎤

Recall that last class, we computed the TF-IDF of every word in every SOTU speech, and used those TF-IDFs to summarize the speeches.

Aside: What if we remove the $\log$ from $\text{idf}(t)$?

Let's try it and see what happens.

The role of $\log$ in $\text{idf}(t)$

$$ \begin{align*}\text{tfidf}(t, d) &= \text{tf}(t, d) \cdot \text{idf}(t) \\ &= \frac{\text{number of occurrences of $t$ in $d$}}{\text{total number of words in $d$}} \cdot \log \left(\frac{\text{total number of documents}}{\text{number of documents in which $t$ appears}} \right) \end{align*} $$
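
To get a feel for what the $\log$ does, here is a small sketch comparing the raw ratio inside $\text{idf}(t)$ with its logarithm. The corpus size of 1000 and the three document counts are made-up numbers, and the natural logarithm is an assumption.

```python
import math

n_docs = 1000  # hypothetical corpus size

def idf_with_log(n_containing):
    return math.log(n_docs / n_containing)

def idf_without_log(n_containing):
    return n_docs / n_containing

# Hypothetical words appearing in 1, 100, and all 1000 documents.
rows = [(k, idf_without_log(k), idf_with_log(k)) for k in [1, 100, 1000]]
for k, raw, logged in rows:
    print(f"{k:>5} documents: ratio = {raw:8.1f}, log(ratio) = {logged:.3f}")
```

In these toy numbers, a word that appears in every document gets a weight of 0 with the $\log$ but 1 without it, and a word that appears in a single document gets a weight of about 6.9 with the $\log$ but 1000 without it. The $\log$ compresses the range of the ratio, so rare words do not dominate by several orders of magnitude.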



So far this quarter, we've learned how to:


Note: TF-IDF is a feature we've created that summarizes documents!

Example: San Diego employee salaries

What features are present in salaries? What features can we create?

What makes a good feature?

Example: Predicting child heights 📏

Galton's heights dataset

Exploratory data analysis

The following scatter matrix contains a scatter plot of all pairs of quantitative attributes, and a histogram for each quantitative attribute on its own.

Is a linear model suitable for prediction? If so, on which attributes?

Attempt #1: Predict child's height using father's height

We will assume that the relationship between a father's height and his child's height is linear. That is,

$$\text{predicted child's height} = w_0^* + w_1^* \cdot \text{father's height}$$

where $w_0^*$ and $w_1^*$ are carefully chosen parameters.

seaborn's lmplot function can automatically plot the "line of best fit" on a scatter plot.

Recap: Simple linear regression

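For a father of height $x_i$, the predicted height of his child is given by

$$H(x_i) = w_0 + w_1x_i$$

To choose $w_0$ and $w_1$, we minimize mean squared error over the $n$ observed (father's height, child's height) pairs $(x_i, y_i)$:

$$\begin{align*}\text{MSE} &= \frac{1}{n} \sum_{i = 1}^n \big( y_i - H(x_i) \big)^2 \\ &= \frac{1}{n} \sum_{i = 1}^n \big( y_i - w_0 - w_1x_i \big)^2\end{align*}$$

The minimizing parameters are

$$w_1^* = r \cdot \frac{\sigma_y}{\sigma_x}$$

$$w_0^* = \bar{y} - w_1^* \bar{x}$$

where $r$ is the correlation coefficient between $x$ and $y$, $\sigma_x$ and $\sigma_y$ are their standard deviations, and $\bar{x}$ and $\bar{y}$ are their means.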
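
The formulas for $w_0^*$ and $w_1^*$ can be checked numerically. The sketch below uses synthetic heights as a stand-in for Galton's data (the numbers are made up), and compares the result against `np.polyfit`, which also fits a least-squares line.

```python
import numpy as np

# Synthetic stand-in data, NOT Galton's dataset.
rng = np.random.default_rng(0)
x = rng.normal(69, 2.5, size=200)               # "father" heights, in inches
y = 40 + 0.4 * x + rng.normal(0, 2, size=200)   # "child" heights, in inches

# Optimal slope and intercept from the formulas.
r = np.corrcoef(x, y)[0, 1]
w1 = r * np.std(y) / np.std(x)
w0 = np.mean(y) - w1 * np.mean(x)

# Independent check: np.polyfit returns coefficients from highest degree down,
# so a degree-1 fit gives (slope, intercept).
slope_check, intercept_check = np.polyfit(x, y, deg=1)
print(w1, slope_check)
print(w0, intercept_check)
```

Both approaches should agree up to floating-point error.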

Finding the regression line programmatically

There are several packages that can perform linear regression; scipy.stats is one of them.

The lm object has several attributes, most notably, slope and intercept.

pred_child works on scalar values:

But it also works on arrays/Series:
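
A minimal sketch of this workflow is below. It assumes that `lm` is the result of `scipy.stats.linregress` and that `pred_child` simply applies the fitted line; the data is synthetic (not Galton's dataset), and the variable names `father`, `child`, `single`, and `many` are made up for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data, NOT Galton's dataset.
rng = np.random.default_rng(1)
father = rng.normal(69, 2.5, size=200)
child = 40 + 0.4 * father + rng.normal(0, 2, size=200)

# Fit the regression line; the result exposes .slope and .intercept.
lm = stats.linregress(x=father, y=child)

def pred_child(father_height):
    """Predicted child's height for one father's height or many."""
    return lm.intercept + lm.slope * father_height

single = pred_child(70)                      # a single number
many = pred_child(np.array([65, 70, 75]))    # one prediction per entry
```

Because `pred_child` uses only addition and multiplication, it should behave the same way when given a pandas Series instead of a NumPy array.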

Recall, a lower MSE means a better fit on the training data. Let's compute the MSE of this simple linear model; it will be useful later.

Aside: MSE vs. RMSE

An issue with mean squared error is that its units are the square of the units of the $y$-values.

$$\text{MSE} = \frac{1}{n} \sum_{i = 1}^n \big( y_i - H(x_i) \big)^2$$

For instance, the number below is 11.892 "inches squared".

To correct the units of mean squared error, we can take the square root. The result, root mean squared error (RMSE), is also a measure of how well a model fits training data.

$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i = 1}^n \big( y_i - H(x_i) \big)^2}$$
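
As a quick illustration of the two formulas, here is a sketch with helper functions named `mse` and `rmse` (hypothetical names) applied to four made-up heights and predictions.

```python
import numpy as np

def mse(actual, pred):
    """Mean squared error, in squared units of the y-values."""
    return np.mean((np.asarray(actual) - np.asarray(pred)) ** 2)

def rmse(actual, pred):
    """Root mean squared error, in the same units as the y-values."""
    return np.sqrt(mse(actual, pred))

# Hypothetical actual and predicted heights, in inches.
actual = np.array([66.0, 70.0, 68.0, 72.0])
pred = np.array([67.0, 69.0, 70.0, 71.0])

# Errors are -1, 1, -2, 1, so the squared errors are 1, 1, 4, 1.
print(mse(actual, pred))    # 1.75 (inches squared)
print(rmse(actual, pred))   # sqrt(1.75), about 1.32 (inches)
```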

Important: The line that minimizes MSE is the same line that minimizes RMSE and SSE (sum of squared errors).
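
The reason is that $\text{MSE} = \text{SSE} / n$ and $\text{RMSE} = \sqrt{\text{MSE}}$, and both dividing by a positive constant and taking a square root preserve order. The sketch below illustrates this on synthetic data (not Galton's dataset) by scanning a grid of candidate slopes; the grid and the variable names are assumptions made for the example.

```python
import numpy as np

# Synthetic stand-in data, NOT Galton's dataset.
rng = np.random.default_rng(2)
x = rng.normal(69, 2.5, size=100)
y = 40 + 0.4 * x + rng.normal(0, 2, size=100)

slopes = np.linspace(0, 1, 101)  # candidate values of w1

def sse_for_slope(w1):
    # For a fixed slope, the best intercept is mean(y) - w1 * mean(x).
    w0 = np.mean(y) - w1 * np.mean(x)
    return np.sum((y - w0 - w1 * x) ** 2)

sse = np.array([sse_for_slope(w1) for w1 in slopes])
mse = sse / len(x)
rmse = np.sqrt(mse)

# All three criteria select the same candidate slope.
print(slopes[np.argmin(sse)], slopes[np.argmin(mse)], slopes[np.argmin(rmse)])
```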

Let's create a dictionary to keep track of the RMSEs of the various models we create.

Visualizing our single-feature predictions