Lecture 20 – Modeling and Linear Regression

DSC 80, Winter 2023

📣 Announcements

RSVP to the capstone showcase on Wednesday, March 15th!

The senior capstone showcase is on Wednesday, March 15th in the Price Center East Ballroom. The DSC seniors will be presenting posters on their capstone projects. Come and ask them questions; if you're a DSC major, this will be you one day!

The session is broken into two blocks:

Look at the list of topics and RSVP here!

There will be no live DSC 80 lecture on the day of the showcase – instead, lecture will be pre-recorded!

Agenda

Modeling

Reflection

So far this quarter, we've learned how to:

Modeling

Goals of modeling

  1. To make accurate predictions regarding unseen data drawn from the data generating process.
    • Given this dataset of past UCSD data science students' salaries, can we predict your future salary? (regression)
    • Given this dataset of images, can we predict if this new image is of a dog, cat, or zebra? (classification)
  2. To make inferences about the structure of the data generating process, i.e. to understand complex phenomena.
    • Is there a linear relationship between the heights of children and the heights of their biological mothers?
    • The weights of smoking and non-smoking mothers' babies in my sample are different – how confident am I that this difference exists in the population?

Features

Example: Restaurant tips 🧑‍🍳

About the data

What features does the dataset contain?
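One way to load and inspect the data, shown here as a minimal sketch assuming seaborn's built-in copy of the tips dataset (in lecture the data may instead be read from a CSV):

```python
import seaborn as sns

# Load the tips dataset; each row is one table served at the restaurant.
tips = sns.load_dataset('tips')

# The available features.
tips.columns
# Index(['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size'], dtype='object')
```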

Predicting tips

Exploratory data analysis (EDA)

Visualizing distributions
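A possible sketch of the plots behind the observations below, assuming plotly express and the `tips` DataFrame loaded above:

```python
import plotly.express as px

# Distribution of total bills and of tips, one histogram per column.
px.histogram(tips, x='total_bill', nbins=50, title="Distribution of 'total_bill'")
px.histogram(tips, x='tip', nbins=50, title="Distribution of 'tip'")
```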

Observations

| 'total_bill' | 'tip' |
|:---|:---|
| Right skewed | Right skewed |
| Mean around \$20 | Mean around \$3 |
| Mode around \$16 | Possibly bimodal at \$2 and \$3? |
| No particularly large bills | Large outliers? |
Model #1: Constant

$$\text{tip} = h^{\text{true}}$$
George Box
"All models are wrong, but some are useful."

"Since all models are wrong the scientist cannot obtain a "correct" one by excessive elaboration. On the contrary following William of Occam he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity."

"Since all models are wrong the scientist must be alert to what is importantly wrong. It is inappropriate to be concerned about mice when there are tigers abroad."

Estimating $h^{\text{true}}$

Empirical risk minimization

$$\text{MSE} = \frac{1}{n} \sum_{i = 1}^n ( y_i - h )^2 \overset{\text{calculus}}\implies h^* = \text{mean}(y)$$
$$\text{MAE} = \frac{1}{n} \sum_{i = 1}^n | y_i - h | \overset{\text{algebra}}\implies h^* = \text{median}(y)$$
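A quick numerical sanity check of these two facts (not part of the original lecture code; it assumes the `tips` DataFrame from earlier):

```python
import numpy as np

y = tips['tip'].to_numpy()

def mse(h):
    return np.mean((y - h) ** 2)

def mae(h):
    return np.mean(np.abs(y - h))

# Try a grid of candidate constant predictions h and find the minimizers empirically.
candidates = np.linspace(y.min(), y.max(), 10001)
best_mse_h = candidates[np.argmin([mse(h) for h in candidates])]
best_mae_h = candidates[np.argmin([mae(h) for h in candidates])]

best_mse_h, np.mean(y)    # approximately equal: the mean minimizes MSE.
best_mae_h, np.median(y)  # approximately equal: the median minimizes MAE.
```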

The mean tip

Let's suppose we choose squared loss, meaning that $h^* = \text{mean}(y)$.

Let's visualize this prediction.

Note that to make predictions, this model ignores total bill (and all other features), and predicts the same tip for all tables.

The quality of predictions

$$\text{MSE} = \frac{1}{n} \sum_{i = 1}^n \big( y_i - H(x_i) \big)^2$$

Root mean squared error

$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i = 1}^n \big( y_i - H(x_i) \big)^2}$$

Computing and storing the RMSE

Since we'll compute the RMSE for our future models too, we'll define a function that can compute it for us.

Let's compute the RMSE of our constant tip's predictions, and store it in a dictionary that we can refer to later on.
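A sketch of what that function and dictionary might look like (the names `rmse` and `rmse_dict` are assumed here; the lecture code may differ):

```python
import numpy as np

def rmse(actual, pred):
    """Root mean squared error between observed values and predictions."""
    return np.sqrt(np.mean((actual - pred) ** 2))

# The constant model's single prediction: the mean tip.
mean_tip = tips['tip'].mean()

# Store each model's training RMSE so we can compare models later on.
rmse_dict = {}
rmse_dict['constant tip amount'] = rmse(tips['tip'], mean_tip)
rmse_dict
```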

Key idea: Since the mean minimizes RMSE for the constant model, it is impossible to change the mean_tip argument above to another number and yield a lower RMSE.

Model #2: Simple linear regression using total bill

$$\text{predicted tip} = w_0 + w_1 \cdot \text{total bill}$$

Recap: Simple linear regression

A simple linear regression model is a linear model with a single feature, as we have here. For any total bill $x_i$, the predicted tip $H(x_i)$ is given by

$$H(x_i) = w_0 + w_1x_i$$
$$\begin{align*}\text{MSE} &= \frac{1}{n} \sum_{i = 1}^n \big( y_i - H(x_i) \big)^2 \\ &= \frac{1}{n} \sum_{i = 1}^n \big( y_i - w_0 - w_1x_i \big)^2\end{align*}$$

Empirical risk minimization, by hand

$$\begin{align*}\text{MSE} &= \frac{1}{n} \sum_{i = 1}^n \big( y_i - w_0 - w_1x_i \big)^2\end{align*}$$
$$w_1^* = r \cdot \frac{\sigma_y}{\sigma_x}$$

$$w_0^* = \bar{y} - w_1^* \bar{x}$$
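These hand-derived formulas can be checked directly in code, sketched below under the assumption that `tips` is the DataFrame from earlier:

```python
# Compute w1* and w0* straight from the formulas above.
x = tips['total_bill']
y = tips['tip']

r = x.corr(y)                            # correlation coefficient
w1_star = r * y.std() / x.std()          # optimal slope
w0_star = y.mean() - w1_star * x.mean()  # optimal intercept

w0_star, w1_star  # roughly (0.92, 0.105)
```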

Regression in sklearn

sklearn

The LinearRegression class

LinearRegression fits a linear model with coefficients w = (w1, …, wp) to minimize the residual sum of squares between the observed targets in the dataset, and the targets predicted by the linear approximation.

Fitting a simple linear model

First, we must instantiate a LinearRegression object and fit it. By calling fit, we are saying "minimize mean squared error on this dataset and find $w^*$."

After fitting, we can access $w^*$ – that is, the best slope and intercept.
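A minimal sketch of those two steps (the variable name `lr_one_feature` is assumed here):

```python
from sklearn.linear_model import LinearRegression

# Instantiate and fit. X must be two-dimensional (a DataFrame or 2D array); y is the target.
lr_one_feature = LinearRegression()
lr_one_feature.fit(X=tips[['total_bill']], y=tips['tip'])

# The optimal intercept w0* and slope w1*.
lr_one_feature.intercept_, lr_one_feature.coef_
# (roughly 0.92 and array([0.105]))
```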

These coefficients tell us that the "best way" (according to squared loss) to make tip predictions using a linear model is using:

$$\text{predicted tip} = 0.92 + 0.105 \cdot \text{total bill}$$

This model assumes people tip by:

Let's visualize this model, along with our previous model.

Visually, our linear model seems to be a better fit for our dataset than our constant model.

Making predictions

Fit LinearRegression objects also have a predict method, which can be used to predict tips for any total bill, new or old.
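For example, a sketch of predicting tips for a few hypothetical total bills (continuing with `lr_one_feature` from above):

```python
import pandas as pd

# Predicted tips for new total bills of $15, $50, and $100.
new_bills = pd.DataFrame({'total_bill': [15, 50, 100]})
lr_one_feature.predict(new_bills)
```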

Comparing models

If we want to compute the RMSE of our model on the training data, we need to find its predictions on every row in the training data, tips.
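A sketch of that computation, reusing the assumed `rmse` function and `rmse_dict` from earlier:

```python
# Predictions for every row of the training data, then the model's training RMSE.
all_preds = lr_one_feature.predict(tips[['total_bill']])
rmse_dict['one feature: total bill'] = rmse(tips['tip'], all_preds)
rmse_dict
```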

Model #3: Multiple linear regression using total bill and table size

$$\text{predicted tip} = w_0 + w_1 \cdot \text{total bill} + w_2 \cdot \text{table size}$$

Multiple linear regression

To find the optimal parameters $w^*$, we will again use sklearn's LinearRegression class. The code is not all that different!
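A sketch of the two-feature fit (the name `lr_two_features` is assumed; `'size'` is the table-size column in the tips dataset):

```python
# Same class, same interface; the only change is that X now has two columns.
lr_two_features = LinearRegression()
lr_two_features.fit(X=tips[['total_bill', 'size']], y=tips['tip'])

# w0*, and the pair (w1*, w2*) for total bill and table size.
lr_two_features.intercept_, lr_two_features.coef_
```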

What does this model look like?

Plane of best fit ✈️

Here, we must draw a 3D scatter plot and plane, with one axis for total bill, one axis for table size, and one axis for tip. The code below does this.
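One possible sketch of such a plot using plotly (it assumes the `lr_two_features` model from the sketch above; the exact lecture code may differ):

```python
import numpy as np
import plotly.graph_objects as go

# Grid of (total bill, table size) pairs at which to evaluate the fitted plane.
xx, yy = np.meshgrid(
    np.linspace(tips['total_bill'].min(), tips['total_bill'].max(), 30),
    np.linspace(tips['size'].min(), tips['size'].max(), 30),
)
zz = (lr_two_features.intercept_
      + lr_two_features.coef_[0] * xx
      + lr_two_features.coef_[1] * yy)

fig = go.Figure()
fig.add_trace(go.Scatter3d(x=tips['total_bill'], y=tips['size'], z=tips['tip'],
                           mode='markers', name='observed tips'))
fig.add_trace(go.Surface(x=xx, y=yy, z=zz, showscale=False, name='fitted plane'))
fig.update_layout(scene=dict(xaxis_title='total bill',
                             yaxis_title='table size',
                             zaxis_title='tip'))
fig.show()
```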

Comparing models, again

How does our two-feature linear model stack up to our single feature linear model and our constant model?
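A sketch of the comparison, adding the two-feature model's training RMSE to the assumed `rmse_dict` and viewing all three side by side:

```python
import pandas as pd

# Training RMSE of the two-feature model.
all_preds_two = lr_two_features.predict(tips[['total_bill', 'size']])
rmse_dict['two features: total bill and table size'] = rmse(tips['tip'], all_preds_two)

# Lower RMSE means a better fit to the training data.
pd.Series(rmse_dict).sort_values()
```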

Conclusion

Summary, next time

Summary

Next time