Lecture 25 – Grid Search, Multicollinearity, Examples

DSC 80, Winter 2023

Announcements

Agenda

Example: Predicting diabetes

Recall, we started with a relatively simple decision tree.

Goal

Create a DecisionTreeClassifier that generalizes well to unseen data, by choosing hyperparameters that maximize average validation accuracy.

GridSearchCV takes in an un-fit instance of an estimator and a dictionary of hyperparameter values to try, and performs $k$-fold cross-validation to find the combination of hyperparameters with the best average validation performance.

The following dictionary contains the values we're considering for each hyperparameter. (We're using GridSearchCV with 3 hyperparameters, but we could use it with even just a single hyperparameter.)
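For illustration, a dictionary like the one below would work; the specific hyperparameters and values here are placeholders, chosen so that there are $10 \cdot 7 \cdot 2 = 140$ combinations:

```python
# Hypothetical grid of hyperparameter values to try.
# 10 * 7 * 2 = 140 combinations in total.
hyperparameters = {
    'max_depth': [2, 3, 4, 5, 7, 10, 13, 15, 18, None],   # 10 values
    'min_samples_split': [2, 5, 10, 20, 50, 100, 200],    # 7 values
    'criterion': ['gini', 'entropy'],                      # 2 values
}
```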

Note that there are 140 combinations of hyperparameters we need to try. We need to find the best combination of hyperparameters, not the best value for each hyperparameter individually.

GridSearchCV needs to be instantiated and fit.

After being fit, the best_params_ attribute provides us with the best combination of hyperparameters to use.
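A minimal sketch, assuming the dictionary above is named hyperparameters and that X_train and y_train hold our training data:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# cv=5 runs 5-fold cross-validation for every combination of hyperparameters.
searcher = GridSearchCV(DecisionTreeClassifier(), hyperparameters, cv=5)
searcher.fit(X_train, y_train)

# The combination of hyperparameters with the best average validation accuracy.
searcher.best_params_
```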

All of the intermediate results – validation accuracies for each fold, mean validation accuracies, etc. – are stored in the cv_results_ attribute:
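For instance, it can be inspected as a DataFrame (assuming the searcher defined above):

```python
import pandas as pd

# One row per combination of hyperparameters; columns include the validation
# accuracy on each fold and the mean validation accuracy across folds.
pd.DataFrame(searcher.cv_results_)
```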

Note that the above DataFrame tells us that 5 * 140 = 700 models were trained in total!

Now that we've found the best combination of hyperparameters, we should fit a decision tree instance using those hyperparameters on our entire training set.
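One way to do that, continuing with the assumed X_train and y_train names from above:

```python
# Fit a fresh decision tree on the entire training set, using the best
# combination of hyperparameters found by the grid search.
final_tree = DecisionTreeClassifier(**searcher.best_params_)
final_tree.fit(X_train, y_train)
```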

Remember, searcher itself is a model object (we had to fit it). After performing $k$-fold cross-validation, behind the scenes, searcher is trained on the entire training set using the optimal combination of hyperparameters.

In other words, searcher makes the same predictions that final_tree does!
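A quick check (hypothetical, assuming X_test comes from the same train-test split):

```python
import numpy as np

# searcher, refit on the full training set, agrees with final_tree on every test point.
np.all(searcher.predict(X_test) == final_tree.predict(X_test))
```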

Choosing possible hyperparameter values

Key takeaways

Multicollinearity

Heights and weights

We have a dataset containing the weights and heights of 25,000 18-year-olds, taken from here.

Motivating example

Suppose we fit a simple linear regression model that uses height in inches to predict weight in pounds.

$$\text{predicted weight (pounds)} = w_0 + w_1 \cdot \text{height (inches)}$$

$w_0^*$ and $w_1^*$ are shown below, along with the model's testing RMSE.
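A sketch of how these could be computed, assuming X_train and X_test are DataFrames with a 'Height (Inches)' column, y_train and y_test contain weights in pounds, and a train-test split has already been performed (the variable names are assumptions):

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

# Simple linear regression: predict weight (pounds) from height (inches).
lr_one = LinearRegression()
lr_one.fit(X_train[['Height (Inches)']], y_train)

# w0* and w1*.
lr_one.intercept_, lr_one.coef_

# Testing RMSE.
np.sqrt(mean_squared_error(y_test, lr_one.predict(X_test[['Height (Inches)']])))
```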

Now, suppose we fit another regression model that uses both height in inches and height in centimeters to predict weight.

$$\text{predicted weight (pounds)} = w_0 + w_1 \cdot \text{height (inches)} + w_2 \cdot \text{height (cm)}$$

What are $w_0^*$, $w_1^*$, $w_2^*$, and the model's testing RMSE?
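Under the same naming assumptions, a sketch of the second model:

```python
# The redundant feature: height in centimeters is just 2.54 times height in inches.
X_train['Height (cm)'] = 2.54 * X_train['Height (Inches)']
X_test['Height (cm)'] = 2.54 * X_test['Height (Inches)']

lr_two = LinearRegression()
lr_two.fit(X_train[['Height (Inches)', 'Height (cm)']], y_train)

# w0*, (w1*, w2*), and the testing RMSE.
lr_two.intercept_, lr_two.coef_
np.sqrt(mean_squared_error(y_test, lr_two.predict(X_test[['Height (Inches)', 'Height (cm)']])))
```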

Observation: The intercept is the same as before (roughly -81.17), as is the testing RMSE. However, the coefficients on 'Height (Inches)' and 'Height (cm)' are massive in size!

What's going on?

Redundant features

Let's use simpler numbers for illustration. Suppose in the first model, $w_0^* = -80$ and $w_1^* = 3$.

$$\text{predicted weight (pounds)} = -80 + 3 \cdot \text{height (inches)}$$

In the second model, we have:

$$\begin{align*}\text{predicted weight (pounds)} &= w_0^* + w_1^* \cdot \text{height (inches)} + w_2^* \cdot \text{height (cm)} \\ &= w_0^* + w_1^* \cdot \text{height (inches)} + w_2^* \cdot \big( 2.54 \cdot \text{height (inches)} \big) \\ &= w_0^* + \left(w_1^* + 2.54 \cdot w_2^* \right) \cdot \text{height (inches)} \end{align*}$$

In the first model, we already found the "best" intercept ($-80$) and slope ($3$) in a linear model that uses height in inches to predict weight.

So, as long as $w_1^* + 2.54 \cdot w_2^* = 3$ in the second model, the second model's training predictions will be the same as the first, and hence they will also minimize RMSE.

Infinitely many parameter choices

Issue: There are an infinite number of $w_1^*$ and $w_2^*$ values that satisfy $w_1^* + 2.54 \cdot w_2^* = 3$!

$$\text{predicted weight} = -80 - 10 \cdot \text{height (inches)} + \frac{13}{2.54} \cdot \text{height (cm)}$$
$$\text{predicted weight} = -80 + 10 \cdot \text{height (inches)} - \frac{7}{2.54} \cdot \text{height (cm)}$$

Multicollinearity

One hot encoding and multicollinearity

When we one hot encode categorical features, we create several redundant columns.

Aside: You can use pd.get_dummies in EDA, but don't use it for modeling (instead, use OneHotEncoder, which works with Pipelines).

Remember that under the hood, LinearRegression() creates a design matrix that has a column of all ones (for the intercept term). Let's add that column above for demonstration.

Now, many of the above columns can be written as linear combinations of other columns!
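For instance, supposing the one hot encoded DataFrame (with the column of all ones added) is called ohe, and using hypothetical column names from the tips dataset:

```python
# With the all-ones column present, one column per categorical feature is redundant:
#   smoker_Yes = 1 - smoker_No
#   day_Thur   = 1 - day_Fri - day_Sat - day_Sun
(ohe['smoker_Yes'] == 1 - ohe['smoker_No']).all()
(ohe['day_Thur'] == 1 - ohe['day_Fri'] - ohe['day_Sat'] - ohe['day_Sun']).all()
```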

Note that if we get rid of the four redundant columns above, the rank of our design matrix – that is, the number of linearly independent columns it has – does not change (and so the "predictive power" of our features doesn't change either).

However, without the redundant columns, there is only a single unique set of optimal parameters $w^*$, and the multicollinearity is no more.

Aside: Most one hot encoding techniques (including OneHotEncoder) have a built-in drop argument, which allows you to specify that you'd like to drop one column per categorical feature.
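A sketch, assuming the tips DataFrame and that tips_features contains its four categorical columns:

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical: the categorical columns in the tips dataset.
tips_features = ['sex', 'smoker', 'day', 'time']

# drop='first' drops the first category of each feature, so the resulting
# matrix has one fewer column per categorical feature.
ohe_drop = OneHotEncoder(drop='first')
ohe_drop.fit_transform(tips[tips_features]).toarray()
```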

The above array only has $(2-1) + (2-1) + (4-1) + (2-1) = 6$ columns, rather than $2 + 2 + 4 + 2 = 10$, since we dropped one column per categorical feature in tips_features.

Key takeaways

Example: Modeling using text features

Example: Predicting reviews

We have a dataset containing Amazon reviews and ratings for patio, lawn, and gardening products. (Aside: Here is a good source for such data.)

Goal: Use a review's 'summary' to predict its 'overall' rating.

Note that there are five possible 'overall' rating values – 1, 2, 3, 4, 5 – not just two. As such, this is an instance of multiclass classification.

Question: What is the worst possible accuracy we should expect from a ratings classifier, given the above distribution?
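One way to answer this, assuming the DataFrame of reviews is named reviews: compute the accuracy of a baseline classifier that always predicts the most common rating.

```python
# The proportion of reviews with the most common 'overall' rating.
# A classifier that always predicts that rating achieves this accuracy,
# so any reasonable classifier should do at least this well.
reviews['overall'].value_counts(normalize=True).max()
```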

Aside: CountVectorizer

Entries in the 'summary' column are not currently quantitative! We can use the bag of words encoding to create quantitative features out of each 'summary'.

Instead of performing a bag of words encoding manually as we did before, we can rely on sklearn's CountVectorizer. (There is also a TfidfVectorizer.)
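A sketch of how it's used; the corpus below is a made-up placeholder, not the one from lecture:

```python
from sklearn.feature_extraction.text import CountVectorizer

# A made-up corpus, just for illustration.
example_corp = ['billy likes your dog',
                'your dog likes to eat',
                'billy has a cool dog']

count_vec = CountVectorizer()
count_vec.fit(example_corp)

# Maps each word in the learned vocabulary to a column index.
count_vec.vocabulary_

# One row per document, one column per word; entries are word counts.
count_vec.transform(example_corp).toarray()
```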

count_vec learned a vocabulary from the corpus we fit it on.

Note that the values in count_vec.vocabulary_ correspond to the positions of the columns in count_vec.transform(example_corp).toarray(), i.e. 'billy' is the first column and 'your' is the last column.

Creating an initial Pipeline

Let's build a Pipeline that takes in summaries and overall ratings, transforms the summaries into word counts using CountVectorizer, and fits a random forest classifier to predict ratings.

But first, a train-test split (like always).

To start, we'll create a random forest with 7 trees (n_estimators), each of which has a maximum depth of 8 (max_depth).
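A sketch of such a Pipeline, assuming X_train and y_train contain the training summaries and ratings, and that the steps are named 'cv' and 'clf' (both names are assumptions):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

pl = Pipeline([
    # Turn each summary into a vector of word counts.
    ('cv', CountVectorizer()),
    # Random forest with 7 trees, each of depth at most 8.
    ('clf', RandomForestClassifier(n_estimators=7, max_depth=8)),
])

pl.fit(X_train, y_train)

# Training and testing accuracy.
pl.score(X_train, y_train), pl.score(X_test, y_test)
```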

The accuracy of our random forest is just above 50%, on both the training and testing sets. We'd get the same performance by predicting a rating of 5 every time!

Choosing tree depth via GridSearchCV

We arbitrarily chose max_depth=8 before, but it seems like that isn't working well. Let's perform a grid search to find the max_depth with the best generalization performance.

Note that while pl has already been fit, we can still give it to GridSearchCV, which will repeatedly re-fit it during cross-validation.
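A sketch of that grid search, assuming the 'clf' step name from the Pipeline sketch above; the candidate depths and number of folds are placeholders:

```python
from sklearn.model_selection import GridSearchCV

# In a Pipeline, hyperparameters are referenced as <step name>__<parameter name>.
hyperparameters = {
    'clf__max_depth': [8, 16, 32, 64, 128, None],  # placeholder values
}

grid_searcher = GridSearchCV(pl, hyperparameters, cv=5)
grid_searcher.fit(X_train, y_train)

grid_searcher.best_params_
```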

Recall, fit GridSearchCV objects are estimators on their own as well. This means we can compute the training and testing accuracies of the "best" random forest directly:
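For instance, continuing with the assumed names from above:

```python
# grid_searcher was refit on the entire training set with the best max_depth,
# so we can use it directly to compute accuracies.
grid_searcher.score(X_train, y_train), grid_searcher.score(X_test, y_test)
```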

Still not much better on the testing set! 🤷

Training and validation accuracy vs. depth

Below, we plot how training and validation accuracy varied with tree depth. Note that the $y$-axis here is accuracy, and that larger accuracies are better (unlike with RMSE, where smaller was better).

Unsurprisingly, training accuracy kept increasing, while validation accuracy leveled off around a depth of ~100.

Summary, next time

Summary

Next time

Metrics for measuring the performance of classifiers other than accuracy.