In [1]:
from dsc80_utils import *

Lecture 18 – Classifier Evaluation, Conclusion, Final Review¶

DSC 80, Fall 2023¶

📣 Announcements 📣¶

  • Course evals (SET and End-of-Quarter Survey) due tomorrow at 11:59pm.
    • If 85% of the class fills them out, everyone gets +1% on their final exam grade!
  • Final Exam on Mon, Dec 11, 3-6pm in WLH 2005 (our usual lecture room).
  • Final Project due Wed, Dec 13.
    • No slip days allowed, since we need to turn in grades right after it's due.

📝 Final Exam¶

  • Monday, Dec 11, 3-6pm in WLH 2005 (usual lecture room).
    • The exam is written to take about 2 hours, so you'll have plenty of time to double-check your work.
  • Two 8.5" x 11" cheat sheets of your own creation are allowed (handwritten on a tablet and then printed is okay).
  • Covers every lecture, lab, and project.
  • Similar format to the midterm: mix of fill-in-the-blank, multiple choice, and free response.
    • I use pandas fill-in-the-blank questions to test your ability to read code as well as write it, not just write code from scratch, which is why they can feel trickier.
  • Questions on final about pre-Midterm material will be marked as "M". Your Midterm grade will be the higher of your (z-score adjusted) grades on the Midterm and the questions marked as "M" on the final.

🙋🙋🏽‍♀️ Questions?¶

https://app.sli.do/event/2LZSnXWNpGPiuVnCZMa5J8


Example: Tumor malignancy prediction (via logistic regression)¶

Wisconsin breast cancer dataset¶

The Wisconsin breast cancer dataset (WBCD) is a commonly-used dataset for demonstrating binary classification. It is built into sklearn.datasets.

In [2]:
from sklearn.datasets import load_breast_cancer
loaded = load_breast_cancer() # explore the value of `loaded`!
data = loaded['data']
labels = 1 - loaded['target']  # flip the labels so that 1 means malignant and 0 means benign
cols = loaded['feature_names']
bc = pd.DataFrame(data, columns=cols)
In [3]:
bc.head()
Out[3]:
mean radius mean texture mean perimeter mean area ... worst concavity worst concave points worst symmetry worst fractal dimension
0 17.99 10.38 122.80 1001.0 ... 0.71 0.27 0.46 0.12
1 20.57 17.77 132.90 1326.0 ... 0.24 0.19 0.28 0.09
2 19.69 21.25 130.00 1203.0 ... 0.45 0.24 0.36 0.09
3 11.42 20.38 77.58 386.1 ... 0.69 0.26 0.66 0.17
4 20.29 14.34 135.10 1297.0 ... 0.40 0.16 0.24 0.08

5 rows × 30 columns

1 stands for "malignant", i.e. cancerous, and 0 stands for "benign", i.e. safe.

In [4]:
labels[:5]
Out[4]:
array([1, 1, 1, 1, 1])
In [5]:
pd.Series(labels).value_counts(normalize=True)
Out[5]:
0    0.63
1    0.37
dtype: float64

Our goal is to use the features in bc to predict labels.

Aside: Logistic regression¶

Logistic regression is a linear classification technique that builds upon linear regression. It models the probability of belonging to class 1, given a feature vector:

$$P(y = 1 | \vec{x}) = \sigma (\underbrace{w_0 + w_1 x^{(1)} + w_2 x^{(2)} + ... + w_d x^{(d)}}_{\text{linear regression model}})$$

Here, $\sigma(t) = \frac{1}{1 + e^{-t}}$ is the sigmoid function; its outputs are between 0 and 1 (which means they can be interpreted as probabilities).
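As a quick sanity check of the formula above (a small sketch added here, not one of the original lecture cells), the sigmoid squashes any real-valued linear score into the interval (0, 1):

import numpy as np

def sigmoid(t):
    # Maps any real number t to a value strictly between 0 and 1.
    return 1 / (1 + np.exp(-t))

# Large negative scores map near 0, large positive scores map near 1,
# and a score of exactly 0 maps to 0.5.
sigmoid(np.array([-5, -1, 0, 1, 5]))
# array([0.01, 0.27, 0.5 , 0.73, 0.99])  (rounded)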

🤔 Question: Suppose our logistic regression model predicts the probability that a tumor is malignant is 0.75. What class do we predict – malignant or benign? What if the predicted probability is 0.3?

🙋 Answer: We have to pick a threshold (e.g. 0.5)!

  • If the predicted probability is above the threshold, we predict malignant (1).
  • Otherwise, we predict benign (0).
  • In practice, we use cross-validation (CV) to choose the threshold.

Fitting a logistic regression model¶

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
In [7]:
X_train, X_test, y_train, y_test = train_test_split(bc, labels)
In [8]:
clf = LogisticRegression(max_iter=10000)
clf.fit(X_train, y_train)
Out[8]:
LogisticRegression(max_iter=10000)

How did clf come up with 1s and 0s?

In [9]:
clf.predict(X_test)
Out[9]:
array([1, 1, 1, ..., 1, 0, 0])

It turns out that the predicted labels come from applying a threshold of 0.5 to the predicted probabilities. We can access the predicted probabilities via the predict_proba method:

In [10]:
# Each row is [P(class 0), P(class 1)]; the [:, 1] column gives the predicted probability of class 1
clf.predict_proba(X_test)
Out[10]:
array([[0.02, 0.98],
       [0.  , 1.  ],
       [0.  , 1.  ],
       ...,
       [0.  , 1.  ],
       [0.91, 0.09],
       [1.  , 0.  ]])
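As a sanity check (an added sketch, not an original cell), thresholding the class-1 probabilities at 0.5 should reproduce the labels that clf.predict gave us:

# Apply the 0.5 threshold to P(y = 1 | x) by hand and compare to clf.predict.
manual_preds = (clf.predict_proba(X_test)[:, 1] >= 0.5).astype(int)
np.all(manual_preds == clf.predict(X_test))  # expect True (exact ties at 0.5 are essentially impossible)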

Note that our model still has $w^*$s:

In [11]:
clf.intercept_
Out[11]:
array([-26.93])
In [12]:
clf.coef_
Out[12]:
array([[-1.04, -0.17,  0.19, ...,  0.48,  0.43,  0.1 ]])
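To connect these numbers back to the formula from earlier, here's a hedged sketch (an addition, not an original cell) that recomputes the class-1 probabilities by hand from the intercept and coefficients:

# Linear score w0 + w1 x^(1) + ... + wd x^(d) for every test point,
# then passed through the sigmoid to get P(y = 1 | x).
scores = X_test @ clf.coef_[0] + clf.intercept_[0]
manual_probs = 1 / (1 + np.exp(-scores))

# Should match the second column of predict_proba (up to floating-point error).
np.allclose(manual_probs, clf.predict_proba(X_test)[:, 1])  # expect True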

Evaluating our model¶

Let's see how well our model does on the test set.

In [13]:
from sklearn import metrics
In [14]:
y_pred = clf.predict(X_test)

Which metric is more important for this task – precision or recall?

In [15]:
metrics.confusion_matrix(y_test, y_pred)
Out[15]:
array([[87,  3],
       [ 5, 48]])
In [16]:
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test);
plt.grid(False)
(Figure: confusion matrix plot for the test set.)
In [17]:
metrics.accuracy_score(y_test, y_pred)
Out[17]:
0.9440559440559441
In [18]:
metrics.precision_score(y_test, y_pred)
Out[18]:
0.9411764705882353
In [19]:
metrics.recall_score(y_test, y_pred)
Out[19]:
0.9056603773584906
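To see exactly where these numbers come from, here's a small added sketch that recovers precision and recall directly from the confusion matrix entries:

# sklearn lays out the confusion matrix as [[TN, FP], [FN, TP]].
(tn, fp), (fn, tp) = metrics.confusion_matrix(y_test, y_pred)

# precision = TP / (TP + FP), recall = TP / (TP + FN).
tp / (tp + fp), tp / (tp + fn)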

What if we choose a different threshold?¶

🤔 Question: Suppose we choose a threshold higher than 0.5. What will happen to our model's precision and recall?

🙋 Answer: Precision will increase, while recall will decrease*.

  • If the "bar" is higher to predict 1, then we will have fewer false positives.
  • The denominator in $\text{precision} = \frac{TP}{TP + FP}$ will get smaller, and so precision will increase.
  • However, the number of false negatives will increase, as we are being more "strict" about what we classify as positive, and so $\text{recall} = \frac{TP}{TP + FN}$ will decrease.
  • *It is possible for either or both to stay the same, if changing the threshold slightly (e.g. from 0.5 to 0.500001) doesn't change any predictions.

Similarly, if we decrease our threshold, our model's precision will decrease, while its recall will increase.

Trying several thresholds¶

The classification threshold is not actually a hyperparameter of LogisticRegression, because the threshold doesn't change the coefficients ($w^*$s) of the logistic regression model itself (see this article for more details).

  • Still, the threshold affects our decision rule, so we can tune it using CV.
  • It's also useful to plot how our metrics change as we change the threshold.
In [20]:
thresholds = np.arange(0.01, 1.01, 0.01)
precisions = np.array([])
recalls = np.array([])

for t in thresholds:
    y_pred = clf.predict_proba(X_test)[:, 1] >= t
    precisions = np.append(precisions, metrics.precision_score(y_test, y_pred, zero_division=1))
    recalls = np.append(recalls, metrics.recall_score(y_test, y_pred))

Let's visualize the results in plotly, which is interactive.

In [21]:
px.line(x=thresholds, y=precisions,
        labels={'x': 'Threshold', 'y': 'Precision'}, title='Precision vs. Threshold', width=1000, height=600)
In [22]:
px.line(x=thresholds, y=recalls, 
        labels={'x': 'Threshold', 'y': 'Recall'}, title='Recall vs. Threshold', width=1000, height=600)
In [23]:
px.line(x=recalls, y=precisions, hover_name=thresholds, 
        labels={'x': 'Recall', 'y': 'Precision'}, title='Precision vs. Recall')

The above curve is called a precision-recall (or PR) curve.

🤔 Question: Based on the PR curve above, what threshold would you choose?
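One principled (hedged) answer: rather than eyeballing the test-set PR curve, tune the threshold on the training set with cross-validation, as mentioned earlier. The helper below is an illustrative sketch, not part of sklearn or the original lecture; it picks the threshold with the best average F1-score across folds.

from sklearn.model_selection import KFold

def best_threshold_cv(model, X, y, thresholds, n_splits=5):
    # For each fold, fit the model on the training portion, then score every
    # candidate threshold on the validation portion; return the threshold
    # with the highest average F1-score across folds.
    X, y = np.asarray(X), np.asarray(y)
    avg_f1 = np.zeros(len(thresholds))
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    for train_idx, val_idx in kf.split(X):
        model.fit(X[train_idx], y[train_idx])
        probs = model.predict_proba(X[val_idx])[:, 1]
        for i, t in enumerate(thresholds):
            avg_f1[i] += metrics.f1_score(y[val_idx], probs >= t, zero_division=0) / n_splits
    return thresholds[np.argmax(avg_f1)]

best_threshold_cv(LogisticRegression(max_iter=10000), X_train, y_train, thresholds)

Whatever threshold this picks, the final model should still be evaluated once on the held-out test set.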

Combining precision and recall¶

If we care equally about a model's precision $PR$ and recall $RE$, we can combine the two using a single metric called the F1-score:

$$\text{F1-score} = \text{harmonic mean}(PR, RE) = 2\frac{PR \cdot RE}{PR + RE}$$
In [24]:
pr = metrics.precision_score(y_test, clf.predict(X_test))
re = metrics.recall_score(y_test, clf.predict(X_test))

2 * pr * re / (pr + re)
Out[24]:
0.923076923076923
In [25]:
metrics.f1_score(y_test, clf.predict(X_test))
Out[25]:
0.923076923076923

Both F1-score and accuracy are overall measures of a binary classifier's performance. But remember, accuracy is misleading in the presence of class imbalance, and doesn't take into account the kinds of errors the classifier makes.

In [26]:
metrics.accuracy_score(y_test, clf.predict(X_test))
Out[26]:
0.9440559440559441
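To make the class-imbalance point concrete, here's a tiny hypothetical example (made-up labels, added here, not the tumor data): a "classifier" that always predicts 0 looks very accurate, but its recall and F1-score reveal that it's useless.

# 95% of these hypothetical labels are benign (0), 5% malignant (1).
y_imbalanced = np.array([0] * 95 + [1] * 5)
y_always_benign = np.zeros(100, dtype=int)

(metrics.accuracy_score(y_imbalanced, y_always_benign),  # 0.95 -- looks great!
 metrics.recall_score(y_imbalanced, y_always_benign),    # 0.0 -- catches no malignant tumors
 metrics.f1_score(y_imbalanced, y_always_benign))        # 0.0 (with an undefined-precision warning)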

Other evaluation metrics for binary classifiers¶

We just scratched the surface! This excellent table from Wikipedia summarizes the many other metrics that exist.

(Figure: Wikipedia's summary table of binary classification metrics.)

If you're interested in exploring further, a good next metric to look at is true negative rate (i.e. specificity), which is the analogue of recall for true negatives.
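As a starting point (an added sketch), specificity can be computed from the same confusion matrix; sklearn has no dedicated specificity function, but recall with pos_label=0 is equivalent:

# Specificity (true negative rate) = TN / (TN + FP).
(tn, fp), (fn, tp) = metrics.confusion_matrix(y_test, clf.predict(X_test))
tn / (tn + fp)

# Equivalently: recall, but treating the negative class (0) as the "positive" label.
metrics.recall_score(y_test, clf.predict(X_test), pos_label=0)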

🙋🙋🏽‍♀️ Questions?¶

https://app.sli.do/event/2LZSnXWNpGPiuVnCZMa5J8


Parting thoughts¶


Course goals ✅¶

In this course, you...

  • Practiced translating potentially vague questions into quantitative questions about measurable observations.
  • Learned to reason about 'black-box' processes (e.g. complicated models).
  • Understood computational and statistical implications of working with data.
  • Learned to use real data tools (e.g. love the documentation!).
  • Got a taste of the "life of a data scientist".

Course outcomes ✅¶

Now, you...

  • Are prepared for internships and data science "take home" interviews!
  • Are ready to create your own portfolio of personal projects.
  • Have the background and maturity to succeed in upper-division courses.

Topics covered ✅¶

We learned a lot this quarter.

  • Week 1: From BabyPandas to Pandas
  • Week 2: DataFrames
  • Week 3: Messy Data, Hypothesis Testing
  • Week 4: Missing Values and Imputation
  • Week 5: HTTP, Midterm Exam
  • Week 6: Web Scraping, Regex
  • Week 7: Text Features, Regression
  • Week 8: Feature Engineering
  • Week 9: Generalization, CV, Decision Trees
  • Week 10: Random Forests, Classifier Evaluation
  • Week 11: Final Exam

🛠️ Guide to doing independent work, getting research lab positions, internships, etc.¶

[I'll draw on tablet for this]

Thank you!¶

  • This course would not have been possible without our 7 tutors and 2 TAs: Dylan Stockard, Giorgia Nicolaou, Gabriel Cha, Lauren (Luran) Zhang, John (Jiayu) Chen, Sunan Xu, Doris (Ge) Gao, Tiffany Yu, and Zelong Wang.

  • Don't be a stranger – our contact information is at dsc80.com/staff!

    • This quarter's course website will remain online permanently at dsc-courses.github.io.
  • Apply to be a tutor in the future! Learn more here.

Final Review¶

With the time we have left, I'll cover tricky questions from past exams, as well as questions you may have!

https://app.sli.do/event/2LZSnXWNpGPiuVnCZMa5J8


Good luck on the Final Exam, and enjoy your winter break! 🎉