from dsc80_utils import *
Announcements 📣¶
- Midterm Survey due tonight: https://forms.gle/8pMeYeHk6ktLa4867
- If ≥80% of the class fills it out, everyone will get +1% EC on the midterm.
- Lab 6 is due tomorrow.
- The Final Project will be released next week.
Agenda 📆¶
- Last bits of TF-IDF.
- Modeling.
- Case study: Restaurant tips 🧑🍳.
- Regression in sklearn.
- Announcement about HDSI career services.
Conceptually, today will mostly be review from DSC 40A, but we'll introduce a few new practical tools that we'll build upon next week.
State of the Union addresses¶
from pathlib import Path

sotu_txt = Path('data') / 'stateoftheunion1790-2023.txt'
sotu = sotu_txt.read_text()
# Speeches are separated by '***'; the first chunk is file header text, so skip it.
speeches = sotu.split('\n***\n')[1:]
import re
def extract_struct(speech):
    # Split into at most 4 pieces: title, president, date, and contents.
    L = speech.strip().split('\n', maxsplit=3)
    # Keep only letters, apostrophes, and spaces in the contents, then lowercase.
    L[3] = re.sub(r"[^A-Za-z' ]", ' ', L[3]).lower()
    return dict(zip(['speech', 'president', 'date', 'contents'], L))
speeches_df = pd.DataFrame(list(map(extract_struct, speeches)))
speeches_df
 | speech | president | date | contents |
---|---|---|---|---|
0 | State of the Union Address | George Washington | January 8, 1790 | fellow citizens of the senate and house of re... |
1 | State of the Union Address | George Washington | December 8, 1790 | fellow citizens of the senate and house of re... |
2 | State of the Union Address | George Washington | October 25, 1791 | fellow citizens of the senate and house of re... |
... | ... | ... | ... | ... |
230 | State of the Union Address | Joseph R. Biden Jr. | April 28, 2021 | thank you thank you thank you good to be b... |
231 | State of the Union Address | Joseph R. Biden Jr. | March 1, 2022 | madam speaker madam vice president and our ... |
232 | State of the Union Address | Joseph R. Biden Jr. | February 7, 2023 | mr speaker madam vice president our firs... |
233 rows × 4 columns
Finding the most important words in each speech¶
Here, a "document" is a speech. We have 233 documents.
speeches_df
 | speech | president | date | contents |
---|---|---|---|---|
0 | State of the Union Address | George Washington | January 8, 1790 | fellow citizens of the senate and house of re... |
1 | State of the Union Address | George Washington | December 8, 1790 | fellow citizens of the senate and house of re... |
2 | State of the Union Address | George Washington | October 25, 1791 | fellow citizens of the senate and house of re... |
... | ... | ... | ... | ... |
230 | State of the Union Address | Joseph R. Biden Jr. | April 28, 2021 | thank you thank you thank you good to be b... |
231 | State of the Union Address | Joseph R. Biden Jr. | March 1, 2022 | madam speaker madam vice president and our ... |
232 | State of the Union Address | Joseph R. Biden Jr. | February 7, 2023 | mr speaker madam vice president our firs... |
233 rows × 4 columns
A rough sketch of what we'll compute:
for each word t:
    for each speech d:
        compute tfidf(t, d)
unique_words = speeches_df['contents'].str.split().explode().value_counts()
# Take the top 500 most common words for speed
unique_words = unique_words.iloc[:500].index
unique_words
Index(['the', 'of', 'to', 'and', 'in', 'a', 'that', 'for', 'be', 'our', ... 'desire', 'call', 'submitted', 'increasing', 'months', 'point', 'trust', 'throughout', 'set', 'object'], dtype='object', name='contents', length=500)
💡 Pro-Tip: Using tqdm¶
This code takes a while to run, so we'll use the tqdm package to track its progress. (Install it with mamba install tqdm if needed.)
from tqdm.notebook import tqdm

tfidf_dict = {}
tf_denom = speeches_df['contents'].str.split().str.len()

# Wrap the sequence with `tqdm()` to display a progress bar.
for word in tqdm(unique_words):
    re_pat = fr' {word} '  # Imperfect pattern for speed.
    tf = speeches_df['contents'].str.count(re_pat) / tf_denom
    idf = np.log(len(speeches_df) / speeches_df['contents'].str.contains(re_pat).sum())
    tfidf_dict[word] = tf * idf
tfidf = pd.DataFrame(tfidf_dict)
tfidf.head()
 | the | of | to | and | ... | trust | throughout | set | object |
---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 4.29e-04 | 0.00e+00 | 0.00e+00 | 2.04e-03 |
1 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.00e+00 | 0.00e+00 | 0.00e+00 | 1.06e-03 |
2 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 4.06e-04 | 0.00e+00 | 3.48e-04 | 6.44e-04 |
3 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 6.70e-04 | 2.17e-04 | 0.00e+00 | 7.09e-04 |
4 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 2.38e-04 | 4.62e-04 | 0.00e+00 | 3.77e-04 |
5 rows × 500 columns
Note that the TF-IDFs of many common words are all 0!
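We can verify this directly: any word that appears in all 233 speeches has $\text{idf}(t) = \log \frac{233}{233} = 0$, which zeroes out its TF-IDF in every document. A quick sanity check (a sketch using the tfidf DataFrame from above):
# Should print True for each column if these words appear in every speech.
tfidf[['the', 'of', 'to', 'and']].eq(0).all()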
Summarizing speeches¶
By using idxmax, we can find the word with the highest TF-IDF in each speech.
summaries = tfidf.idxmax(axis=1)
summaries
0          object
1      convention
2       provision
          ...
230          it's
231       tonight
232          it's
Length: 233, dtype: object
What if we want to see the 5 words with the highest TF-IDFs, for each speech?
def five_largest(row):
    # Sort the row's TF-IDF values (ascending) and keep the last 5 words.
    return ', '.join(row.index[row.argsort()][-5:])
keywords = tfidf.apply(five_largest, axis=1)
keywords_df = pd.concat([
    speeches_df['president'],
    speeches_df['date'],
    keywords
], axis=1)
keywords_df
 | president | date | 0 |
---|---|---|---|
0 | George Washington | January 8, 1790 | your, proper, regard, ought, object |
1 | George Washington | December 8, 1790 | case, established, object, commerce, convention |
2 | George Washington | October 25, 1791 | community, upon, lands, proper, provision |
... | ... | ... | ... |
230 | Joseph R. Biden Jr. | April 28, 2021 | get, americans, percent, jobs, it's |
231 | Joseph R. Biden Jr. | March 1, 2022 | let, jobs, americans, get, tonight |
232 | Joseph R. Biden Jr. | February 7, 2023 | down, percent, jobs, tonight, it's |
233 rows × 3 columns
Uncomment the cell below to see every single row of keywords_df.
# display_df(keywords_df, rows=233)
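An aside on five_largest: instead of argsort, pandas' Series.nlargest gets the same five words more directly (in descending rather than ascending order). A sketch; five_largest_alt is a hypothetical name:
def five_largest_alt(row):
    # nlargest(5) keeps the 5 highest TF-IDF values; .index recovers the words.
    return ', '.join(row.nlargest(5).index)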
Aside: What if we remove the $\log$ from $\text{idf}(t)$?¶
Let's try it and see what happens.
tfidf_nl_dict = {}
tf_denom = speeches_df['contents'].str.split().str.len()

for word in tqdm(unique_words):
    re_pat = fr' {word} '  # Imperfect pattern for speed.
    tf = speeches_df['contents'].str.count(re_pat) / tf_denom
    idf_nl = len(speeches_df) / speeches_df['contents'].str.contains(re_pat).sum()
    tfidf_nl_dict[word] = tf * idf_nl
tfidf_nl = pd.DataFrame(tfidf_nl_dict)
tfidf_nl.head()
 | the | of | to | and | ... | trust | throughout | set | object |
---|---|---|---|---|---|---|---|---|---|
0 | 0.09 | 0.06 | 0.05 | 0.04 | ... | 1.47e-03 | 0.00e+00 | 0.00e+00 | 5.78e-03 |
1 | 0.09 | 0.06 | 0.03 | 0.03 | ... | 0.00e+00 | 0.00e+00 | 0.00e+00 | 2.99e-03 |
2 | 0.11 | 0.07 | 0.04 | 0.03 | ... | 1.39e-03 | 0.00e+00 | 1.30e-03 | 1.82e-03 |
3 | 0.09 | 0.07 | 0.04 | 0.03 | ... | 2.29e-03 | 7.53e-04 | 0.00e+00 | 2.01e-03 |
4 | 0.09 | 0.07 | 0.04 | 0.02 | ... | 8.12e-04 | 1.60e-03 | 0.00e+00 | 1.07e-03 |
5 rows × 500 columns
keywords_nl = tfidf_nl.apply(five_largest, axis=1)
keywords_nl_df = pd.concat([
    speeches_df['president'],
    speeches_df['date'],
    keywords_nl
], axis=1)
keywords_nl_df
 | president | date | 0 |
---|---|---|---|
0 | George Washington | January 8, 1790 | a, and, to, of, the |
1 | George Washington | December 8, 1790 | in, and, to, of, the |
2 | George Washington | October 25, 1791 | a, and, to, of, the |
... | ... | ... | ... |
230 | Joseph R. Biden Jr. | April 28, 2021 | of, it's, and, to, the |
231 | Joseph R. Biden Jr. | March 1, 2022 | we, of, to, and, the |
232 | Joseph R. Biden Jr. | February 7, 2023 | a, of, and, to, the |
233 rows × 3 columns
The role of $\log$ in $\text{idf}(t)$¶
$$ \begin{align*} \text{tfidf}(t, d) &= \text{tf}(t, d) \cdot \text{idf}(t) \\ &= \frac{\text{\# of occurrences of $t$ in $d$}}{\text{total \# of words in $d$}} \cdot \log \left(\frac{\text{total \# of documents}}{\text{\# of documents in which $t$ appears}} \right) \end{align*} $$
- Remember, for any positive input $x$, $\log(x)$ is (much) smaller than $x$.
- In $\text{idf}(t)$, the $\log$ "dampens" the impact of the ratio $\frac{\text{\# documents}}{\text{\# documents with $t$}}$.
- If a word is very common, the ratio will be close to 1. The log of the ratio will be close to 0.
(1000 / 999)
1.001001001001001
np.log(1000 / 999)
np.float64(0.001000500333583622)
- If a word is very common (e.g. 'the'), its ratio is close to 1 and so its $\log$-IDF is close to 0. Removing the $\log$ replaces that near-zero value with a value near 1, multiplying the statistic by a large factor and letting common words dominate.
- If a word is very rare, the ratio will be very large. However, a word seen in 2 out of 50 documents is not meaningfully different from one seen in 2 out of 500 documents (it is very rare in both cases), so $\text{idf}(t)$ should be similar in both cases.
(50 / 2)
25.0
(500 / 2)
250.0
np.log(50 / 2)
np.float64(3.2188758248682006)
np.log(500 / 2)
np.float64(5.521460917862246)
Question 🤔 (Answer at dsc80.com/q)
Code: tfidf
From the Fa23 final: Consider the following corpus:
Document number | Content
---|---
1 | yesterday rainy today sunny
2 | yesterday sunny today sunny
3 | today rainy yesterday today
4 | yesterday yesterday today today
Which words have a TF-IDF score of 0 for all four documents?
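After answering, you can check yourself with code. A minimal sketch that applies the same TF-IDF definition used above to this toy corpus (docs is a made-up name):
docs = pd.Series([
    'yesterday rainy today sunny',
    'yesterday sunny today sunny',
    'today rainy yesterday today',
    'yesterday yesterday today today',
])
for t in ['yesterday', 'rainy', 'today', 'sunny']:
    # tf: proportion of each document's words equal to t; idf: log(N / # docs containing t).
    tf = docs.str.split().apply(lambda words: words.count(t)) / docs.str.split().str.len()
    idf = np.log(len(docs) / docs.str.contains(t).sum())
    print(t, (tf * idf).round(3).tolist())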
Modeling¶
Reflection¶
So far this quarter, we've learned how to:
- Extract information from tabular data using pandas and regular expressions.
- Clean data so that it best represents an underlying data generating process.
    - Missingness analyses and imputation.
- Collect data from the internet through scraping and APIs, and parse it using BeautifulSoup.
- Perform exploratory data analysis through aggregation, visualization, and the computation of summary statistics like TF-IDF.
- Infer about the relationships between samples and populations through hypothesis and permutation testing.
Now, let's make predictions.
Modeling¶
A model is a set of assumptions about how data were generated.
George Box, a famous statistician, once said "All models are wrong, but some are useful." What did he mean?
Philosophy¶
"It has been said that "all models are wrong but some models are useful." In other words, any model is at best a useful fiction—there never was, or ever will be, an exactly normal distribution or an exact linear relationship. Nevertheless, enormous progress has been made by entertaining such fictions and using them as approximations."
"Since all models are wrong the scientist cannot obtain a "correct" one by excessive elaboration. On the contrary following William of Occam he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity."
Goals of modeling¶
- To make accurate predictions regarding unseen data.
    - Given this dataset of past UCSD data science students' salaries, can we predict your future salary? (regression)
    - Given this dataset of images, can we predict if this new image is of a dog, cat, or zebra? (classification)
- To make inferences about complex phenomena in nature.
    - Is there a linear relationship between the heights of children and the heights of their biological mothers?
    - The weights of babies born to smoking and non-smoking mothers in my sample are different; how confident am I that this difference exists in the population?
Of these two goals, we will focus on prediction.
In the above taxonomy, we will focus on supervised learning.
We'll start with regression before moving to classification.
Features¶
A feature is a measurable property of a phenomenon being observed.
- Other terms for "feature" include "(explanatory) variable" and "attribute".
- Typically, features are the inputs to models.
In DataFrames, features typically correspond to columns, while rows typically correspond to different individuals.
Some features come as part of a dataset, e.g. weight and height, but others we need to create given existing features, for example: $$\text{BMI} = \frac{\text{weight (kg)}}{\text{[height (m)]}^2}$$
Example: TF-IDF creates features that summarize documents!
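As a hypothetical illustration of creating such a feature, here's a minimal sketch; the people DataFrame and its values are made up:
# Hypothetical data, for illustration only.
people = pd.DataFrame({
    'weight (kg)': [70, 85, 62],
    'height (m)': [1.75, 1.80, 1.60],
})
# Derived feature: BMI = weight (kg) / [height (m)]^2.
people['BMI'] = people['weight (kg)'] / people['height (m)'] ** 2
people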
Example: Restaurant tips 🧑🍳¶
About the data¶
What features does the dataset contain? Is this likely a recent dataset, or an older one?
# The dataset is built into plotly!
tips = px.data.tips()
tips
 | total_bill | tip | sex | smoker | day | time | size |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
... | ... | ... | ... | ... | ... | ... | ... |
241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 |
242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |
243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 |
244 rows × 7 columns
Predicting tips¶
- Goal: Given various information about a table at a restaurant, we want to predict the tip that a server will earn.
- Why might a server be interested in doing this?
- To determine which tables are likely to tip the most (inference).
- To predict earnings over the next month (prediction).
Exploratory data analysis¶
- The most natural feature to look at first is total bills.
- As such, we should explore the relationship between total bills and tips. Moving forward:
- $x$: Total bills.
- $y$: Tips.
fig = tips.plot(kind='scatter', x='total_bill', y='tip', title='Tip vs. Total Bill')
fig.update_layout(xaxis_title='Total Bill', yaxis_title='Tip')
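As a preview of regression in sklearn, here's a minimal sketch of the kind of model we'll build on this data, using sklearn's standard LinearRegression API (we'll unpack this properly soon):
from sklearn.linear_model import LinearRegression

# Minimal sketch: fit tip ≈ w0 + w1 * total_bill.
model = LinearRegression()
model.fit(X=tips[['total_bill']], y=tips['tip'])
model.intercept_, model.coef_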