Lecture 19 – Bag of Words, TF-IDF

DSC 80, Winter 2023

📣 Announcements

Agenda

Bag of words 💰

Cosine similarity

TF-IDF

Example: San Diego employee salaries

Recall that we're working with a (real) dataset of salary data for all San Diego city employees.

Our goal is to quantify how similar two job titles are; so far, our metric has been the number of shared words between the two titles.

A counts matrix

Let's create a "counts" matrix, such that:
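A minimal sketch of this construction, using a few hypothetical job titles (in lecture, the titles come from the salaries dataset):

```python
import pandas as pd

# Hypothetical job titles; in lecture, these come from the salaries dataset.
jobtitles = pd.Series(['deputy fire chief', 'fire battalion chief', 'police officer'])

# One row per job title, one column per unique word; each entry counts how
# many times that word appears in that title.
all_words = jobtitles.str.split().explode()
counts = all_words.groupby(level=0).value_counts().unstack(fill_value=0)
counts.index = jobtitles  # label the rows with the job titles themselves
```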

Question: What job titles are most similar to 'deputy fire chief'?

To start, let's compare the row vectors for 'deputy fire chief' and 'fire battalion chief'.

We can stack these two vectors horizontally.

One way to measure how similar the above two vectors are is through their dot product.

Here, since both vectors consist only of 1s and 0s, the dot product is equal to the number of shared words between the two job titles.
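For instance, a minimal illustration with made-up 0/1 vectors over the vocabulary ['battalion', 'chief', 'deputy', 'fire']:

```python
import numpy as np

# 0/1 word vectors over the vocabulary ['battalion', 'chief', 'deputy', 'fire'].
deputy_fire_chief    = np.array([0, 1, 1, 1])
fire_battalion_chief = np.array([1, 1, 0, 1])

# For 0/1 vectors, the dot product counts the positions where both entries are 1,
# i.e. the number of shared words ('fire' and 'chief' here).
np.dot(deputy_fire_chief, fire_battalion_chief)   # 2
```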

Aside: Dot product

$$\vec{a} \cdot \vec{b} = a_1b_1 + a_2b_2 + ... + a_nb_n$$
$$\vec{a} \cdot \vec{b} = |\vec{a}| |\vec{b}| \cos \theta$$

Computing similarities

To find the job title that is most similar to 'deputy fire chief', we can compute the dot product of the 'deputy fire chief' word vector with all other titles' word vectors, and find the title with the highest dot product.

To do so, we can apply np.dot to each row that doesn't correspond to 'deputy fire chief'.
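A sketch of that computation, assuming `counts` is the counts matrix from above, with job titles as its row labels:

```python
import numpy as np

# Dot product of every other row with the 'deputy fire chief' row.
dfc = counts.loc['deputy fire chief']
dots = (
    counts
    .drop(index='deputy fire chief')
    .apply(lambda row: np.dot(row, dfc), axis=1)
)

# Titles with the highest dot products share the most words with 'deputy fire chief'.
dots.sort_values(ascending=False).head()
```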

The unique job titles that are most similar to 'deputy fire chief' are given below.

Note that they all have two words in common with 'deputy fire chief'.

Note: To truly use the dot product as a measure of similarity, we should normalize by the lengths of the word vectors.

Bag of words


Aside: Interactive bag of words demo

Check this site out – it automatically generates a bag of words matrix for you!


Cosine similarity

Cosine similarity and bag of words

To measure the similarity between two word vectors, we compute their normalized dot product, also known as their cosine similarity.

$$\cos \theta = \boxed{\frac{\vec{a} \cdot \vec{b}}{|\vec{a}| | \vec{b}|}}$$

If $\cos \theta$ is large, the two word vectors are similar. It is important to normalize by the lengths of the vectors, otherwise texts with more words will have artificially high similarities with other texts.

Note: Sometimes, you will see the cosine distance being used. It is the complement of cosine similarity:

$$\text{dist}(\vec{a}, \vec{b}) = 1 - \cos \theta$$

If $\text{dist}(\vec{a}, \vec{b})$ is small, the two word vectors are similar.
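A minimal sketch of both quantities for numpy word vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    """Normalized dot product of word vectors a and b."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def cosine_distance(a, b):
    """Complement of cosine similarity; small when a and b are similar."""
    return 1 - cosine_similarity(a, b)
```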

A recipe for computing similarities

Given a set of documents, to find the most similar text to one document $d$ in particular:

1. Use the bag of words model to create a word-count vector for every document.
2. Compute the cosine similarity between $d$'s word vector and every other document's word vector.
3. The document whose word vector has the highest cosine similarity with $d$'s word vector is the most similar to $d$.
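A sketch of the whole recipe, assuming `counts` is a bag of words DataFrame whose rows are documents and whose index contains the document of interest, `d`:

```python
import numpy as np

def most_similar_to(counts, d):
    # Normalize each row to unit length so that dot products become cosine similarities.
    unit = counts.div(np.sqrt((counts ** 2).sum(axis=1)), axis=0)
    sims = unit @ unit.loc[d]            # cosine similarity of every document with d
    return sims.drop(index=d).idxmax()   # the most similar other document
```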

Example: Global warming 🌎

Consider the following three documents.

Let's represent each document using the bag of words model.

Let's now find the cosine similarity between each document.
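The documents shown in lecture aren't reproduced here; the sketch below uses hypothetical sentences just to illustrate the computation:

```python
import numpy as np
import pandas as pd

# Hypothetical sentences (the documents used in lecture differ).
sentences = pd.Series([
    'global warming is real',
    'global warming is fake',
    'the earth is getting warmer',
])

# Bag of words: one row per sentence, one column per word.
words = sentences.str.split().explode()
bow = words.groupby(level=0).value_counts().unstack(fill_value=0)

# Pairwise cosine similarities: normalize each row, then multiply by the transpose.
unit = bow.div(np.sqrt((bow ** 2).sum(axis=1)), axis=0)
unit @ unit.T
```

Note that the first two hypothetical sentences mean opposite things, yet their word vectors are very similar, which previews the issue below.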

Issue: Bag of words only encodes the words that each document uses, not their meanings.

Pitfalls of the bag of words model

Remember, the key assumption underlying the bag of words model is that two documents are similar if they share many words in common.

TF-IDF

The importance of words

Issue: The bag of words model doesn't know which words are "important" in a document. Consider the following document:

"my brother has a friend named billy who has an uncle named billy"

How do we determine which words are important? Intuitively, a word is a better summary of a document the more often it appears in that document, but a worse summary if it also appears in most other documents, since common words (like "has" or "my") don't distinguish one document from another.

Goal: Find a way of quantifying the importance of a word in a document by balancing these two factors, i.e. find the word that best summarizes a document.

Term frequency

$$\text{tf}(t, d) = \frac{\text{# of occurrences of $t$ in $d$}}{\text{total # of words in $d$}}$$
"my brother has a friend named billy who has an uncle named billy"

Inverse document frequency

$$\text{idf}(t) = \log \left(\frac{\text{total # of documents}}{\text{# of documents in which $t$ appears}} \right)$$

Intuition

$$\text{tf}(t, d) = \frac{\text{# of occurrences of $t$ in $d$}}{\text{total # of words in $d$}}$$

$$\text{idf}(t) = \log \left(\frac{\text{total # of documents}}{\text{# of documents in which $t$ appears}} \right)$$

Goal: Quantify how well word $t$ summarizes document $d$.

Term frequency-inverse document frequency

The term frequency-inverse document frequency (TF-IDF) of word $t$ in document $d$ is the product:

$$ \begin{align*}\text{tfidf}(t, d) &= \text{tf}(t, d) \cdot \text{idf}(t) \\ &= \frac{\text{# of occurrences of $t$ in $d$}}{\text{total # of words in $d$}} \cdot \log \left(\frac{\text{total # of documents}}{\text{# of documents in which $t$ appears}} \right) \end{align*} $$
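A minimal sketch of these three quantities in code, assuming `documents` is a list of strings:

```python
import numpy as np

def tf(t, d):
    """Term frequency of word t in document d."""
    words = d.split()
    return words.count(t) / len(words)

def idf(t, documents):
    """Inverse document frequency of word t across all documents."""
    num_containing = sum(t in d.split() for d in documents)
    return np.log(len(documents) / num_containing)

def tfidf(t, d, documents):
    """TF-IDF of word t in document d."""
    return tf(t, d) * idf(t, documents)
```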

Computing TF-IDF

Question: What is the TF-IDF of "global" in the second sentence?

Answer

Question: Is this big or small? Is "global" the best summary of the second sentence?

TF-IDF of all words in all documents

On its own, the TF-IDF of a word in a document doesn't really tell us anything; we must compare it to TF-IDFs of other words in that same document.
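A sketch of that comparison table, assuming `documents` is a list of strings and `tfidf()` is defined as above (the lecture calls this DataFrame `tfidf`; it's named `tfidf_df` here to avoid clashing with the function):

```python
import pandas as pd

# One row per document, one column per word; compare TF-IDFs within a row.
all_words = sorted(set(' '.join(documents).split()))
tfidf_df = pd.DataFrame(
    [{t: tfidf(t, d, documents) for t in all_words} for d in documents]
)
tfidf_df
```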

Interpreting TF-IDFs

The above DataFrame tells us that:

Note that there are two ways that $\text{tfidf}(t, d) = \text{tf}(t, d) \cdot \text{idf}(t)$ can be 0: if $t$ never appears in document $d$, then $\text{tf}(t, d) = 0$; and if $t$ appears in every document, then $\text{idf}(t) = \log(1) = 0$.

The word that best summarizes a document is the word with the highest TF-IDF for that document:
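For example, assuming `tfidf_df` is the DataFrame of TF-IDFs from above:

```python
# The column (word) with the largest value in each row (document).
tfidf_df.idxmax(axis=1)
```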

Look closely at the rows of tfidf – in documents 1 and 2, the max TF-IDF is not unique!

Example: State of the Union addresses 🎤

State of the Union addresses

The 2023 State of the Union address was on February 7th, 2023.

The data

The entire corpus (another word for "set of documents") is over 10 million characters long... let's not display it in our notebook.

Each speech is separated by '***'.

Note that each "speech" currently contains other information, like the name of the president and the date of the address.

Let's extract just the speech text.
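A sketch of that extraction, assuming `text` holds the entire corpus as one string; the exact number of header lines per speech is an assumption:

```python
# One element per speech (plus its header lines); drop any preamble before the first '***'.
speeches = text.split('***')[1:]

def extract_speech_text(speech):
    # Assume the first few lines are metadata (title, president, date)
    # and the rest is the speech itself.
    lines = speech.strip().split('\n')
    return ' '.join(lines[3:]).lower()

speech_texts = [extract_speech_text(s) for s in speeches]
```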

Finding the most important words in each speech

Here, a "document" is a speech. We have 233 documents.

A rough sketch of what we'll compute:

for each word t:
    for each speech d:
        compute tfidf(t, d)

Note that the TF-IDFs of many common words are all 0, since a word that appears in every speech has $\text{idf}(t) = \log(1) = 0$!

Summarizing speeches

By using idxmax, we can find the word with the highest TF-IDF in each speech.

What if we want to see the 5 words with the highest TF-IDFs, for each speech?
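One way to do it, assuming `tfidf_df` has one row per speech and one column per word:

```python
# The five words with the largest TF-IDFs in each speech.
keywords = tfidf_df.apply(lambda row: list(row.nlargest(5).index), axis=1)
keywords_df = keywords.to_frame(name='keywords')
keywords_df
```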

Run the cell below to see every single row of keywords_df.

Aside: What if we remove the $\log$ from $\text{idf}(t)$?

Let's try it and see what happens.
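A sketch of the modified IDF, assuming the same `documents` list as before:

```python
def idf_no_log(t, documents):
    # Same ratio as idf(t), but without the log.
    num_containing = sum(t in d.split() for d in documents)
    return len(documents) / num_containing
```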

The role of $\log$ in $\text{idf}(t)$

$$ \begin{align*}\text{tfidf}(t, d) &= \text{tf}(t, d) \cdot \text{idf}(t) \\ &= \frac{\text{# of occurrences of $t$ in $d$}}{\text{total # of words in $d$}} \cdot \log \left(\frac{\text{total # of documents}}{\text{# of documents in which $t$ appears}} \right) \end{align*} $$

The $\log$ dampens the IDF: without it, a word that appears in only one or two of hundreds of documents receives an enormous IDF and dominates every TF-IDF ranking, whereas with the $\log$, rare words are still weighted more heavily but don't overwhelm everything else.

Summary, next time

Summary

Next time

Modeling and feature engineering.