Lecture 19 – Text as Data, Continued

DSC 80, Spring 2022

Announcements

Agenda

Bag of words 💰

Recap

Recall that last class, we created a counts matrix out of a Series containing San Diego employees' job titles.

Question: What job titles are most similar to 'asst fire chief'?

To start, let's compare 'asst fire chief' to 'fire battalion chief'.

We can stack these two vectors horizontally.

One way to measure how similar the above two vectors are is through their dot product.

Here, since both vectors consist only of 1s and 0s, the dot product is equal to the number of shared words between the two job titles.
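For instance, here is a minimal sketch with made-up 0/1 vectors; the real vectors come from the counts matrix, and the four-word vocabulary below is only for illustration.

    import numpy as np

    # Hypothetical word vectors over the (made-up) vocabulary
    # ['asst', 'battalion', 'chief', 'fire'].
    asst_fire_chief      = np.array([1, 0, 1, 1])
    fire_battalion_chief = np.array([0, 1, 1, 1])

    # For 0/1 vectors, the dot product counts the positions where both are 1,
    # i.e. the number of shared words ('chief' and 'fire' here, so 2).
    np.dot(asst_fire_chief, fire_battalion_chief)  # 2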

Aside: dot product

$$\vec{a} \cdot \vec{b} = a_1b_1 + a_2b_2 + ... + a_nb_n$$
$$\vec{a} \cdot \vec{b} = |\vec{a}| |\vec{b}| \cos \theta$$

Computing similarities

To find the job title that is most similar to 'asst fire chief', we can compute the dot product of the 'asst fire chief' word vector with all other titles' word vectors, and find the title with the highest dot product.

To do so, we can apply np.dot to each row that doesn't correspond to 'asst fire chief'.
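As a sketch, assuming the counts matrix is stored in a DataFrame called counts_df (a hypothetical name) with one row of 0s and 1s per unique job title:

    import numpy as np

    # Word vector for 'asst fire chief'.
    target = counts_df.loc['asst fire chief']

    # Dot product of every other title's word vector with the target.
    dots = (
        counts_df
        .drop(index='asst fire chief')
        .apply(lambda row: np.dot(row, target), axis=1)
    )

    # Titles that share the most words with 'asst fire chief'.
    dots.sort_values(ascending=False).head()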

The unique job titles that are most similar to 'asst fire chief' are given below.

Note that they all share two words in common with 'asst fire chief'.

Note: To truly use the dot product as a measure of similarity, we should normalize by the lengths of the word vectors. More on this soon.

Bag of words


Cosine similarity and bag of words

To measure the similarity between two word vectors, we compute their cosine similarity.

$$\cos \theta = \frac{\vec{a} \cdot \vec{b}}{|\vec{a}| | \vec{b}|}$$

If $\cos \theta$ is large, the two word vectors are similar. It is important to normalize by the lengths of the vectors, otherwise documents with more words will have artificially high similarities with other documents.

Note: Sometimes, you will see the cosine distance being used. It is the complement of cosine similarity:

$$\text{dist}(\vec{a}, \vec{b}) = 1 - \cos \theta$$

If $\text{dist}(\vec{a}, \vec{b})$ is small, the two word vectors are similar.
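As a minimal sketch, both quantities can be computed directly with numpy:

    import numpy as np

    def cosine_similarity(a, b):
        """Cosine of the angle between vectors a and b."""
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def cosine_distance(a, b):
        """1 minus cosine similarity."""
        return 1 - cosine_similarity(a, b)

    # Example with the hypothetical 0/1 vectors from before.
    a = np.array([1, 0, 1, 1])
    b = np.array([0, 1, 1, 1])
    cosine_similarity(a, b)  # 2 / (sqrt(3) * sqrt(3)) = 2/3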

A recipe for computing similarities

Given a set of documents, to find the most similar document to one document $D$ in particular:

1. Represent each document as a word vector (e.g., using the bag of words model).
2. Compute the cosine similarity between $D$'s word vector and every other document's word vector.
3. The document most similar to $D$ is the one whose word vector has the highest cosine similarity with $D$'s.
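Here is one sketch of that recipe, assuming the word vectors are the rows of a DataFrame called counts_df (a hypothetical name) with one row per document:

    import numpy as np

    def most_similar(counts_df, doc):
        """Other documents, sorted by cosine similarity to the document labeled `doc`."""
        # Divide each row by its length so that dot products become cosine similarities.
        normalized = counts_df.div(np.linalg.norm(counts_df, axis=1), axis=0)
        sims = normalized @ normalized.loc[doc]
        return sims.drop(index=doc).sort_values(ascending=False)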

Example: Global warming 🌎

Consider the following sentences (each of which is a "document").

Let's represent each sentence using the bag of words model.

Let's now find the cosine similarity between each pair of sentences.
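A minimal sketch of both steps, using hypothetical sentences (not necessarily the exact ones from lecture):

    import numpy as np
    import pandas as pd

    # Hypothetical example documents.
    sentences = pd.Series([
        'I really really want global peace',
        'I must enjoy global warming',
        'We must solve climate change',
    ])

    # Bag of words: one row per sentence, one column per unique word.
    all_words = sentences.str.split().explode().unique()
    counts = pd.DataFrame(
        [[s.split().count(w) for w in all_words] for s in sentences],
        columns=all_words,
    )

    # Pairwise cosine similarities between the sentences' word vectors.
    normalized = counts.div(np.linalg.norm(counts, axis=1), axis=0)
    normalized @ normalized.T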

Issue: Bag of words only encodes the words that each sentence uses, not their meanings.

Pitfalls of the bag of words model

Remember, the key assumption underlying the bag of words model is that two documents are similar if they share many words in common.

TF-IDF

The importance of words

Issue: The bag of words model doesn't know which words are "important" in a document. How do we determine which words are important?

Goal: Find a way of quantifying the importance of a word in a document by balancing two factors: how often the word occurs in that document, and how rarely it occurs across documents in general.

Term frequency

$$\text{tf}(t, d) = \frac{\text{number of occurrences of $t$ in $d$}}{\text{total number of words in $d$}}$$
"my brother has a friend named billy who has an uncle named billy"

Inverse document frequency

$$\text{idf}(t) = \log \left(\frac{\text{total number of documents}}{\text{number of documents in which $t$ appears}} \right)$$

Intuition

$$\text{tf}(t, d) = \frac{\text{number of occurrences of $t$ in $d$}}{\text{total number of words in $d$}}$$

$$\text{idf}(t) = \log \left(\frac{\text{total number of documents}}{\text{number of documents in which $t$ appears}} \right)$$

Goal: Quantify how well word $t$ summarizes document $d$.

Term frequency-inverse document frequency

The term frequency-inverse document frequency (TF-IDF) of word $t$ in document $d$ is the product:

$$ \begin{align*}\text{tfidf}(t, d) &= \text{tf}(t, d) \cdot \text{idf}(t) \\ &= \frac{\text{number of occurrences of $t$ in $d$}}{\text{total number of words in $d$}} \cdot \log \left(\frac{\text{total number of documents}}{\text{number of documents in which $t$ appears}} \right) \end{align*} $$
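A direct, unoptimized translation of these formulas into code might look like the sketch below; it assumes documents are plain strings and that $t$ appears in at least one of them.

    import numpy as np

    def tf(t, d):
        """Term frequency of word t in document d (a string)."""
        words = d.split()
        return words.count(t) / len(words)

    def idf(t, documents):
        """Inverse document frequency of word t across a list of document strings."""
        num_containing = sum(t in d.split() for d in documents)
        return np.log(len(documents) / num_containing)

    def tfidf(t, d, documents):
        """TF-IDF of word t in document d, relative to the whole collection."""
        return tf(t, d) * idf(t, documents)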

Computing TF-IDF

Question: What is the TF-IDF of "global" in the second sentence?

Answer

Question: Is this big or small? Is "global" the best summary of the second sentence?

TF-IDF of all words in all documents

On its own, the TF-IDF of a word in a document doesn't really tell us anything; we must compare it to TF-IDFs of other words in that same document.

The above DataFrame tells us that:

Note that there are two ways that $\text{tfidf}(t, d)$ can be 0: if $t$ doesn't appear in $d$ at all (so its term frequency is 0), or if $t$ appears in every document (so its inverse document frequency is $\log(1) = 0$).

The word that best summarizes a document is the word with the highest TF-IDF for that document:
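Assuming the sentences' TF-IDFs are stored in a DataFrame called tfidf (the name referenced below), this is essentially one idxmax call:

    # For each document (row), the word (column) with the largest TF-IDF.
    # Note: when the maximum is tied, idxmax returns the first such column.
    tfidf.idxmax(axis=1)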

Look closely at the rows of tfidf – in sentences 0 and 2, the max TF-IDF is not unique!

Example: State of the Union addresses 🎤

The data

The entire corpus (another word for "set of documents") is over 10 million characters long... let's not display it in our notebook.

Each speech is separated by '***'.

Note that each "speech" currently contains other information, like the name of the president and the date of the address.

Let's extract just the speech text.
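A rough sketch of that extraction, assuming the corpus has been read into a single string called text; the number of metadata lines per speech and the cleaning steps here are assumptions, not the lecture's exact code.

    import re

    # Split the corpus into speeches; drop anything before the first '***'.
    raw_speeches = text.split('***')[1:]

    def extract_contents(speech):
        """Drop the leading metadata lines (title, president, date) and keep
        only lowercase words -- the '3 metadata lines' figure is an assumption."""
        body = '\n'.join(speech.strip().split('\n')[3:])
        body = re.sub(r'[^a-z\s]', ' ', body.lower())
        return re.sub(r'\s+', ' ', body).strip()

    speeches = [extract_contents(s) for s in raw_speeches]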

Finding the most important words in each speech

Here, a "document" is a speech. We have 232 documents.

A rough sketch of what we'll compute:

for each word w:
    for each speech d:
        compute tfidf(w, d)
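Here is a sketch of one way to carry that out with pandas; it assumes speeches is a list of cleaned speech strings (as in the earlier sketch), and computes each word's column of TF-IDFs all at once rather than speech by speech.

    import numpy as np
    import pandas as pd

    speeches_ser = pd.Series(speeches)                         # one speech per entry
    num_words = speeches_ser.str.split().str.len()             # total words per speech
    unique_words = speeches_ser.str.split().explode().unique()

    tfidf_dict = {}
    for word in unique_words:
        counts = speeches_ser.str.count(fr'\b{word}\b')        # occurrences per speech
        tf = counts / num_words
        idf = np.log(len(speeches_ser) / (counts > 0).sum())
        tfidf_dict[word] = tf * idf

    tfidf = pd.DataFrame(tfidf_dict)   # rows: speeches, columns: words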

Note that the TF-IDFs of many common words are all 0!

Summarizing speeches

By using idxmax, we can find the word with the highest TF-IDF in each speech.

What if we want to see the 5 words with the highest TF-IDFs, for each speech?
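One sketch of how that could be computed from the tfidf DataFrame; the keywords_df displayed below may have been built differently (e.g., with the president and date attached).

    # For each speech (row), the five words with the largest TF-IDFs.
    keywords = tfidf.apply(lambda row: row.nlargest(5).index.tolist(), axis=1)
    keywords.head()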

Run the cell below to see every single row of keywords_df.

Aside: What if we remove the $\log$ from $\text{idf}(t)$?

Let's try it and see what happens.

The role of $\log$ in $\text{idf}(t)$

$$ \begin{align*}\text{tfidf}(t, d) &= \text{tf}(t, d) \cdot \text{idf}(t) \\ &= \frac{\text{number of occurrences of $t$ in $d$}}{\text{total number of words in $d$}} \cdot \log \left(\frac{\text{total number of documents}}{\text{number of documents in which $t$ appears}} \right) \end{align*} $$

Summary, next time

Summary