Lecture 18 – Text as Data

DSC 80, Spring 2022



Remember to refer to dsc80.com/resources/#regular-expressions.

Example: Log parsing

Recall the log string from a few lectures ago.

Let's use our new regex syntax (including capturing groups) to extract the day, month, year, and time from the log string s.

While the above regex works, it is not very specific: it also matches incorrectly formatted log strings.

The more specific, the better!

A benefit of new_exp over exp is that it doesn't capture anything when the string doesn't follow the format we specified.
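A minimal sketch of this comparison. The log string and the exact definitions of exp and new_exp below are assumptions for illustration, not necessarily the ones from lecture:

```python
import re

# An Apache-style log string in the same spirit as the one from lecture
# (the exact string is an assumption here).
s = '132.249.20.188 - - [05/May/2022:14:26:15 -0800] "GET /my/home/ HTTP/1.1"'

# A loose pattern: anything between '[', slashes, ':', and a space.
exp = r'\[(.+)\/(.+)\/(.+):(.+) .+\]'

# A more specific pattern: two digits for the day, a capitalized
# three-letter month, four digits for the year, and an hh:mm:ss time.
new_exp = r'\[(\d{2})\/([A-Z][a-z]{2})\/(\d{4}):(\d{2}:\d{2}:\d{2}) .+\]'

day, month, year, time = re.findall(new_exp, s)[0]

# The loose pattern happily "extracts" garbage from a malformed string;
# the specific pattern captures nothing at all.
bad = '[ab/cde/fgh:ij kl]'
re.findall(exp, bad)      # one (meaningless) match
re.findall(new_exp, bad)  # [] — no match
```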

Another character class

The \b character class refers to "word boundaries". It matches the (zero-width) position between a word character (letter, digit, or underscore) and a non-word character, including the start and end of the string.

Remember, the \w character class refers to letters, digits, and underscores, i.e. "word" characters.
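For instance (an illustrative example; note the raw string, which the next slide explains):

```python
import re

# \b matches only where a "word" (a run of letters/digits/underscores)
# begins or ends, so 'cat' inside 'catalog' or 'concat' doesn't match.
re.findall(r'\bcat\b', 'cat catalog concat cat.')  # ['cat', 'cat']
```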

Question: What's with the \\?

Aside: "raw" strings

Sometimes, the regex meaning and Python meaning of a special string clash.

To prevent Python from interpreting \b, \n, etc. as escape sequences of its own, and to keep them in their "raw" form, use raw strings.

To create a raw string, add the character r right before the quotes.

Raw strings can help us avoid misinterpretations like the one below.

If you don't want to use a raw string, you'd instead have to escape the \b with another \, as we did on the previous slide:
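A quick demonstration of the difference:

```python
import re

# Without a raw string, Python turns \b into a backspace character
# before re ever sees the pattern.
len('\b')    # 1 — a single backspace character
len(r'\b')   # 2 — a backslash followed by 'b'

# These two patterns are equivalent: escape the backslash, or use r''.
re.findall('\\bcat\\b', 'the cat sat')  # ['cat']
re.findall(r'\bcat\b', 'the cat sat')   # ['cat']
```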


Limitations of regexes

Writing a regular expression is like writing a program.

Regular expressions are terrible at certain types of problems. Examples:

Below is a regular expression that validates email addresses in Perl. See this article for more details.

Stack Overflow once crashed due to a regex (catastrophic backtracking)! See this article for the details.


Quantifying text data


Example: San Diego employee salaries

Recall, in Lectures 1 and 2, we worked with a dataset of San Diego city employee salaries in 2020.

We asked the question, "Does gender influence pay?"

A follow-up question to that was "Do men and women make similar salaries amongst those with similar jobs?" – but what makes two jobs similar?

Exploring job titles

How many job titles are there in the dataset? How many unique job titles are there?

What are the most common job titles?

Messiness of job titles

Run the cell below repeatedly to get a feel for the "messiness" of job titles in their current state.

Canonicalizing job titles

Let's try to canonicalize job titles. To do this, we'll look at:


Are there job titles with unnecessary punctuation that we can remove?

It seems like we should replace these pieces of punctuation with a single space.
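A sketch of that replacement, on a few made-up titles (the actual dataset's column and punctuation are assumptions here):

```python
import pandas as pd

# A few messy titles, invented for illustration.
jobtitles = pd.Series(['Assistant Fire Chief', 'Fire/Emergency Svcs Captain',
                       'Park Ranger - Lifeguard'])

# Replace punctuation with a space, then collapse runs of spaces.
cleaned = (jobtitles.str.replace(r'[^A-Za-z0-9 ]', ' ', regex=True)
                    .str.replace(r' +', ' ', regex=True)
                    .str.strip())
```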

"Glue" words

Are there job titles with "glue" words in the middle, such as Assistant to the Chief?

To figure out if any titles contain the word 'to', we can't just do the following, because it will evaluate to True for job titles that have 'to' anywhere in them, even if not as a standalone word.

Instead, we need to look for 'to' separated by word boundaries.
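Side by side, on invented example titles:

```python
import pandas as pd

titles = pd.Series(['assistant to the chief', 'inspector ii', 'custodian'])

# Naive containment: 'to' also appears inside 'inspector' and 'custodian'.
titles.str.contains('to')        # True, True, True

# Word-boundary version: only matches 'to' as a standalone word.
titles.str.contains(r'\bto\b')   # True, False, False
```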

We can look for other filler words too, like 'the' and 'for'.

We should probably remove these "glue" words.

Fixing punctuation and removing "glue" words

To canonicalize job titles, we'll start by:


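A sketch of such a canonicalization function. The exact punctuation handling and glue-word list below are assumptions, not necessarily the ones used in lecture:

```python
import re

def canonicalize(title: str) -> str:
    """Lowercase a title, replace punctuation with spaces, and drop
    glue words (assumed here to be 'to', 'the', and 'for')."""
    title = title.lower()
    title = re.sub(r'[^a-z0-9 ]', ' ', title)       # punctuation -> space
    title = re.sub(r'\b(to|the|for)\b', '', title)  # remove glue words
    return re.sub(r' +', ' ', title).strip()        # collapse spaces

canonicalize('Assistant to the Chief - Operations')
```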
Which job titles are inconsistently described? Let's look at three categories – librarians, engineers, and directors.

The limits of canonicalization

Bag of words 👜

A counts matrix

Let's create a "counts" matrix, with one row per job title and one column per unique word, such that each entry counts the number of occurrences of that word in that title.

Such a matrix might look like:

                                             senior  lecturer  teaching  professor  assistant  associate
senior lecturer                                   1         1         0          0          0          0
assistant teaching professor                      0         0         1          1          1          0
associate professor                               0         0         0          1          0          1
senior assistant to the assistant professor       1         0         0          1          2          0

Creating a counts matrix

First, we need to determine all words that are used across all job titles.

Next, we need to find a list of all unique words used in titles. (We can do this with np.unique, but value_counts shows us the distribution, which is interesting.)

For each of the 435 unique words that are used in job titles, we can count the number of occurrences of the word in each job title.

counts_df has one row for each of the 12,605 job titles (one per employee), and one column for each unique word that is used in a job title.
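One way to build such a matrix, sketched on a tiny invented set of titles (the real construction in lecture may differ):

```python
import pandas as pd

# A few canonicalized titles, invented for illustration.
jobtitles = pd.Series(['senior lecturer', 'assistant teaching professor',
                       'associate professor'])

# All words across all titles; value_counts shows the distribution,
# and its index gives us the unique words.
all_words = jobtitles.str.split().explode()
unique_words = all_words.value_counts().index

# One column per unique word; entry (i, w) counts occurrences of
# word w in title i.
counts_df = pd.DataFrame({
    word: jobtitles.str.count(rf'\b{word}\b') for word in unique_words
})
```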

To put into context what the numbers in counts_df mean, we can show the actual job title for each row.

The first row tells us that the first job title contains 'police' once and 'officer' once. The fifth row tells us that the fifth job title contains 'fire' once.

Interpreting the counts matrix

The Series below describes the 20 most common words used in job titles, along with the number of times they appeared in all job titles (including repeats). We will call these words "top 20".

The Series below describes the number of top 20 words used in each job title.

Question: What job titles are most similar to 'asst fire chief'?

To start, let's compare 'asst fire chief' to 'fire battalion chief'.

We can stack these two vectors horizontally.

One way to measure how similar the above two vectors are is through their dot product.

Here, since both vectors consist only of 1s and 0s, the dot product is equal to the number of shared words between the two job titles.
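Concretely, over an assumed vocabulary of ['asst', 'fire', 'chief', 'battalion']:

```python
import numpy as np

# Binary word vectors over the vocabulary
# ['asst', 'fire', 'chief', 'battalion'] (an illustrative example).
asst_fire_chief      = np.array([1, 1, 1, 0])
fire_battalion_chief = np.array([0, 1, 1, 1])

# For 0/1 vectors, the dot product counts shared words.
np.dot(asst_fire_chief, fire_battalion_chief)   # 2 ('fire' and 'chief')
```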

Aside: dot product

$$\vec{a} \cdot \vec{b} = a_1b_1 + a_2b_2 + ... + a_nb_n$$
$$\vec{a} \cdot \vec{b} = |\vec{a}| |\vec{b}| \cos \theta$$

Computing similarities

To find the job title that is most similar to 'asst fire chief', we can compute the dot product of the 'asst fire chief' word vector with all other titles' word vectors, and find the title with the highest dot product.

To do so, we can apply np.dot to each row that doesn't correspond to 'asst fire chief'.
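A sketch of this computation on a tiny, invented counts matrix:

```python
import numpy as np
import pandas as pd

# Rows are titles, columns are words (an illustrative example).
counts_df = pd.DataFrame(
    [[1, 1, 1, 0], [0, 1, 1, 1], [1, 0, 0, 0]],
    index=['asst fire chief', 'fire battalion chief', 'asst chemist'],
    columns=['asst', 'fire', 'chief', 'battalion'],
)

target = counts_df.loc['asst fire chief']

# Dot each remaining row with the target's word vector.
dots = (counts_df.drop(index='asst fire chief')
                 .apply(lambda row: np.dot(row, target), axis=1))
dots.idxmax()   # 'fire battalion chief' (shares 'fire' and 'chief')
```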

The unique job titles that are most similar to 'asst fire chief' are given below.

Note that they all share two words in common with 'asst fire chief'.

Note: To truly use the dot product as a measure of similarity, we should normalize by the lengths of the word vectors. More on this soon.

Bag of words


Cosine similarity and bag of words

To measure the similarity between two word vectors, we compute their cosine similarity: the cosine of the angle between them, i.e. their dot product divided by the product of their lengths.

$$\cos \theta = \frac{\vec{a} \cdot \vec{b}}{|\vec{a}| | \vec{b}|}$$

If $\cos \theta$ is large, the two word vectors are similar. It is important to normalize by the lengths of the vectors, otherwise texts with more words will have artificially high similarities with other texts.

Note: Sometimes, you will see the cosine distance being used. It is the complement of cosine similarity:

$$\text{dist}(\vec{a}, \vec{b}) = 1 - \cos \theta$$

If $\text{dist}(\vec{a}, \vec{b})$ is small, the two word vectors are similar.

A recipe for computing similarities

Given a set of texts, to find the most similar text to one text $T$ in particular: represent every text (including $T$) as a word vector, compute the cosine similarity between $T$'s word vector and every other text's word vector, and take the text whose word vector has the highest cosine similarity with $T$'s.

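The recipe can be sketched end-to-end as follows. The helper name `most_similar` and the example texts are assumptions for illustration:

```python
import numpy as np
import pandas as pd

def most_similar(texts: pd.Series, T: str) -> str:
    """Return the text in `texts` most similar to T, using bag of
    words with cosine similarity (a sketch of the recipe above)."""
    # The vocabulary: every unique word across the texts and T.
    vocab = sorted(set(' '.join(list(texts) + [T]).split()))

    def vec(t):
        words = t.split()
        return np.array([words.count(w) for w in vocab])

    v_T = vec(T)
    sims = texts.apply(
        lambda t: np.dot(vec(t), v_T)
                  / (np.linalg.norm(vec(t)) * np.linalg.norm(v_T))
    )
    return texts.loc[sims.idxmax()]

texts = pd.Series(['asst fire chief', 'fire battalion chief', 'asst chemist'])
most_similar(texts, 'asst chief')   # 'asst fire chief'
```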
Example: Global warming 🌎

Consider the following sentences.

Let's represent each sentence using the bag of words model.

Let's now find the cosine similarity between each sentence.

Issue: Bag of words only encodes the words that each sentence uses, not their meanings.
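To see the issue numerically, here are two made-up sentences (not necessarily the ones from lecture) that disagree completely yet score a high cosine similarity:

```python
import numpy as np

# Two sentences with opposite meanings (invented examples).
s1 = 'global warming is real and caused by humans'
s2 = 'global warming is fake and caused by nobody'

vocab = sorted(set(s1.split()) | set(s2.split()))
v1 = np.array([s1.split().count(w) for w in vocab])
v2 = np.array([s2.split().count(w) for w in vocab])

cos_sim = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
# cos_sim is 0.75 — high, because 6 of the 8 words are shared,
# even though the sentences contradict each other.
```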

Pitfalls of the bag of words model

Remember, the key assumption underlying the bag of words model is that two texts are similar if they share many words in common.

Summary, next time