Lecture 18 – Regular Expressions, Bag of Words

DSC 80, Spring 2023

Agenda

More regular expressions

Even more regex syntax

operation | example | matches ✅ | does not match ❌
escape character | ucsd\.edu | 'ucsd.edu' | 'ucsd!edu'
beginning of line | ^ark | 'ark two', 'ark o ark' | 'dark'
end of line | ark$ | 'dark', 'ark o ark' | 'ark two'
zero or one | cat? | 'ca', 'cat' | 'cart' (matches 'ca' only)
built-in character classes* | \w+, \d+ | 'billy', '231231' | 'this person', '858 people'
character class negation | [^a-z]+ | 'KINGTRITON551', '1721$$' | 'porch', 'billy.edu'
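
A few rows of the table above can be checked directly in Python. This is a minimal sketch that uses re.search (introduced later in this lecture) to ask whether a pattern matches anywhere in a string; the names examples, results, and optional_match are made up for illustration.

```python
import re

# (pattern, string) pairs taken from the table above.
examples = [
    (r"ucsd\.edu", "ucsd.edu"),
    (r"ucsd\.edu", "ucsd!edu"),
    (r"^ark", "ark o ark"),
    (r"^ark", "dark"),
    (r"ark$", "dark"),
    (r"ark$", "ark two"),
]

# re.search returns a match object if the pattern matches anywhere, else None.
results = {
    (pattern, text): re.search(pattern, text) is not None
    for pattern, text in examples
}

for (pattern, text), matched in results.items():
    print(f"{pattern!r:14} {text!r:14} {'match' if matched else 'no match'}")

# 'cat?' makes the 't' optional, so only 'ca' is matched inside 'cart'.
optional_match = re.search(r"cat?", "cart").group()
print(optional_match)
```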

Example (built-in character classes)

*Note: in Python's implementation of regex,

Exercise

Write a regular expression that matches any string that:

Examples include 'yoo.ee.IOU' and 'AI.I oey'.


✅ One answer (try it yourself at regex101.com first): ^[aeiouyAEIOUY. ]{5,10}$
Key idea: Within a character class (i.e. [...]), special characters do not generally need to be escaped.
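
As a quick check, the answer above can be tested against the two example strings. This is only a sketch: the strings in should_not_match are made up for illustration, and the last two lines show the key idea that a period inside [...] is a literal period.

```python
import re

answer = r"^[aeiouyAEIOUY. ]{5,10}$"

should_match = ["yoo.ee.IOU", "AI.I oey"]                 # examples from the exercise
should_not_match = ["aeio", "hello.you", "aeiouaeioua"]   # hypothetical non-examples

matches = [bool(re.search(answer, s)) for s in should_match]
non_matches = [bool(re.search(answer, s)) for s in should_not_match]
print(matches, non_matches)

# Outside a character class, '.' matches any character; inside, it is a literal period.
any_char = re.findall(r".", "a.b")
literal_period = re.findall(r"[.]", "a.b")
print(any_char, literal_period)
```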

Regex in Python

re in Python

The re package is built into Python. It allows us to use regular expressions to find, extract, and replace strings.

re.search takes in a string regex and a string text and returns the location and substring corresponding to the first match of regex in text.

re.findall takes in a string regex and a string text and returns a list of all matches of regex in text. You'll use this most often.

re.sub takes in a string regex, a string repl, and a string text, and replaces all matches of regex in text with repl.
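
The three functions above can be compared side by side. This is a minimal sketch; the sample text and the phone-number pattern are made up for illustration.

```python
import re

text = "Call 858-534-2230 or 619-555-0100 for details."  # made-up sample text
phone = r"\d{3}-\d{3}-\d{4}"

# re.search: a match object for the first match only (or None if there is no match).
first = re.search(phone, text)
print(first.span(), first.group())

# re.findall: a list of every matching substring.
all_numbers = re.findall(phone, text)
print(all_numbers)

# re.sub: a new string with every match replaced by repl.
redacted = re.sub(phone, "###", text)
print(redacted)
```

Note the argument order: the pattern always comes first, and for re.sub the replacement comes before the text.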

Raw strings

When using regular expressions in Python, it's a good idea to use raw strings, denoted by an r before the quotes, e.g. r'exp'.
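
To see why raw strings matter, consider the word boundary \b. In a regular Python string, "\b" is a single backspace character, so the regex engine never sees the backslash. A small sketch (the title string is made up for illustration):

```python
import re

title = "assistant to the manager"

# "\b" is one character (backspace); r"\b" is two characters (backslash and 'b').
print(len("\b"), len(r"\b"))

# Without the r prefix, the pattern looks for backspace + 'to' + backspace.
not_raw = re.findall("\bto\b", title)

# With the r prefix, the pattern looks for the standalone word 'to'.
raw = re.findall(r"\bto\b", title)

print(not_raw, raw)
```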

Capture groups

Example: Log parsing

Web servers typically record every request made of them in the "logs".

Let's use our new regex syntax (including capturing groups) to extract the day, month, year, and time from the log string s.
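
A sketch of what this extraction could look like is below. Both the log string s and the pattern exp are assumptions written for illustration (a line in a common web-server log format), not necessarily the ones used in lecture. Each pair of parentheses is a capture group, and the time is captured as three groups: hour, minute, and second.

```python
import re

# Hypothetical log line in a common web-server format.
s = '127.0.0.1 - - [05/May/2022:14:26:15 -0800] "GET /index.html HTTP/1.1" 200 2585'

# \[ and \] match literal brackets; each (...) captures one piece of the timestamp.
exp = r"\[(.+)/(.+)/(.+):(.+):(.+):(.+) .+\]"

match = re.search(exp, s)
day, month, year, hour, minute, second = match.groups()
print(day, month, year, f"{hour}:{minute}:{second}")

# When a pattern has groups, re.findall returns tuples of the captured pieces.
found = re.findall(exp, s)
print(found)
```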

While the above regex works, it is not very specific: it also matches incorrectly formatted log strings.

The more specific, the better!

A benefit of new_exp over exp is that it doesn't capture anything when the string doesn't follow the format we specified.
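
The difference can be sketched as follows. The names exp and new_exp come from the text, but the pattern bodies and the strings good and bad are assumptions made up for illustration. In particular, this new_exp assumes a two-digit day, a three-letter capitalized month, a four-digit year, and a negative four-digit UTC offset.

```python
import re

# Loose pattern: anything between the separators is accepted.
exp = r"\[(.+)/(.+)/(.+):(.+):(.+):(.+) .+\]"

# Stricter pattern (assumed format): digits and letters must appear where expected.
new_exp = r"\[(\d{2})/([A-Z][a-z]{2})/(\d{4}):(\d{2}):(\d{2}):(\d{2}) -\d{4}\]"

good = '127.0.0.1 - - [05/May/2022:14:26:15 -0800] "GET /index.html HTTP/1.1" 200 2585'
bad = "[adr/jduy/wffsdffs:r4s4:4wsgdfd:asdf 7]"  # made-up malformed string

loose_on_bad = re.findall(exp, bad)        # still captures six meaningless pieces
strict_on_bad = re.findall(new_exp, bad)   # captures nothing
strict_on_good = re.findall(new_exp, good)

print(loose_on_bad)
print(strict_on_bad)
print(strict_on_good)
```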

Limitations

Limitations of regexes

Writing a regular expression is like writing a program.

Regular expressions are terrible at certain types of problems. Examples:

Below is a regular expression that validates email addresses in Perl. See this article for more details.

StackOverflow crashed due to regex! See this article for the details.

Text features

Review: Regression and features

$$\text{predicted salary} = w_0^* + w_1^* \cdot \text{GPA} + w_2^* \cdot \text{experience} + w_3^* \cdot \text{education}$$

Moving forward

Suppose we'd like to predict the sentiment of a piece of text on a scale from 1 to 10.

Example:

Text features

Example: San Diego employee salaries

Aside on privacy and ethics

Goal: Quantifying similarity

Exploring job titles

How many employees are in the dataset? How many unique job titles are there?

What are the most common job titles?

Are there any missing job titles?

There aren't many. To avoid having to deal with missing values later on, let's just drop the two missing job titles now.

Canonicalization

Remember, our goal is ultimately to count the number of shared words between job titles. But before we start counting the number of shared words, we need to consider the following:

Let's address the above issues. The process of converting job titles so that they are always represented the same way is called canonicalization.

Punctuation

Are there job titles with unnecessary punctuation that we can remove?

It seems like we should replace these pieces of punctuation with a single space.
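
One way to do this replacement is with re.sub. This is a minimal sketch on a plain Python list: the titles and the helper replace_punctuation are made up for illustration, and "punctuation" is assumed to mean anything other than letters, digits, and spaces.

```python
import re

# Made-up job titles containing punctuation.
titles = ["Fire Captain - Metro Arson/Strike Team", "Park Aide (Seasonal)"]

def replace_punctuation(title):
    # Replace each run of non-alphanumeric, non-space characters with one space.
    spaced = re.sub(r"[^A-Za-z0-9 ]+", " ", title)
    # Collapse any repeated spaces this creates, and trim the ends.
    return re.sub(r" +", " ", spaced).strip()

cleaned = [replace_punctuation(t) for t in titles]
print(cleaned)
```

The same pattern should also work on a pandas Series through Series.str.replace with regex=True.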

"Glue" words

Are there job titles with "glue" words in the middle, such as 'Assistant to the Manager'?

To figure out if any titles contain the word 'to', we can't just do the following, because it will evaluate to True for job titles that have 'to' anywhere in them, even if not as a standalone word.

Instead, we need to look for 'to' separated by word boundaries.

We can look for other filler words too, like 'the' and 'for'.
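
The difference between the two approaches can be sketched on a small, made-up list of titles. The naive check looks for 'to' as a substring, while \b restricts the match to standalone words.

```python
import re

# Made-up titles; only the first contains 'to' as a standalone word.
titles = ["assistant to the manager", "motor sweeper operator", "custodian", "police officer"]

# Naive substring check: also True for 'motor', 'operator', and 'custodian'.
naive = ["to" in t for t in titles]

# Word-boundary check: True only when 'to' stands alone.
bounded = [bool(re.search(r"\bto\b", t)) for t in titles]

# Several filler words can be checked at once with alternation.
any_glue = [bool(re.search(r"\b(?:to|the|for)\b", t)) for t in titles]

print(naive)
print(bounded)
print(any_glue)
```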

We should probably remove these "glue" words.

Fixing punctuation and removing "glue" words

Let's put the following two steps together, and canonicalize job titles by:
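
A minimal sketch of one possible canonicalization is below. It assumes the steps are replacing punctuation with a space and removing the glue words 'to', 'the', and 'for'; lowercasing is an extra assumption, as are the helper name canonicalize and the sample titles.

```python
import re

GLUE_WORDS = ["to", "the", "for"]  # assumed list of glue words

def canonicalize(title):
    title = title.lower()                        # assumption: compare case-insensitively
    title = re.sub(r"[^a-z0-9 ]+", " ", title)   # punctuation -> space
    glue = r"\b(?:" + "|".join(GLUE_WORDS) + r")\b"
    title = re.sub(glue, " ", title)             # drop standalone glue words
    return re.sub(r" +", " ", title).strip()     # tidy up leftover spaces

print(canonicalize("Assistant to the Manager"))
print(canonicalize("Program Coordinator - Parks & Rec. (for Youth)"))
```

The word boundaries matter here: 'coordinator' contains the letters 'to', but it is left intact because they do not form a standalone word.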

Possible issue: inconsistent representations

Another possible issue is that some job titles may have inconsistent representations of the same word (e.g. 'Asst.' vs 'Assistant').

The 2020 salaries dataset had several of these issues, but fortunately they appear to be fixed for us in the 2021 dataset (thanks, Transparent California).

Bag of words 💰

Text similarity

Recall, our idea is to measure the similarity of two job titles by counting the number of shared words between the job titles. How do we actually do that, for all of the job titles we have?

A counts matrix

Let's create a "counts" matrix, such that:

Such a matrix might look like:

job title | senior | lecturer | teaching | professor | assistant | associate
senior lecturer | 1 | 1 | 0 | 0 | 0 | 0
assistant teaching professor | 0 | 0 | 1 | 1 | 1 | 0
associate professor | 0 | 0 | 0 | 1 | 0 | 1
senior assistant to the assistant professor | 1 | 0 | 0 | 1 | 2 | 0
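
The example matrix above can be reproduced in a few lines of plain Python. This is a sketch, with the six column words listed by hand and the variable names chosen for illustration.

```python
titles = [
    "senior lecturer",
    "assistant teaching professor",
    "associate professor",
    "senior assistant to the assistant professor",
]
words = ["senior", "lecturer", "teaching", "professor", "assistant", "associate"]

# One row per title, one column per word: count how often the word appears in the title.
counts = [[title.split().count(word) for word in words] for title in titles]

for title, row in zip(titles, counts):
    print(row, title)
```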

Creating a counts matrix

First, we need to determine all words that are used across all job titles.

Next, to determine the columns of our matrix, we need to find a list of all unique words used in titles. We can do this with np.unique, but value_counts shows us the distribution, which is interesting.

Note that in unique_words.index, words are sorted by number of occurrences!

For each of the 327 unique words that are used in job titles, we can count the number of occurrences of the word in each job title.

counts_df has one row for each of the 12303 employees, and one column for each unique word that is used in a job title.
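
The construction described above might look like the following sketch. The names unique_words and counts_df come from the text; the tiny jobtitles Series is made up, and the exact pandas calls are one reasonable choice rather than necessarily the ones used in lecture.

```python
import re
import pandas as pd

# Made-up, already-canonicalized job titles standing in for the real dataset.
jobtitles = pd.Series(["police officer", "fire captain", "police sergeant", "deputy fire chief"])

# Split each title into words, flatten to one word per row, then count occurrences.
# value_counts sorts by count, so the most common words come first in the index.
unique_words = jobtitles.str.split().explode().value_counts()

# One column per unique word; each entry counts that word within one job title.
counts_df = pd.DataFrame({
    word: jobtitles.str.count(rf"\b{re.escape(word)}\b")
    for word in unique_words.index
})

print(unique_words)
print(counts_df)
```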

To put into context what the numbers in counts_df mean, we can show the actual job title for each row.

The fourth row tells us that the fourth job title contains 'police' once and 'officer' once.

Interpreting the counts matrix

The Series below describes the 20 most common words used in job titles, along with the number of times they appeared in all job titles (including repeats). We will call these words "top 20" words.

The Series below describes the number of top 20 words used in each job title.
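
Both Series can be computed from the counts matrix with column and row sums. A sketch on a made-up counts_df, using the top 2 words instead of the top 20 to keep the example small:

```python
import pandas as pd

# Made-up counts matrix: rows are job titles, columns are words.
counts_df = pd.DataFrame(
    {
        "police": [1, 0, 1, 0],
        "fire": [0, 1, 0, 0],
        "officer": [1, 0, 0, 0],
        "captain": [0, 1, 1, 0],
        "librarian": [0, 0, 0, 1],
    },
    index=["police officer", "fire captain", "police captain", "librarian"],
)

k = 2  # the lecture uses 20

# Column sums give total occurrences of each word across all titles.
top_words = counts_df.sum().sort_values(ascending=False).head(k)

# Row sums over just the top-k columns give the number of top words in each title.
top_per_title = counts_df[top_words.index].sum(axis=1)

print(top_words)
print(top_per_title)
```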

Question: What job titles are most similar to 'deputy fire chief'?

To start, let's compare the row vectors for 'deputy fire chief' and 'fire battalion chief'.

We can stack these two vectors horizontally.

One way to measure how similar the above two vectors are is through their dot product.

Here, since both vectors consist only of 1s and 0s, the dot product is equal to the number of shared words between the two job titles.
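
A small sketch of this idea, using a four-word vocabulary chosen by hand so the vectors stay short (the helper to_vector is made up for illustration):

```python
import numpy as np

vocabulary = ["deputy", "fire", "chief", "battalion"]

def to_vector(title):
    # Count how often each vocabulary word appears in the title.
    words = title.split()
    return np.array([words.count(w) for w in vocabulary])

a = to_vector("deputy fire chief")
b = to_vector("fire battalion chief")

# Each position contributes 1 to the dot product only if both titles contain that word.
dot = int(np.dot(a, b))
shared = set("deputy fire chief".split()) & set("fire battalion chief".split())

print(a, b, dot, shared)
```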

Aside: Dot product

$$\vec{a} \cdot \vec{b} = a_1b_1 + a_2b_2 + ... + a_nb_n$$
$$\vec{a} \cdot \vec{b} = |\vec{a}| |\vec{b}| \cos \theta$$
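
The two formulas can be checked against each other numerically. In this sketch, the 2D vectors are made up, and the angle between them is computed independently from each vector's direction.

```python
import math

a = (3.0, 4.0)
b = (4.0, 3.0)

# Algebraic definition: sum of elementwise products.
algebraic = a[0] * b[0] + a[1] * b[1]

# Geometric definition: product of the lengths times the cosine of the angle between them.
norm_a = math.hypot(*a)
norm_b = math.hypot(*b)
theta = math.atan2(a[1], a[0]) - math.atan2(b[1], b[0])
geometric = norm_a * norm_b * math.cos(theta)

print(algebraic, geometric)
```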

Computing similarities

To find the job title that is most similar to 'deputy fire chief', we can compute the dot product of the 'deputy fire chief' word vector with all other titles' word vectors, and find the title with the highest dot product.

To do so, we can apply np.dot to each row that doesn't correspond to 'deputy fire chief'.
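
A sketch of this step on a made-up counts matrix, indexed by job title for readability (the real counts_df has one row per employee):

```python
import numpy as np
import pandas as pd

# Made-up counts matrix: rows are job titles, columns are words.
counts_df = pd.DataFrame(
    [
        [1, 1, 1, 0, 0, 0],
        [0, 1, 1, 1, 0, 0],
        [0, 1, 0, 0, 1, 0],
        [0, 0, 0, 0, 0, 1],
    ],
    columns=["deputy", "fire", "chief", "battalion", "captain", "librarian"],
    index=["deputy fire chief", "fire battalion chief", "fire captain", "librarian"],
)

target = counts_df.loc["deputy fire chief"].to_numpy()
others = counts_df.drop(index="deputy fire chief")

# Dot each remaining row with the target vector, then sort from most to least similar.
similarities = others.apply(lambda row: np.dot(row.to_numpy(), target), axis=1)
similarities = similarities.sort_values(ascending=False)

print(similarities)
```

Applying np.dot row by row mirrors the description above; a single matrix-vector product over all rows is an equivalent and typically faster alternative.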

The unique job titles that are most similar to 'deputy fire chief' are given below.

Note that they all share two words in common with 'deputy fire chief'.

Note: To truly use the dot product as a measure of similarity, we should normalize by the lengths of the word vectors. More on this next time.

Bag of words


Aside: Interactive bag of words demo

Check this site out – it automatically generates a bag of words matrix for you!


Summary, next time

Summary

Next time