Lecture 17 – Regular Expressions¶

DSC 80, Winter 2023¶

📣 Announcements¶

Lab 6 (web scraping and APIs) is due today at 4:00PM (no slip days!).
- No slip days are allowed because we will take up the solutions in Discussion 6 at 5:00PM today.
Project 3 is due tomorrow at 11:59PM.
Midterm Exam scores are released, and regrades are due tonight at 11:59PM.
- Remember that it's only worth 10%, and that we have a redemption policy.
Lab 7 (regular expressions and text features) is due on Monday, February 27th at 11:59PM.
Followup from last class: BeautifulSoup objects are mutable! See this post on Ed by Trey for more details.

Agenda¶

Lots and lots of regular expressions! Good resources:

regex101.com, a helpful site to have open while writing regular expressions.
Python re library documentation and how-to.
- The "how-to" is great, read it!
regex "cheat sheet" (taken from here).

See dsc80.com/resources/#regular-expressions.

Motivation¶

In [1]:

contact = '''
Thank you for buying our expensive product!

If you have a complaint, please send it to complaints@compuserve.com or call (800) 867-5309.

If you are happy with your purchase, please call us at (800) 123-4567; we'd love to hear from you!

Due to high demand, please allow one-hundred (100) business days for a response.
'''

Who called? 📞¶

Goal: Extract all phone numbers from a piece of text, assuming they are of the form '(###) ###-####'.

In [2]:

print(contact)

Thank you for buying our expensive product!

If you have a complaint, please send it to complaints@compuserve.com or call (800) 867-5309.

If you are happy with your purchase, please call us at (800) 123-4567; we'd love to hear from you!

Due to high demand, please allow one-hundred (100) business days for a response.

We can do this using the same string methods we've come to know and love.

Strategy:
- Split by spaces.
- Check if there are any consecutive "words" where:
  - the first "word" looks like an area code, like '(678)'.
  - the second "word" looks like the last 7 digits of a phone number, like '999-8212'.

Let's first write a function that takes in a string and returns whether it looks like an area code.

In [3]:

def is_possibly_area_code(s):
    '''Does `s` look like (678)?'''
    return len(s) == 5 and s.startswith('(') and s.endswith(')') and s[1:4].isnumeric()

In [4]:

is_possibly_area_code('(123)')

Out[4]:

True

In [5]:

is_possibly_area_code('(99)')

Out[5]:

False

Let's also write a function that takes in a string and returns whether it looks like the last 7 digits of a phone number.

In [6]:

def is_last_7_phone_number(s):
    '''Does `s` look like 999-8212?'''
    return len(s) == 8 and s[0:3].isnumeric() and s[3] == '-' and s[4:].isnumeric()

In [7]:

is_last_7_phone_number('999-8212')

Out[7]:

True

In [8]:

is_last_7_phone_number('534 1100')

Out[8]:

False

Finally, let's split the entire text by spaces, and check whether there are any instances where pieces[i] looks like an area code and pieces[i+1] looks like the last 7 digits of a phone number.

In [9]:

# Removes punctuation from the end of each string.
pieces = [s.rstrip('.,?;"\'') for s in contact.split()]

for i in range(len(pieces) - 1):
    if is_possibly_area_code(pieces[i]):
        if is_last_7_phone_number(pieces[i+1]):
            print(pieces[i], pieces[i+1])

(800) 867-5309
(800) 123-4567

Is there a better way?¶

This was an example of pattern matching.

It can be done with string methods, but there is often a better approach: regular expressions.

In [10]:

print(contact)

Thank you for buying our expensive product!

If you have a complaint, please send it to complaints@compuserve.com or call (800) 867-5309.

If you are happy with your purchase, please call us at (800) 123-4567; we'd love to hear from you!

Due to high demand, please allow one-hundred (100) business days for a response.

In [11]:

import re
re.findall(r'\(\d{3}\) \d{3}-\d{4}', contact)

Out[11]:

['(800) 867-5309', '(800) 123-4567']

🤯

Basic regular expressions¶

Regular expressions¶

A regular expression, or regex for short, is a sequence of characters used to match patterns in strings.
- For example, \(\d{3}\) \d{3}-\d{4} describes a pattern that matches US phone numbers of the form '(XXX) XXX-XXXX'.
- Think of regex as a "mini-language" (formally: they are a grammar for describing a language).
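To make the phone-number pattern concrete, here is a minimal sketch that assembles it piece by piece. The variable names are chosen only for illustration, and `re.fullmatch` (a real function in Python's `re` module, though not otherwise used in this lecture) checks whether an entire string fits a pattern.

```python
import re

# Each piece of the phone-number pattern, named for illustration only.
area_code = r'\(\d{3}\)'   # a literal '(', exactly three digits, a literal ')'
prefix = r'\d{3}'          # exactly three digits
line_number = r'\d{4}'     # exactly four digits

phone_pattern = area_code + ' ' + prefix + '-' + line_number

# re.fullmatch succeeds only if the whole string fits the pattern.
is_match = re.fullmatch(phone_pattern, '(800) 867-5309') is not None
is_non_match = re.fullmatch(phone_pattern, '800-867-5309') is not None

print(phone_pattern)
print(is_match, is_non_match)
```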

Pros: They are very powerful and are widely used (virtually every programming language has a module for working with them).

Cons: They can be hard to read and have many different "dialects."

Writing regular expressions¶

You will ultimately write most of your regular expressions in Python, using the re module. We will see how to do so shortly.

However, a useful tool for designing regular expressions is regex101.com.

We will use it heavily during lecture; you should have it open as we work through examples. If you're trying to revisit this lecture in the future, you'll likely want to watch the podcast.

Literals¶

A literal is a character that has no special meaning.

Letters, numbers, and some symbols are all literals.

Some symbols, like ., *, (, and ), are special characters.

Example: The regex hey matches the string 'hey'. The regex he. also matches the string 'hey'.
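As a quick check of the example above, the sketch below compares the all-literal regex `hey` with `he.`, where `.` is a special character. It assumes `re.fullmatch`, which requires the whole string to fit the pattern; the variable names are illustrative.

```python
import re

literal_hey = re.fullmatch('hey', 'hey') is not None    # literals match themselves
literal_hex = re.fullmatch('hey', 'hex') is not None    # the literal 'y' does not match 'x'
wildcard_hey = re.fullmatch('he.', 'hey') is not None   # '.' stands for any one character
wildcard_hex = re.fullmatch('he.', 'hex') is not None

print(literal_hey, literal_hex, wildcard_hey, wildcard_hex)
```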

Regex building blocks 🧱¶

The four main building blocks for all regexes are shown below (table source, inspiration).

operation	order of op.	example	matches ✅	does not match ❌
concatenation	3	`AABAAB`	`'AABAAB'`	every other string
or	4	`AA|BAAB`	`'AA'`, `'BAAB'`	every other string
closure (zero or more)	2	`AB*A`	`'AA'`, `'ABBBBBBA'`	`'AB'`, `'ABABA'`
parentheses	1	`A(A|B)AAB`	`'AAAAB'`, `'ABAAB'`	every other string
parentheses	1	`(AB)*A`	`'A'`, `'ABABABABA'`	`'AA'`, `'ABBA'`

Note that |, (, ), and * are special characters, not literals. They manipulate the characters around them.
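The rows of the table can be verified in Python. This is only a sketch: `fully_matches` is a hypothetical helper (not part of `re`) that wraps `re.fullmatch`, so "matches" here means the entire string fits the pattern.

```python
import re

def fully_matches(pattern, s):
    """Hypothetical helper: does the entire string `s` fit `pattern`?"""
    return re.fullmatch(pattern, s) is not None

# (pattern, string) pairs taken from the table's "matches" column.
should_match = [
    ('AABAAB', 'AABAAB'),
    ('AA|BAAB', 'AA'), ('AA|BAAB', 'BAAB'),
    ('AB*A', 'AA'), ('AB*A', 'ABBBBBBA'),
    ('A(A|B)AAB', 'AAAAB'), ('A(A|B)AAB', 'ABAAB'),
    ('(AB)*A', 'A'), ('(AB)*A', 'ABABABABA'),
]

# (pattern, string) pairs taken from the "does not match" column.
should_not_match = [
    ('AB*A', 'AB'), ('AB*A', 'ABABA'),
    ('(AB)*A', 'AA'), ('(AB)*A', 'ABBA'),
]

all_match = all(fully_matches(p, s) for p, s in should_match)
none_match = not any(fully_matches(p, s) for p, s in should_not_match)
print(all_match, none_match)
```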

Example (or, parentheses):

What does DSC 30|80 match?
What does DSC (30|80) match?

Example (closure, parentheses):

What does blah* match?
What does (blah)* match?
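One way to explore these four questions is to test a few strings against each pattern. The sketch below assumes `re.fullmatch` (whole-string matching), and the sample strings are chosen for illustration.

```python
import re

def fully_matches(pattern, s):
    """Hypothetical helper: does the entire string `s` fit `pattern`?"""
    return re.fullmatch(pattern, s) is not None

# Without parentheses, "or" splits the whole regex: 'DSC 30' or '80'.
no_parens = [fully_matches('DSC 30|80', s) for s in ['DSC 30', '80', 'DSC 80']]

# With parentheses, "or" applies only to the course number.
with_parens = [fully_matches('DSC (30|80)', s) for s in ['DSC 30', '80', 'DSC 80']]

# Without parentheses, * applies only to the 'h' right before it.
star_on_h = [fully_matches('blah*', s) for s in ['bla', 'blahhh', 'blahblah']]

# With parentheses, * applies to the whole group 'blah'.
star_on_group = [fully_matches('(blah)*', s) for s in ['', 'blahblah', 'blahhh']]

print(no_parens, with_parens, star_on_h, star_on_group)
```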

Exercise¶

Write a regular expression that matches 'billy', 'billlly', 'billlllly', etc.

First, think about how to match strings with any even number of 'l's, including zero 'l's (i.e. 'biy').
Then, think about how to match only strings with a positive even number of 'l's.

✅ Click here to see the answer after you've tried it yourself at regex101.com.

bi(ll)*y will match any even number of 'l's, including 0. To match only a positive even number of 'l's, we'd need to first "fix into place" two 'l's, and then follow that up with zero or more pairs of 'l's. This specifies the regular expression bill(ll)*y.
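Both answers can be sanity-checked in Python. This sketch assumes `re.fullmatch` for whole-string matching; the dictionary names are illustrative.

```python
import re

any_even = r'bi(ll)*y'         # zero or more pairs of 'l's
positive_even = r'bill(ll)*y'  # one fixed pair, then zero or more extra pairs

samples = ['biy', 'bily', 'billy', 'billly', 'billlly']

any_even_results = {s: re.fullmatch(any_even, s) is not None for s in samples}
positive_even_results = {s: re.fullmatch(positive_even, s) is not None for s in samples}

print(any_even_results)
print(positive_even_results)
```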

Exercise¶

Write a regular expression that matches 'billy', 'billlly', 'biggy', 'biggggy', etc.

Specifically, it should match any string with a positive even number of 'l's in the middle, or a positive even number of 'g's in the middle.

✅ Click here to see the answer after you've tried it yourself at regex101.com.

Possible answers: bi(ll(ll)*|gg(gg)*)y or bill(ll)*y|bigg(gg)*y.
Note, bill(ll)*|gg(gg)*y is not a valid answer! This is because "concatenation" comes before "or" in the order of operations. This regular expression would match strings that match bill(ll)*, like 'billll', OR strings that match gg(gg)*y, like 'ggy'.

Intermediate regex¶

More regex syntax¶

operation	example	matches ✅	does not match ❌
wildcard	`.U.U.U.`	`'CUMULUS'` `'JUGULUM'`	`'SUCCUBUS'` `'TUMULTUOUS'`
character class	`[A-Za-z][a-z]*`	`'word'` `'Capitalized'`	`'camelCase'` `'4illegal'`
at least one	`bi(ll)+y`	`'billy'` `'billlllly'`	`'biy'` `'bily'`
between $i$ and $j$ occurrences	`m[aeiou]{1,2}m`	`'mem'` `'maam'` `'miem'`	`'mm'` `'mooom'` `'meme'`

., [, ], +, {, and } are also special characters, in addition to |, (, ), and *.

Example (character classes, at least one): [A-E]+ is just shorthand for (A|B|C|D|E)(A|B|C|D|E)*.
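The claimed equivalence can be spot-checked on a handful of strings. This is only a sketch on illustrative samples (not a proof), using `re.fullmatch` for whole-string matching.

```python
import re

short_form = r'[A-E]+'
long_form = r'(A|B|C|D|E)(A|B|C|D|E)*'

samples = ['A', 'ABE', 'EDCBA', '', 'F', 'ABF', 'abc']

# For each sample, both regexes should agree on whether the string matches.
short_results = [re.fullmatch(short_form, s) is not None for s in samples]
long_results = [re.fullmatch(long_form, s) is not None for s in samples]

print(short_results == long_results)
```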

Example (wildcard):

What does . match?
What does he. match?
What does ... match?

Example (at least one, closure):

What does 123+ match?
What does 123* match?

Example (number of occurrences): What does tri{3,5} match? Does it match 'triiiii'?

Example (character classes, number of occurrences): What does [1-6a-f]{3}-[7-9E-S]{2} match?
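The quantifier examples above can be tried out as follows. This sketch assumes `re.fullmatch` for whole-string matching, writes the repetition range as `{3,5}` with no space inside the braces, and uses sample strings chosen for illustration.

```python
import re

def fully_matches(pattern, s):
    """Hypothetical helper: does the entire string `s` fit `pattern`?"""
    return re.fullmatch(pattern, s) is not None

# + and * apply only to the character right before them (the '3').
plus_results = [fully_matches('123+', s) for s in ['12', '123', '12333']]
star_results = [fully_matches('123*', s) for s in ['12', '123', '12333']]

# {3,5} means "between 3 and 5 of the preceding character", inclusive.
range_results = [fully_matches('tri{3,5}', s) for s in ['tri', 'triii', 'triiiii', 'triiiiii']]

# Three characters from 1-6 or a-f, a hyphen, then two characters from 7-9 or E-S.
class_results = [fully_matches('[1-6a-f]{3}-[7-9E-S]{2}', s)
                 for s in ['1a6-7S', 'abc-99', 'g12-77', '123-12']]

print(plus_results, star_results, range_results, class_results)
```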

Exercise¶

Write a regular expression that matches any lowercase string that has a repeated vowel, such as 'noon', 'peel', 'festoon', or 'zeebraa'.

✅ Click here to see the answer after you've tried it yourself at regex101.com.

One answer: [a-z]*(aa|ee|ii|oo|uu)[a-z]*
This regular expression matches strings of lowercase characters that have 'aa', 'ee', 'ii', 'oo', or 'uu' in them anywhere. [a-z]* means "zero or more of any lowercase characters"; essentially we are saying it doesn't matter what letters come before or after the double vowels, as long as the double vowels exist somewhere.
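A quick check of this answer on the examples from the exercise, plus a few strings that should be rejected. The sketch assumes `re.fullmatch` for whole-string matching; the rejected samples are chosen for illustration.

```python
import re

repeated_vowel = r'[a-z]*(aa|ee|ii|oo|uu)[a-z]*'

accepted = ['noon', 'peel', 'festoon', 'zeebraa']
rejected = ['none', 'Noon', 'no on']   # no double vowel, an uppercase letter, a space

accepted_results = [re.fullmatch(repeated_vowel, s) is not None for s in accepted]
rejected_results = [re.fullmatch(repeated_vowel, s) is not None for s in rejected]

print(accepted_results, rejected_results)
```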

Exercise¶

Write a regular expression that matches any string that contains both a lowercase letter and a number, in any order. Examples include 'billy80', '80!!billy', and 'bil8ly0'.

✅ Click here to see the answer after you've tried it yourself at regex101.com.

One answer: (.*[a-z].*[0-9].*)|(.*[0-9].*[a-z].*)
We can break the above regex into two parts – everything before the `|`, and everything after the `|`. The first part, .*[a-z].*[0-9].*, matches strings in which there is at least one lowercase character and at least one digit, with the lowercase character coming first. The second part, .*[0-9].*[a-z].*, matches strings in which there is at least one lowercase character and at least one digit, with the digit coming first. Note, the .* between the digit and letter classes is needed in the event the string has non-digit and non-letter characters. This is the kind of task that would be easier to accomplish with regular Python string methods.
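To illustrate the last point, here is a sketch that compares the regex answer with one possible plain-Python version. The function `has_lower_and_digit` is a hypothetical helper written for this comparison; it checks the same ASCII ranges as `[a-z]` and `[0-9]`.

```python
import re

regex_answer = r'(.*[a-z].*[0-9].*)|(.*[0-9].*[a-z].*)'

def has_lower_and_digit(s):
    """Hypothetical plain-Python alternative to the regex above."""
    has_lower = any('a' <= c <= 'z' for c in s)
    has_digit = any('0' <= c <= '9' for c in s)
    return has_lower and has_digit

samples = ['billy80', '80!!billy', 'bil8ly0', 'billy', '8080', 'BILLY80']

regex_results = [re.fullmatch(regex_answer, s) is not None for s in samples]
python_results = [has_lower_and_digit(s) for s in samples]

print(regex_results)
print(python_results)
```

On these samples the two approaches agree, and the plain-Python version does not need to spell out both orderings.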

Even more regex syntax¶

operation	example	matches ✅	does not match ❌
escape character	`ucsd\.edu`	`'ucsd.edu'`	`'ucsd!edu'`
beginning of line	`^ark`	`'ark two'` `'ark o ark'`	`'dark'`
end of line	`ark$`	`'dark'` `'ark o ark'`	`'ark two'`
zero or one	`cat?`	`'ca'` `'cat'`	`'cart'` (matches `'ca'` only)
built-in character classes*	`\w+` `\d+`	`'billy'` `'231231'`	`'this person'` `'858 people'`
character class negation	`[^a-z]+`	`'KINGTRITON551'` `'1721$$'`	`'porch'` `'billy.edu'`

*Note: in Python's implementation of regex,

\d refers to digits.
\w refers to alphanumeric characters and underscores ([A-Za-z0-9_]).
\s refers to whitespace.
\b is a word boundary.
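The built-in character classes can be seen in action with `re.findall`, which returns every match in a string. The sample sentence below is made up for illustration.

```python
import re

text = 'Call 858 people by 5pm!'

digit_runs = re.findall(r'\d+', text)     # runs of digits
word_runs = re.findall(r'\w+', text)      # runs of letters, digits, and underscores
whitespace = re.findall(r'\s', text)      # individual whitespace characters
word_starts = re.findall(r'\b\w', text)   # the first character after each word boundary

print(digit_runs)
print(word_runs)
print(len(whitespace))
print(word_starts)
```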

Example (escaping):

What does he. match?
What does he\. match?
What does (858) match?
What does \(858\) match?
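A sketch of how escaping changes what is matched, using made-up sample strings. `re.findall` returns all matches, and `.group(0)` on the result of `re.search` returns the full text of the first match.

```python
import re

text = 'hey he. hex'

unescaped_dot = re.findall(r'he.', text)   # '.' is a wildcard
escaped_dot = re.findall(r'he\.', text)    # '\.' is a literal period

phone_text = 'call (858) now'

# Unescaped parentheses only group; they do not match '(' and ')' in the text.
unescaped_parens = re.search(r'(858)', phone_text).group(0)

# Escaped parentheses match the literal characters '(' and ')'.
escaped_parens = re.search(r'\(858\)', phone_text).group(0)

print(unescaped_dot, escaped_dot)
print(unescaped_parens, escaped_parens)
```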

Example (anchors):

What does 858-534 match?
What does ^858-534 match?
What does 858-534$ match?
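The anchors can be explored with `re.search`, which looks for a match anywhere in the string and returns `None` when there is none. The sample strings below are assumptions made for illustration.

```python
import re

starts_with = '858-534-2230'
ends_with = 'call 858-534'

# No anchors: the pattern may appear anywhere.
anywhere = [re.search(r'858-534', s) is not None for s in [starts_with, ends_with]]

# ^ requires the match to begin at the start of the string.
at_start = [re.search(r'^858-534', s) is not None for s in [starts_with, ends_with]]

# $ requires the match to finish at the end of the string.
at_end = [re.search(r'858-534$', s) is not None for s in [starts_with, ends_with]]

print(anywhere, at_start, at_end)
```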

Example (built-in character classes)¶

*Note: in Python's implementation of regex,

\d refers to digits.
\w refers to alphanumeric characters and underscores ([A-Za-z0-9_]).
\s refers to whitespace.
\b is a word boundary.

What does \d{3} \d{3}-\d{4} match?
What does \bcat\b match? Does it find a match in 'my cat is hungry'? What about 'concatenate'?
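Both questions can be answered by experiment. In this sketch the phone-number sentence is made up for illustration; note that the first pattern has no parentheses, so it does not match numbers written as '(800) 123-4567'.

```python
import re

# Three digits, a space, three digits, a hyphen, four digits.
phones = re.findall(r'\d{3} \d{3}-\d{4}', 'call 800 867-5309 or (800) 123-4567')

# \b requires a word boundary on each side of 'cat'.
in_sentence = re.findall(r'\bcat\b', 'my cat is hungry')
in_longer_word = re.findall(r'\bcat\b', 'concatenate')

# Without the boundaries, 'cat' is found inside 'concatenate'.
without_boundaries = re.findall(r'cat', 'concatenate')

print(phones)
print(in_sentence, in_longer_word, without_boundaries)
```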

Exercise¶

Write a regular expression that matches any string that:

is between 5 and 10 characters long, and
is made up of only vowels (either uppercase or lowercase, including 'Y' and 'y'), periods, and spaces.

Examples include 'yoo.ee.IOU' and 'AI.I oey'.

✅ Click here to see the answer after you've tried it yourself at regex101.com.

One answer: ^[aeiouyAEIOUY. ]{5,10}$
Key idea: Within a character class (i.e. [...]), special characters do not generally need to be escaped.
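A quick check of this answer. Because the regex already contains `^` and `$`, the sketch uses `re.search`; the rejected samples are chosen for illustration.

```python
import re

vowel_pattern = r'^[aeiouyAEIOUY. ]{5,10}$'

accepted = ['yoo.ee.IOU', 'AI.I oey']
rejected = ['aeio', 'aeiouaeioua', 'hello']   # too short, too long, contains consonants

accepted_results = [re.search(vowel_pattern, s) is not None for s in accepted]
rejected_results = [re.search(vowel_pattern, s) is not None for s in rejected]

print(accepted_results, rejected_results)
```

Inside the character class, the period is treated as a literal period rather than as a wildcard, which is why 'hello' is rejected.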

Regex in Python¶

`re` in Python¶

The re module is built into Python. It allows us to use regular expressions to find, extract, and replace strings.

In [12]:

import re

re.search takes in a string regex and a string text and returns a match object describing the location and substring of the first match of regex in text, or None if there is no match.

In [13]:

re.search('AB*A', 
          'here is a string for you: ABBBA. here is another: ABBBBBBBA')

Out[13]:

<re.Match object; span=(26, 31), match='ABBBA'>

re.findall takes in a string regex and a string text and returns a list of all matches of regex in text. You'll use this most often.

In [14]:

re.findall('AB*A', 
           'here is a string for you: ABBBA. here is another: ABBBBBBBA')

Out[14]:

['ABBBA', 'ABBBBBBBA']
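The two functions return different kinds of objects. The sketch below reuses the string from the cells above; `.group()`, `.span()`, and `.start()` are methods of the match object that `re.search` returns, and the pattern 'XYZ' is just an example of a regex with no match.

```python
import re

text = 'here is a string for you: ABBBA. here is another: ABBBBBBBA'

# re.search returns a match object for the first match only.
first = re.search('AB*A', text)
first_text = first.group()   # the matched substring
first_span = first.span()    # (start, end) positions in text
first_start = first.start()

# When nothing matches, re.search returns None and re.findall returns [].
no_search = re.search('XYZ', text)
no_findall = re.findall('XYZ', text)

print(first_text, first_span, first_start)
print(no_search, no_findall)
```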

re.sub takes in a string regex, a string repl, and a string text, and replaces all matches of regex in text with repl.

In [15]:

re.sub('AB*A', 
       'billy', 
       'here is a string for you: ABBBA. here is another: ABBBBBBBA')

Out[15]:

'here is a string for you: billy. here is another: billy'

Raw strings¶

When using regular expressions in Python, it's a good idea to use raw strings, denoted by an r before the quotes, e.g. r'exp'.

In [16]:

re.findall('\bcat\b', 'my cat is hungry')

Out[16]:

[]

In [17]:

re.findall(r'\bcat\b', 'my cat is hungry')

Out[17]:

['cat']

In [18]:

# Huh?
print('\bcat\b')

cat
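The reason for the difference: in an ordinary Python string, `\b` is an escape sequence for the backspace character, so the regex engine never sees a backslash. A minimal sketch that inspects the two strings (variable names are illustrative):

```python
plain = '\bcat\b'   # Python turns each \b into a single backspace character
raw = r'\bcat\b'    # the backslashes are kept and reach the regex engine

starts_with_backspace = plain[0] == '\x08'
lengths = (len(plain), len(raw))

# A raw string is equivalent to doubling every backslash in an ordinary string.
same_as_escaped = raw == '\\bcat\\b'

print(starts_with_backspace, lengths, same_as_escaped)
```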

Capture groups¶

Surround a regex with ( and ) to define a capture group within a pattern.
Capture groups are useful for extracting relevant parts of a string.

In [19]:

re.findall(r'\w+@(\w+)\.edu', 
           'my old email was billy@notucsd.edu, my new email is notbilly@ucsd.edu')

Out[19]:

['notucsd', 'ucsd']

Notice what happens if we remove the ( and )!

In [20]:

re.findall(r'\w+@\w+\.edu', 
           'my old email was billy@notucsd.edu, my new email is notbilly@ucsd.edu')

Out[20]:

['billy@notucsd.edu', 'notbilly@ucsd.edu']

Earlier, we also saw that parentheses can be used to group parts of a regex together. When using re.findall, all groups are treated as capturing groups.

In [21]:

# A regex that matches strings with two of the same vowel followed by 3 digits
# We only want to capture the digits, but...
re.findall(r'(aa|ee|ii|oo|uu)(\d{3})', 'eeoo124')

Out[21]:

[('oo', '124')]
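One way to group without capturing is Python's non-capturing group syntax, `(?:...)`, which is part of the `re` module but has not been covered in this lecture. A sketch using the same string as above:

```python
import re

# Both sets of parentheses capture, so each match is a tuple.
both_captured = re.findall(r'(aa|ee|ii|oo|uu)(\d{3})', 'eeoo124')

# (?:...) groups the vowels without capturing them, so only the digits are returned.
digits_only = re.findall(r'(?:aa|ee|ii|oo|uu)(\d{3})', 'eeoo124')

print(both_captured)
print(digits_only)
```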

Summary, next time¶

Summary¶

Regular expressions are used to match and extract patterns from text.
You don't need to force yourself to "memorize" regex syntax – refer to the resources in the Agenda section of the lecture and on the Resources tab of the course website.
Also refer to the three tables of syntax in this lecture: "Regex building blocks", "More regex syntax", and "Even more regex syntax".
Note: You don't always have to use regular expressions! If Python/pandas string methods work for your task, you can still use those.
Play Regex Golf to practice! 🏌️

Next time¶

A few more examples of regular expressions.
Using regular expressions in pandas (through .str).
Describing text data quantitatively.
