Lecture 17 – Regular Expressions¶

DSC 80, Spring 2022¶

Announcements¶

Discussion 5 is due (for extra credit) on Saturday, May 7th at 11:59PM.
Lab 6 is due on Monday, May 9th at 11:59PM.
- You don't have to do Question 3 (even though it might work again).
Project 3 is released, and is due on Thursday, May 12th at 11:59PM.
Later this week, expect to see a "Grade Report" that contains a summary of your scores on all assignments this quarter along with a slip day counter.

Agenda¶

Lots and lots of regular expressions! Good resources:

regex101.com, a helpful site to have open while writing regular expressions.
Python re library documentation and how-to.
- The "how-to" is great, read it!
regex "cheat sheet" (taken from here).

See dsc80.com/resources/#regular-expressions.

Regex fundamentals¶

Regular expressions¶

A regular expression, or regex for short, is a sequence of characters used to match patterns in strings.
- For example, [1-9][0-9]{2}-[0-9]{3}-[0-9]{4} matches US phone numbers of the form 'XXX-XXX-XXXX'.
They are very powerful and widely used.
However, they are quite difficult to read.
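
As a quick check of the phone number pattern above, here is a minimal sketch using Python's built-in re module (covered later in this lecture). The sample strings are made up for illustration.

```python
import re

# Pattern from the bullet above: a digit 1-9, two more digits,
# a dash, three digits, a dash, and four digits.
phone = r'[1-9][0-9]{2}-[0-9]{3}-[0-9]{4}'

# Made-up sample strings; only the first has the 'XXX-XXX-XXXX' shape
# with a nonzero leading digit.
samples = ['858-534-2230', '058-534-2230', '858-5342230']

# re.fullmatch succeeds only if the ENTIRE string fits the pattern.
phone_results = {s: re.fullmatch(phone, s) is not None for s in samples}
print(phone_results)
```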

Regex building blocks 🧱¶

The four main building blocks for all regexes are shown below (table source, inspiration).

operation	order of op.	example	matches ✅	does not match ❌
concatenation	3	`AABAAB`	`'AABAAB'`	every other string
or	4	`AA|BAAB`	`'AA'`, `'BAAB'`	every other string
closure (zero or more)	2	`AB*A`	`'AA'`, `'ABBBBBBA'`	`'AB'`, `'ABABA'`
parentheses	1	`A(A|B)AAB`	`'AAAAB'`, `'ABAAB'`	every other string
parentheses	1	`(AB)*A`	`'A'`, `'ABABABABA'`	`'AA'`, `'ABBA'`

Note that |, (, ), and * are special characters, not literals. They manipulate the characters around them.

Example: AB*A matches strings with an 'A', followed by zero or more 'B's, and then an 'A'.

✅ 'AA', 'ABA', 'ABBBBBBBBBBBBBBA'
❌ 'AB', 'ABAB'

Example: (AB)*A matches strings with zero or more 'AB's, followed by an 'A'.

✅ 'A', 'ABA', 'ABABABABA'
❌ 'AA', 'ABBBBBBBA', 'ABAB'
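
The two examples above can be checked in Python. This is a small sketch; the helper name `matches` is hypothetical, and it uses re.fullmatch so that a pattern must account for the whole string.

```python
import re

def matches(pattern, s):
    """Return True if the whole string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

# AB*A: an 'A', zero or more 'B's, then an 'A'.
closure = {s: matches(r'AB*A', s) for s in ['AA', 'ABA', 'AB', 'ABAB']}

# (AB)*A: zero or more copies of 'AB', then an 'A'.
grouped = {s: matches(r'(AB)*A', s) for s in ['A', 'ABA', 'ABABABABA', 'AA', 'ABAB']}

print(closure)
print(grouped)
```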

Example 1¶

Write a regular expression that matches 'billy', 'billlly', 'billlllly', etc.

First, think about how to match strings with any even number of 'l's, including zero 'l's (i.e. 'biy').
Then, think about how to match only strings with a positive even number of 'l's.

✅ Click here to see the answer after you've tried it yourself at regex101.com.

bi(ll)*y will match any even number of 'l's, including 0. To match only a positive even number of 'l's, we'd need to first "fix into place" two 'l's, and then follow that up with zero or more pairs of 'l's. This specifies the regular expression bill(ll)*y.
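
To see the difference between the two patterns, here is a short sketch that runs both against a few made-up strings.

```python
import re

any_even = r'bi(ll)*y'         # zero or more pairs of 'l'
positive_even = r'bill(ll)*y'  # one fixed pair, then zero or more pairs

words = ['biy', 'bily', 'billy', 'billly', 'billlly']

# Keep only the words that each pattern matches in full.
any_even_hits = [w for w in words if re.fullmatch(any_even, w)]
positive_even_hits = [w for w in words if re.fullmatch(positive_even, w)]

print(any_even_hits)
print(positive_even_hits)
```

Only 'biy' separates the two lists: it has zero 'l's, which is even but not positive.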

Example 2¶

Write a regular expression that matches 'billy', 'billlly', 'biggy', 'biggggy', etc.

Specifically, it should match any string with a positive even number of 'l's in the middle, or a positive even number of 'g's in the middle.

✅ Click here to see the answer after you've tried it yourself at regex101.com.

Possible answers: bi(ll(ll)*|gg(gg)*)y or bill(ll)*y|bigg(gg)*y.
Note, bill(ll)*|gg(gg)*y is not a valid answer! This is because "concatenation" comes before "or" in the order of operations. This regular expression would match strings that match bill(ll)*, like 'billll', OR strings that match gg(gg)*y, like 'ggy'.
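
The order-of-operations pitfall is easy to confirm in Python. A minimal sketch, with variable names chosen for illustration:

```python
import re

correct = r'bi(ll(ll)*|gg(gg)*)y'
wrong = r'bill(ll)*|gg(gg)*y'   # parsed as (bill(ll)*) OR (gg(gg)*y)

tests = ['billy', 'biggy', 'billll', 'ggy']

correct_hits = [s for s in tests if re.fullmatch(correct, s)]
wrong_hits = [s for s in tests if re.fullmatch(wrong, s)]

print(correct_hits)
print(wrong_hits)
```

The wrong pattern rejects both intended strings and accepts the two unintended ones.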

More regex syntax¶

operation	example	matches ✅	does not match ❌
wildcard	`.U.U.U.`	`'CUMULUS'` `'JUGULUM'`	`'SUCCUBUS'` `'TUMULTUOUS'`
character class	`[A-Za-z][a-z]*`	`'word'` `'Capitalized'`	`'camelCase'` `'4illegal'`
at least one	`bi(ll)+y`	`'billy'` `'billlllly'`	`'biy'` `'bily'`
between a and b occurrences	`m[aeiou]{1,2}m`	`'mem'` `'maam'` `'miem'`	`'mm'` `'mooom'` `'meme'`

., [, ], +, {, and } are also special characters, in addition to |, (, ), and *.

Example: [A-E]+ is just shorthand for (A|B|C|D|E)(A|B|C|D|E)*.
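
As a sanity check (not a proof), the sketch below confirms that the short and long forms agree on a handful of made-up strings, and spot-checks one row of the table above.

```python
import re

short_form = r'[A-E]+'
long_form = r'(A|B|C|D|E)(A|B|C|D|E)*'

tests = ['A', 'BEAD', 'ACE', '', 'FAB', 'abc']

# The two forms should accept and reject exactly the same strings.
agree = all(
    (re.fullmatch(short_form, s) is None) == (re.fullmatch(long_form, s) is None)
    for s in tests
)
print(agree)

# One row of the table: between 1 and 2 vowels between two 'm's.
vowel_hits = [s for s in ['mem', 'maam', 'miem', 'mm', 'mooom', 'meme']
              if re.fullmatch(r'm[aeiou]{1,2}m', s)]
print(vowel_hits)
```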

Example 3¶

Write a regular expression that matches any lowercase string that has a repeated vowel, such as 'noon', 'peel', 'festoon', or 'zeebraa'.

✅ Click here to see the answer after you've tried it yourself at regex101.com.

One answer: [a-z]*(aa|ee|ii|oo|uu)[a-z]*
This regular expression matches strings of lowercase characters that have 'aa', 'ee', 'ii', 'oo', or 'uu' in them anywhere. [a-z]* means "zero or more of any lowercase characters"; essentially we are saying it doesn't matter what letters come before or after the double vowels, as long as the double vowels exist somewhere.
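
A quick sketch of this answer in Python; the strings 'pear' and 'Noon' are added here as made-up counterexamples.

```python
import re

double_vowel = r'[a-z]*(aa|ee|ii|oo|uu)[a-z]*'

words = ['noon', 'peel', 'festoon', 'zeebraa', 'pear', 'Noon']

# 'pear' has two different vowels in a row, and 'Noon' contains an
# uppercase letter, so neither is a lowercase string with a double vowel.
double_vowel_hits = [w for w in words if re.fullmatch(double_vowel, w)]
print(double_vowel_hits)
```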

Example 4¶

Write a regular expression that matches any string that contains both a lowercase letter and a number, in any order. Examples include 'billy80', '80!!billy', and 'bil8ly0'.

✅ Click here to see the answer after you've tried it yourself at regex101.com.

One answer: (.*[a-z].*[0-9].*)|(.*[0-9].*[a-z].*)
We can break the above regex into two parts: everything before the `|`, and everything after the `|`. The first part, .*[a-z].*[0-9].*, matches strings in which there is at least one lowercase character and at least one digit, with a lowercase character coming before a digit. The second part, .*[0-9].*[a-z].*, matches strings in which there is at least one lowercase character and at least one digit, with a digit coming before a lowercase character. Note that the .* between the letter and digit classes is needed because other characters of any kind may sit between the letter and the digit.
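
The sketch below checks this answer against the three examples from the prompt plus a few made-up strings that should fail.

```python
import re

both = r'(.*[a-z].*[0-9].*)|(.*[0-9].*[a-z].*)'

tests = ['billy80', '80!!billy', 'bil8ly0', 'billy', '8080', 'BILLY80']

# 'billy' has no digit; '8080' and 'BILLY80' have no lowercase letter.
both_hits = [s for s in tests if re.fullmatch(both, s)]
print(both_hits)
```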

Email addresses¶

Suppose we want to write a regular expression that matches any string that is an '@ucsd.edu' email address.

Some issues:

In regex, . is the wildcard special character. How do we match the literal '.'?

How do we make sure that the string is only an '@ucsd.edu' email address, and doesn't contain any other characters?

Escaping special characters¶

To match a special character (e.g. . or *) as a literal, place a \ right before it to escape it.
For instance, the regular expression [A-Za-z0-9]+@ucsd\.edu matches '@ucsd.edu' email addresses (assuming email addresses can only contain letters and numbers).
- Note the \., which matches the . literal.
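
To see why the escape matters, compare the escaped pattern with an unescaped version. This is a minimal sketch; the addresses are made up.

```python
import re

unescaped = r'[A-Za-z0-9]+@ucsd.edu'   # . is a wildcard here
escaped = r'[A-Za-z0-9]+@ucsd\.edu'    # \. is a literal period

good = 'billy@ucsd.edu'
bad = 'billy@ucsdXedu'

# The wildcard version wrongly accepts 'X' in place of the period.
unescaped_accepts_bad = re.fullmatch(unescaped, bad) is not None
escaped_accepts_bad = re.fullmatch(escaped, bad) is not None
escaped_accepts_good = re.fullmatch(escaped, good) is not None

print(unescaped_accepts_bad, escaped_accepts_bad, escaped_accepts_good)
```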

Anchors ⚓️¶

Place ^ at the start of a regex to require that the matched text begins at the start of the line.
Place $ at the end of a regex to require that the matched text ends at the end of the line.
For example:
- [A-Za-z0-9]+@ucsd\.edu will match the valid UCSD email in any string.
- ^[A-Za-z0-9]+@ucsd\.edu$ will only match the valid UCSD email in a string if there is nothing else in the string.
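
The effect of the anchors shows up when searching inside a longer string. A small sketch with a made-up sentence:

```python
import re

loose = r'[A-Za-z0-9]+@ucsd\.edu'
strict = r'^[A-Za-z0-9]+@ucsd\.edu$'

text = 'email billy@ucsd.edu today'

# Without anchors, re.search finds the address inside the sentence.
loose_match = re.search(loose, text)

# With anchors, the address must be the entire string.
strict_in_sentence = re.search(strict, text)
strict_alone = re.search(strict, 'billy@ucsd.edu')

print(loose_match.group())
print(strict_in_sentence)
print(strict_alone.group())
```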

Even more regex syntax¶

operation	example	matches ✅	does not match ❌
escape character	`ucsd\.edu`	`'ucsd.edu'`	`'ucsd!edu'`
beginning of line	`^ark`	`'ark two'` `'ark o ark'`	`'dark'`
end of line	`ark$`	`'dark'` `'ark o ark'`	`'ark two'`
zero or one	`cat?`	`'ca'` `'cat'`	`'cart'` (matches `'ca'` only)
built-in character classes*	`\w+` `\d+`	`'billy'` `'231231'`	`'this person'` `'858 people'`
character class negation	`[^a-z]+`	`'KINGTRITON551'` `'1721$$'`	`'porch'` `'billy.edu'`

Note:

\d refers to digits,
\w refers to alphanumeric characters and underscores ([A-Za-z0-9_] for ASCII text), and
\s refers to whitespace.
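
A short sketch of the built-in classes and the `?` operator on a made-up sentence. It uses re.findall and re.search, which are described later in this lecture.

```python
import re

text = 'billy has 858 cats_and 2 dogs'

digits = re.findall(r'\d+', text)   # runs of digits
words = re.findall(r'\w+', text)    # runs of letters, digits, and underscores
gaps = re.findall(r'\s+', text)     # runs of whitespace

print(digits)
print(words)
print(len(gaps))

# cat? means 'ca' followed by an optional 't', so in 'cart'
# only the leading 'ca' is matched.
partial = re.search(r'cat?', 'cart').group()
print(partial)
```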

Example 5¶

Write a regular expression that matches any string that:

is between 5 and 10 characters long, and
is made up of only vowels (either uppercase or lowercase, including 'Y' and 'y'), periods, and spaces.

Examples include 'yoo.ee.IOU' and 'AI.I oey'.

✅ Click here to see the answer after you've tried it yourself at regex101.com.

One answer: ^[aeiouyAEIOUY. ]{5,10}$
Key idea: Within a character class (i.e. [...]), special characters do not generally need to be escaped.
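
Here is a sketch that checks this answer. The last three test strings are made up: one is too short, one is too long, and one contains consonants.

```python
import re

pattern = r'^[aeiouyAEIOUY. ]{5,10}$'

tests = ['yoo.ee.IOU', 'AI.I oey', 'aeio', 'aeiouaeioua', 'hello']

# Inside [...], the period is a literal period, not a wildcard.
vowel_string_hits = [s for s in tests if re.search(pattern, s)]
print(vowel_string_hits)
```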

Regex in Python¶

`re` in Python¶

The re module is built into Python. It allows us to use regular expressions to find, extract, and replace strings.

re.search takes in a string regex and a string text and returns a match object describing the location and substring of the first match of regex in text, or None if there is no match.

re.findall takes in a string regex and a string text and returns a list of all matches of regex in text.

re.sub takes in a string regex, a string repl, and a string text, and replaces all matches of regex in text with repl.
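
The three functions can be compared side by side. This is a minimal sketch; the text and phone numbers are made up, and the pattern is the phone number regex from the start of the lecture.

```python
import re

text = 'my numbers are 858-534-2230 and 619-555-0123'
phone = r'[1-9][0-9]{2}-[0-9]{3}-[0-9]{4}'

# re.search: a match object for the FIRST match (or None).
first = re.search(phone, text)
print(first.group(), first.span())

# re.findall: a list of ALL matching substrings.
all_numbers = re.findall(phone, text)
print(all_numbers)

# re.sub: a new string with every match replaced; text itself is unchanged.
redacted = re.sub(phone, '<phone>', text)
print(redacted)
```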

Capturing groups¶

Surround a regex with ( and ) to define a capturing group within a pattern.
Capturing groups are useful for extracting relevant parts of a string.

Notice what happens if we remove the ( and )!

Earlier, we also saw that parentheses can be used to group parts of a regex together. When using re.findall, every group written with plain ( and ) is treated as a capturing group; to group without capturing, write (?:...) instead.
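
A minimal sketch of how parentheses change what re.findall returns. The email addresses are made up for illustration.

```python
import re

text = 'billy@ucsd.edu, king@ucsd.edu'

# With a capturing group, findall returns only the captured part.
with_group = re.findall(r'([A-Za-z0-9]+)@ucsd\.edu', text)

# Without the parentheses, findall returns each full match.
without_group = re.findall(r'[A-Za-z0-9]+@ucsd\.edu', text)

print(with_group)
print(without_group)

# Parentheses used only for grouping still capture...
grouping = re.findall(r'bi(ll)+y', 'billy billlly')

# ...unless the group is written as non-capturing with (?:...).
non_capturing = re.findall(r'bi(?:ll)+y', 'billy billlly')

print(grouping)
print(non_capturing)
```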

Example: Log parsing¶

Recall the log string from last lecture.

Let's use our new regex syntax (including capturing groups) to extract the day, month, year, and time from the log string s.
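
A minimal sketch of this extraction, assuming a log string in a common web-server format. Both the value of s and the loose pattern exp below are illustrative assumptions, not necessarily the exact ones used in lecture.

```python
import re

# Assumed log string (illustrative).
s = '132.249.20.188 - - [05/May/2022:14:26:15 -0800] "GET /my/home/ HTTP/1.1" 200 2585'

# Assumed loose pattern: six capturing groups separated by '/', ':',
# and a space, all inside square brackets.
exp = r'\[(.+)\/(.+)\/(.+):(.+):(.+):(.+) .+\]'

# With several capturing groups, findall returns a list of tuples.
parts = re.findall(exp, s)
print(parts)
```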

While the above regex works, it is not very specific: it also matches incorrectly formatted log strings.

The more specific, the better!¶

Be as specific in your pattern matching as possible – you don't want to match and extract strings that don't fit the pattern you care about.
- .* matches any run of characters, so it says nothing about format; use it sparingly.
A better date extraction regex:
```
\[(\d{2})\/([A-Z]{1}[a-z]{2})\/(\d{4}):(\d{2}):(\d{2}):(\d{2}) -\d{4}\]
```
- \d{2} matches any 2-digit number.
- [A-Z]{1} matches any single occurrence of any uppercase letter.
- [a-z]{2} matches any 2 consecutive occurrences of lowercase letters.
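- Remember, special characters such as [ and ] need to be escaped with \. The / is not special in Python's re, so \/ is optional there, but regex tools that delimit patterns with / require it.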

A benefit of new_exp over exp is that it doesn't capture anything when the string doesn't follow the format we specified.
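
The sketch below illustrates that benefit. The names exp and new_exp follow the text, but the definition of exp, the log string s, and the badly formatted string bad are all assumptions made for illustration.

```python
import re

# Assumed log string and a made-up, badly formatted one.
s = '132.249.20.188 - - [05/May/2022:14:26:15 -0800] "GET /my/home/ HTTP/1.1" 200 2585'
bad = '[adr/jduy/wffsdffs:r4s4:4wsgdfd:asdf 7]'

# Assumed loose pattern: anything between the separators is accepted.
exp = r'\[(.+)\/(.+)\/(.+):(.+):(.+):(.+) .+\]'

# The more specific pattern shown above.
new_exp = r'\[(\d{2})\/([A-Z]{1}[a-z]{2})\/(\d{4}):(\d{2}):(\d{2}):(\d{2}) -\d{4}\]'

loose_on_bad = re.findall(exp, bad)       # extracts nonsense
strict_on_bad = re.findall(new_exp, bad)  # extracts nothing
strict_on_good = re.findall(new_exp, s)   # extracts the date parts

print(loose_on_bad)
print(strict_on_bad)
print(strict_on_good)
```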

Summary, next time¶

Limitations of regexes¶

Writing a regular expression is like writing a program.

You need to know the syntax well.
They can be easier to write than to read.
They can be difficult to debug.

Regular expressions are terrible at certain types of problems. Examples:

Anything involving counting (same number of instances of a and b).
Anything involving complex structure (palindromes).
Parsing highly complex text structure (HTML, for instance).
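
For the first two kinds of problems, ordinary Python string operations are often simpler than any regex. A minimal sketch; the helper names are hypothetical.

```python
def same_count(s, x='a', y='b'):
    """Return True if s has the same number of x's and y's."""
    return s.count(x) == s.count(y)

def is_palindrome(s):
    """Return True if s reads the same forwards and backwards."""
    return s == s[::-1]

print(same_count('aabb'), same_count('aab'))
print(is_palindrome('racecar'), is_palindrome('regex'))
```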

Below is a regular expression that validates email addresses in Perl. See this article for more details.

StackOverflow crashed due to regex! See this article for the details.

Summary¶

Regular expressions are used to match and extract patterns from text.
You don't need to force yourself to "memorize" regex syntax – refer to the resources in the Agenda section of the lecture and on the Resources tab of the course website.
Also refer to the three tables of syntax in the lecture.
Note: You don't always have to use regular expressions! If Python/pandas string methods work for your task, you can still use those.
Play Regex Golf to practice! 🏌️
Next time: Using regular expressions in pandas (through .str). Describing text data quantitatively.
