Lecture 17 – Regular Expressions

DSC 80, Spring 2022

Announcements

Agenda

Lots and lots of regular expressions! Good resources:

See dsc80.com/resources/#regular-expressions.

Regex fundamentals

Regular expressions

Regex building blocks 🧱

The four main building blocks for all regexes are shown below (table source, inspiration).

operation                order of op.   example      matches ✅           does not match ❌
concatenation            3              AABAAB       'AABAAB'             every other string
or                       4              AA|BAAB      'AA', 'BAAB'         every other string
closure (zero or more)   2              AB*A         'AA', 'ABBBBBBA'     'AB', 'ABABA'
parentheses              1              A(A|B)AAB    'AAAAB', 'ABAAB'     every other string
                                        (AB)*A       'A', 'ABABABABA'     'AA', 'ABBA'

Note that |, (, ), and * are special characters, not literals. They manipulate the characters around them.

Example: AB*A matches strings with an 'A', followed by zero or more 'B's, and then an 'A'.

Matches ✅: 'AA', 'ABA', 'ABBBBBBBBBBBBBBA'
Does not match ❌: 'AB', 'ABAB'

Example: (AB)*A matches strings with zero or more 'AB's, followed by an 'A'.

Matches ✅: 'A', 'ABA', 'ABABABABA'
Does not match ❌: 'AA', 'ABBBBBBBA', 'ABAB'
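
The two examples above can be checked in Python. This is a minimal sketch that assumes "matches" means the whole string matches the pattern, which is what re.fullmatch tests; the helper name fully_matches is made up for illustration.

```python
import re

def fully_matches(pattern, strings):
    """Return the strings that the pattern matches in their entirety."""
    return [s for s in strings if re.fullmatch(pattern, s) is not None]

candidates = ['AA', 'ABA', 'ABBBBBBA', 'AB', 'ABAB', 'A', 'ABABABABA']

# AB*A: an 'A', then zero or more 'B's, then an 'A'.
closure_hits = fully_matches(r'AB*A', candidates)

# (AB)*A: zero or more copies of 'AB', then an 'A'.
group_hits = fully_matches(r'(AB)*A', candidates)

print(closure_hits)  # ['AA', 'ABA', 'ABBBBBBA']
print(group_hits)    # ['ABA', 'A', 'ABABABABA']
```

Note that 'ABA' matches both patterns, while 'AB' and 'ABAB' match neither.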

Example 1

Write a regular expression that matches 'billy', 'billlly', 'billlllly', etc.



✅ Answer (try it yourself first at regex101.com): bi(ll)*y matches any even number of 'l's, including zero. To match only a positive even number of 'l's, we first need to "fix into place" two 'l's, and then follow them with zero or more pairs of 'l's. This gives the regular expression bill(ll)*y.
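
To see the difference between bi(ll)*y and bill(ll)*y, the sketch below tries both on strings with 0 through 6 'l's. The helper l_run is hypothetical and only builds the test strings; full-string matching via fullmatch is assumed.

```python
import re

def l_run(n):
    # Hypothetical helper: 'bi', then n copies of 'l', then 'y'.
    return 'bi' + 'l' * n + 'y'

loose = re.compile(r'bi(ll)*y')     # any even number of 'l's, including zero
strict = re.compile(r'bill(ll)*y')  # a positive even number of 'l's

# Record which counts of 'l' each pattern accepts.
loose_counts = [n for n in range(7) if loose.fullmatch(l_run(n))]
strict_counts = [n for n in range(7) if strict.fullmatch(l_run(n))]

print(loose_counts)   # [0, 2, 4, 6]
print(strict_counts)  # [2, 4, 6]
```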

Example 2

Write a regular expression that matches 'billy', 'billlly', 'biggy', 'biggggy', etc.

Specifically, it should match any string with a positive even number of 'l's in the middle, or a positive even number of 'g's in the middle.


✅ Answer (try it yourself first at regex101.com): possible answers are bi(ll(ll)*|gg(gg)*)y and bill(ll)*y|bigg(gg)*y.
Note that bill(ll)*|gg(gg)*y is not a valid answer! This is because "concatenation" comes before "or" in the order of operations. That regular expression matches strings that match bill(ll)*, like 'billll', OR strings that match gg(gg)*y, like 'ggy'.
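
The order-of-operations pitfall is easy to verify. The sketch below compares one valid answer with the invalid one on four test strings, again assuming full-string matching; the variable names are illustrative only.

```python
import re

correct = r'bi(ll(ll)*|gg(gg)*)y'
wrong = r'bill(ll)*|gg(gg)*y'   # "or" splits the whole pattern in two

tests = [
    'billy',                    # two 'l's
    'bi' + 'g' * 4 + 'y',       # four 'g's
    'bi' + 'l' * 4,             # four 'l's, no trailing 'y'
    'ggy',                      # no leading 'bi'
]

correct_hits = [s for s in tests if re.fullmatch(correct, s)]
wrong_hits = [s for s in tests if re.fullmatch(wrong, s)]

print(correct_hits)
print(wrong_hits)
```

The valid pattern accepts only the first two strings, while the invalid one accepts only the last two.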

More regex syntax

operation                     example           matches ✅                 does not match ❌
wildcard                      .U.U.U.           'CUMULUS', 'JUGULUM'       'SUCCUBUS', 'TUMULTUOUS'
character class               [A-Za-z][a-z]*    'word', 'Capitalized'      'camelCase', '4illegal'
at least one                  bi(ll)+y          'billy', 'billlllly'       'biy', 'bily'
between a and b occurrences   m[aeiou]{1,2}m    'mem', 'maam', 'miem'      'mm', 'mooom', 'meme'

., [, ], +, {, and } are also special characters, in addition to |, (, ), and *.

Example: [A-E]+ is just shorthand for (A|B|C|D|E)(A|B|C|D|E)*.
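
The new syntax can be tried out directly. This sketch pairs each pattern from the table with one matching and one non-matching string, and then checks that [A-E]+ and its long form agree on a handful of sample strings (the samples are made up, and agreement on them is a spot check, not a proof). Full-string matching is assumed.

```python
import re

# pattern -> [a string that should match, a string that should not]
examples = {
    r'.U.U.U.': ['CUMULUS', 'SUCCUBUS'],
    r'[A-Za-z][a-z]*': ['Capitalized', 'camelCase'],
    r'bi(ll)+y': ['billy', 'biy'],
    r'm[aeiou]{1,2}m': ['maam', 'mooom'],
}
results = {
    pattern: [bool(re.fullmatch(pattern, s)) for s in strings]
    for pattern, strings in examples.items()
}
print(results)  # every value is [True, False]

# Spot check: the character class and the explicit alternation agree.
short_form = r'[A-E]+'
long_form = r'(A|B|C|D|E)(A|B|C|D|E)*'
samples = ['A', 'ABE', 'EEDCBA', '', 'ABF', 'abc']
agree = all(
    bool(re.fullmatch(short_form, s)) == bool(re.fullmatch(long_form, s))
    for s in samples
)
print(agree)  # True
```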

Example 3

Write a regular expression that matches any lowercase string that has a repeated vowel, such as 'noon', 'peel', 'festoon', or 'zeebraa'.


✅ Answer (try it yourself first at regex101.com): one answer is [a-z]*(aa|ee|ii|oo|uu)[a-z]*
This regular expression matches strings of lowercase characters that have 'aa', 'ee', 'ii', 'oo', or 'uu' anywhere in them. [a-z]* means "zero or more lowercase letters"; essentially, it doesn't matter which letters come before or after the double vowel, as long as a double vowel appears somewhere.
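
A quick check of this answer, assuming full-string matching. The last three test strings are made up to show what the pattern rejects: a digit, no double vowel, and uppercase letters.

```python
import re

pattern = r'[a-z]*(aa|ee|ii|oo|uu)[a-z]*'
words = ['noon', 'peel', 'festoon', 'zeebraa', 'moon5', 'bat', 'NOON']

# Keep only the words that the pattern matches from start to end.
double_vowel_words = [w for w in words if re.fullmatch(pattern, w)]
print(double_vowel_words)  # ['noon', 'peel', 'festoon', 'zeebraa']
```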

Example 4

Write a regular expression that matches any string that contains both a lowercase letter and a number, in any order. Examples include 'billy80', '80!!billy', and 'bil8ly0'.


✅ Answer (try it yourself first at regex101.com): one answer is (.*[a-z].*[0-9].*)|(.*[0-9].*[a-z].*)
We can break the above regex into two parts: everything before the `|`, and everything after the `|`. The first part, .*[a-z].*[0-9].*, matches strings in which there is at least one lowercase character and at least one digit, with a lowercase character coming before a digit. The second part, .*[0-9].*[a-z].*, matches strings in which there is at least one lowercase character and at least one digit, with a digit coming before a lowercase character. Note that the .* between the digit and letter classes is needed because other characters, such as punctuation, may appear between the letter and the digit (as in '80!!billy').
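
The sketch below runs this answer on the three examples from the prompt plus three made-up strings that lack either a lowercase letter or a digit. Full-string matching is assumed.

```python
import re

pattern = r'(.*[a-z].*[0-9].*)|(.*[0-9].*[a-z].*)'
tests = ['billy80', '80!!billy', 'bil8ly0', 'billy', '8080', 'BILLY80']

# True where the string has both a lowercase letter and a digit.
has_both = [bool(re.fullmatch(pattern, s)) for s in tests]
print(has_both)  # [True, True, True, False, False, False]
```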

Email addresses

Suppose we want to write a regular expression that matches any string that is an '@ucsd.edu' email address.

Some issues:

Escaping special characters

Anchors ⚓️

Even more regex syntax

operation                     example      matches ✅                   does not match ❌
escape character              ucsd\.edu    'ucsd.edu'                   'ucsd!edu'
beginning of line             ^ark         'ark two', 'ark o ark'       'dark'
end of line                   ark$         'dark', 'ark o ark'          'ark two'
zero or one                   cat?         'ca', 'cat'                  'cart' (matches 'ca' only)
built-in character classes*   \w+          'billy'                      'this person'
                              \d+          '231231'                     '858 people'
character class negation      [^a-z]+      'KINGTRITON551', '1721$$'    'porch', 'billy.edu'
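
The rows of this table can be reproduced in Python. One assumption in this sketch: the anchor and "zero or one" rows are checked with re.search (a match anywhere in the string), while the other rows are checked with re.fullmatch (the whole string must match), since that appears to be how the table's examples are meant.

```python
import re

# Escaping: \. is a literal period; an unescaped . matches any character.
escaped = [bool(re.fullmatch(r'ucsd\.edu', s)) for s in ['ucsd.edu', 'ucsd!edu']]
unescaped = bool(re.fullmatch(r'ucsd.edu', 'ucsd!edu'))

# Anchors: ^ pins the match to the start, $ to the end.
starts = [bool(re.search(r'^ark', s)) for s in ['ark two', 'ark o ark', 'dark']]
ends = [bool(re.search(r'ark$', s)) for s in ['dark', 'ark o ark', 'ark two']]

# Zero or one: in 'cart', only 'ca' is matched because the 't' is optional.
optional_t = re.search(r'cat?', 'cart').group()

# Built-in classes: \w is a word character, \d is a digit.
word = [bool(re.fullmatch(r'\w+', s)) for s in ['billy', 'this person']]
digits = [bool(re.fullmatch(r'\d+', s)) for s in ['231231', '858 people']]

# Negation: [^a-z] is any character except a lowercase letter.
negated = [bool(re.fullmatch(r'[^a-z]+', s))
           for s in ['KINGTRITON551', '1721$$', 'porch', 'billy.edu']]

print(escaped, unescaped)  # [True, False] True
print(starts, ends)        # [True, True, False] [True, True, False]
print(optional_t)          # ca
print(word, digits)        # [True, False] [True, False]
print(negated)             # [True, True, False, False]
```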

Note:

Example 5

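Write a regular expression that matches any string that is 5 to 10 characters long and consists only of vowels (lowercase or uppercase, including 'y' and 'Y'), periods, and spaces.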

Examples include 'yoo.ee.IOU' and 'AI.I oey'.


✅ Answer (try it yourself first at regex101.com): one answer is ^[aeiouyAEIOUY. ]{5,10}$
Key idea: Within a character class (i.e. [...]), special characters do not generally need to be escaped.
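
The sketch below tests this answer with re.search, relying on the ^ and $ anchors to cover the whole string. The last four test strings are made up: one is too short, one is too long, one contains consonants, and one contains '!', which shows that the . inside the brackets is a literal period rather than a wildcard.

```python
import re

pattern = r'^[aeiouyAEIOUY. ]{5,10}$'
tests = [
    'yoo.ee.IOU',      # 10 allowed characters
    'AI.I oey',        # 8 allowed characters
    'aeio',            # too short (4)
    'aeiou aeiou y',   # too long (13)
    'hello',           # contains consonants
    'aeiou!',          # '!' is not in the class
]

# True where the string is 5-10 characters from the allowed set.
valid = [bool(re.search(pattern, s)) for s in tests]
print(valid)  # [True, True, False, False, False, False]
```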

Regex in Python

re in Python

The re module is part of Python's standard library. It allows us to use regular expressions to find, extract, and replace strings.

re.search takes in a string regex and a string text and returns a match object describing the location and substring of the first match of regex in text, or None if there is no match.
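
A minimal sketch of re.search on made-up text. The result is a match object whose group() and span() methods give the matched substring and its location; when nothing matches, the result is None.

```python
import re

text = 'billy80 and sally90'  # made-up example text

first = re.search(r'\d+', text)
first_text = first.group()   # the matched substring
first_span = first.span()    # (start, end) indices into text

missing = re.search(r'\d+', 'no digits here')

print(first_text, first_span)  # 80 (5, 7)
print(missing)                 # None
```

Since the result may be None, it is usually worth checking for that before calling group().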

re.findall takes in a string regex and a string text and returns a list of all matches of regex in text.
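
A minimal sketch of re.findall on made-up text. It returns every non-overlapping match, in order, and an empty list when there are none.

```python
import re

text = 'my numbers are 858, 231, and 80'  # made-up example text

all_numbers = re.findall(r'\d+', text)
no_numbers = re.findall(r'\d+', 'no digits here')

print(all_numbers)  # ['858', '231', '80']
print(no_numbers)   # []
```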

re.sub takes in a string regex, a string repl, and a string text, and replaces all matches of regex in text with repl.
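
A minimal sketch of re.sub on made-up text. The argument order is pattern, replacement, then the text to search; the original string is left unchanged and a new string is returned.

```python
import re

text = 'call 858-555-0100 or 619-555-0199'  # made-up example text

# Replace every digit with '#'.
masked = re.sub(r'\d', '#', text)
print(masked)  # call ###-###-#### or ###-###-####
```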

Capturing groups

Example: Log parsing

Recall the log string from last lecture.

Let's use our new regex syntax (including capturing groups) to extract the day, month, year, and time from the log string s.
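
Here is a sketch of what such an extraction could look like. Both the log string s and the pattern loose_exp below are stand-ins invented for illustration, not necessarily the ones used in lecture. Each pair of parentheses is a capturing group, and groups() returns the text captured by each group.

```python
import re

# Hypothetical log line in a common web-server log style.
s = '132.249.20.188 - - [05/May/2022:14:26:15 -0800] "GET /my/home/ HTTP/1.1" 200 2585'

# Six capturing groups: day, month, year, hour, minute, second.
loose_exp = r'\[(.+)/(.+)/(.+):(.+):(.+):(.+) .+\]'

match = re.search(loose_exp, s)
day, month, year, hour, minute, second = match.groups()
print(day, month, year)      # 05 May 2022
print(hour, minute, second)  # 14 26 15

# With several groups, re.findall returns one tuple per match.
all_captures = re.findall(loose_exp, s)
print(all_captures)
```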

While the above regex works, it is not very specific: it also matches incorrectly formatted log strings.

The more specific, the better!

A benefit of new_exp over exp is that it doesn't capture anything when the string doesn't follow the format we specified.
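
The sketch below illustrates this benefit with two hypothetical patterns, loose_exp and strict_exp, which play the roles of exp and new_exp but are assumptions rather than the lecture's exact expressions. The well-formed and malformed strings are also made up.

```python
import re

good = '[05/May/2022:14:26:15 -0800]'             # hypothetical, well formed
bad = '[adr/jduy/wffsdffs:r4s4:4wsgdfd:asdf 7]'   # hypothetical, malformed

# Loose: any characters between the separators.
loose_exp = r'\[(.+)/(.+)/(.+):(.+):(.+):(.+) .+\]'

# Strict: exact digit counts, a capitalized three-letter month, and an offset.
strict_exp = r'\[(\d{2})/([A-Z][a-z]{2})/(\d{4}):(\d{2}):(\d{2}):(\d{2}) -\d{4}\]'

loose_on_bad = re.findall(loose_exp, bad)
strict_on_bad = re.findall(strict_exp, bad)
strict_on_good = re.findall(strict_exp, good)

print(loose_on_bad)    # captures nonsense from the malformed string
print(strict_on_bad)   # []
print(strict_on_good)  # [('05', 'May', '2022', '14', '26', '15')]
```

The strict pattern only accepts offsets that start with a minus sign, which is a simplification made for this sketch.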

Summary, next time

Limitations of regexes

Writing a regular expression is like writing a program.

Regular expressions are terrible at certain types of problems. Examples:

Below is a regular expression that validates email addresses in Perl. See this article for more details.

StackOverflow crashed due to regex! See this article for the details.

Summary