Lots and lots of regular expressions! Good resources: the re library documentation and how-to.
[1-9][0-9]{2}-[0-9]{3}-[0-9]{4} matches US phone numbers of the form 'XXX-XXX-XXXX'.
The four main building blocks for all regexes are shown below (table source, inspiration).
operation | order of op. | example | matches ✅ | does not match ❌ |
---|---|---|---|---|
concatenation | 3 | AABAAB | 'AABAAB' | every other string |
or | 4 | AA|BAAB | 'AA', 'BAAB' | every other string |
closure (zero or more) | 2 | AB*A | 'AA', 'ABBBBBBA' | 'AB', 'ABABA' |
parentheses | 1 | A(A|B)AAB | 'AAAAB', 'ABAAB' | every other string |
parentheses | 1 | (AB)*A | 'A', 'ABABABABA' | 'AA', 'ABBA' |
Note that |, (, ), and * are special characters, not literals. They manipulate the characters around them.
Example: AB*A matches strings with an 'A', followed by zero or more 'B's, and then an 'A'.
✅ 'AA', 'ABA', 'ABBBBBBBBBBBBBBA'
❌ 'AB', 'ABAB'
Example: (AB)*A matches strings with zero or more 'AB's, followed by an 'A'.
✅ 'A', 'ABA', 'ABABABABA'
❌ 'AA', 'ABBBBBBBA', 'ABAB'
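The two examples above can be checked with Python's built-in re module. This is a minimal sketch, assuming re.fullmatch (which succeeds only when the entire string matches the pattern); the helper name matches is purely illustrative.

```python
import re

def matches(pattern, text):
    """Return True if the whole string `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None

# AB*A: an 'A', zero or more 'B's, then an 'A'.
closure_results = {t: matches('AB*A', t) for t in ['AA', 'ABA', 'AB', 'ABAB']}

# (AB)*A: zero or more copies of 'AB', then an 'A'.
group_results = {t: matches('(AB)*A', t) for t in ['A', 'ABA', 'AA', 'ABAB']}

print(closure_results)
print(group_results)
```

Moving the parentheses changes what * applies to: a single 'B' in the first pattern, the whole pair 'AB' in the second.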
Write a regular expression that matches 'billy', 'billlly', 'billlllly', etc.
First, consider matching any even number of 'l's, including zero 'l's (i.e. 'biy').
Then, restrict the match to a positive even number of 'l's.
bi(ll)*y will match any even number of 'l's, including 0.
To match only a positive even number of 'l's, we'd need to first "fix into place" two 'l's, and then follow that up with zero or more pairs of 'l's. This specifies the regular expression bill(ll)*y.
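To see the difference between the two patterns, the sketch below builds strings with 0 through 4 'l's and tests each against both regexes. It assumes re.fullmatch for whole-string matching; the variable names are illustrative.

```python
import re

# 'biy', 'bily', 'billy', ... with 0 through 4 'l's in the middle.
candidates = ['bi' + 'l' * n + 'y' for n in range(5)]

# bi(ll)*y allows zero pairs of 'l's; bill(ll)*y requires at least one pair.
any_even = [s for s in candidates if re.fullmatch(r'bi(ll)*y', s)]
positive_even = [s for s in candidates if re.fullmatch(r'bill(ll)*y', s)]

print(any_even)
print(positive_even)
```

Only the first pattern accepts 'biy'; both reject every string with an odd number of 'l's.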
Write a regular expression that matches 'billy', 'billlly', 'biggy', 'biggggy', etc.
Specifically, it should match any string with a positive even number of 'l's in the middle, or a positive even number of 'g's in the middle.
bi(ll(ll)*|gg(gg)*)y or bill(ll)*y|bigg(gg)*y.
bill(ll)*|gg(gg)*y is not a valid answer! This is because "concatenation" comes before "or" in the order of operations. This regular expression would match strings that match bill(ll)*, like 'billll', OR strings that match gg(gg)*y, like 'ggy'.
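The precedence pitfall can be demonstrated directly. A small sketch, assuming re.fullmatch for whole-string matching; the names wrong and right are just labels for the two patterns discussed above.

```python
import re

wrong = r'bill(ll)*|gg(gg)*y'      # parsed as (bill(ll)*) OR (gg(gg)*y)
right = r'bi(ll(ll)*|gg(gg)*)y'    # parentheses limit the scope of the "or"

tests = ['billy', 'biggy', 'bill', 'ggy']
wrong_hits = [t for t in tests if re.fullmatch(wrong, t)]
right_hits = [t for t in tests if re.fullmatch(right, t)]

print(wrong_hits)
print(right_hits)
```

The incorrect pattern accepts 'bill' and 'ggy' but neither of the intended strings.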
operation | example | matches ✅ | does not match ❌ |
---|---|---|---|
wildcard | .U.U.U. | 'CUMULUS', 'JUGULUM' | 'SUCCUBUS', 'TUMULTUOUS' |
character class | [A-Za-z][a-z]* | 'word', 'Capitalized' | 'camelCase', '4illegal' |
at least one | bi(ll)+y | 'billy', 'billlllly' | 'biy', 'bily' |
between a and b occurrences | m[aeiou]{1,2}m | 'mem', 'maam', 'miem' | 'mm', 'mooom', 'meme' |
., [, ], +, {, and } are also special characters, in addition to |, (, ), and *.
Example: [A-E]+ is just shortform for (A|B|C|D|E)(A|B|C|D|E)*.
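One way to convince yourself of this equivalence is a brute-force check over short strings. The sketch below is only a spot check, not a proof: it compares both patterns on every string of length 0 to 3 over the letters 'A' through 'F' (where 'F' is deliberately outside the class).

```python
import re
from itertools import product

short_form = r'[A-E]+'
long_form = r'(A|B|C|D|E)(A|B|C|D|E)*'

# Every string of length 0 to 3 over 'A'..'F', including the empty string.
strings = [''.join(p) for n in range(4) for p in product('ABCDEF', repeat=n)]

# The two patterns should accept and reject exactly the same strings.
agree = all(
    bool(re.fullmatch(short_form, s)) == bool(re.fullmatch(long_form, s))
    for s in strings
)
print(agree)
```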
Write a regular expression that matches any lowercase string that has a repeated vowel, such as 'noon', 'peel', 'festoon', or 'zeebraa'.
[a-z]*(aa|ee|ii|oo|uu)[a-z]*
This regular expression matches strings of lowercase characters that have 'aa', 'ee', 'ii', 'oo', or 'uu' in them anywhere. [a-z]* means "zero or more of any lowercase characters"; essentially we are saying it doesn't matter what letters come before or after the double vowels, as long as the double vowels exist somewhere.
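A short check of this pattern, assuming re.fullmatch so that the whole string must consist of lowercase letters. The extra test words ('nun', 'Noon', 'audio') are illustrative additions, chosen to show cases that should fail.

```python
import re

pattern = r'[a-z]*(aa|ee|ii|oo|uu)[a-z]*'
words = ['noon', 'peel', 'festoon', 'zeebraa', 'nun', 'Noon', 'audio']

# 'nun' and 'audio' have no doubled vowel; 'Noon' contains an uppercase letter.
matched = [w for w in words if re.fullmatch(pattern, w)]
print(matched)
```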
Write a regular expression that matches any string that contains both a lowercase letter and a number, in any order. Examples include 'billy80', '80!!billy', and 'bil8ly0'.
(.*[a-z].*[0-9].*)|(.*[0-9].*[a-z].*)
The first part, .*[a-z].*[0-9].*, matches strings in which there is at least one lowercase character and at least one digit, with the lowercase character coming first.
The second part, .*[0-9].*[a-z].*, matches strings in which there is at least one lowercase character and at least one digit, with the digit coming first.
Note, the .* between the digit and letter classes is needed in the event the string has non-digit and non-letter characters.
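The sketch below tries this pattern on the three examples plus a few strings that should fail. It assumes re.fullmatch; the failing inputs are illustrative additions.

```python
import re

pattern = r'(.*[a-z].*[0-9].*)|(.*[0-9].*[a-z].*)'
tests = ['billy80', '80!!billy', 'bil8ly0', 'billy', '8080', 'BILLY80']

# A string passes only if it has at least one lowercase letter AND one digit.
results = {t: bool(re.fullmatch(pattern, t)) for t in tests}
print(results)
```

'BILLY80' fails because [a-z] only matches lowercase letters.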
Suppose we want to write a regular expression that matches any string that is an '@ucsd.edu' email address.
Some issues:
. is the wildcard special character. How do we match the literal '.'?
How do we ensure that the entire string is an '@ucsd.edu' email address, and doesn't contain any other characters?
To match a special character (e.g. . or *) as a literal, place a \ right before it to escape it.
For instance, the regular expression [A-Za-z0-9]+@ucsd\.edu matches '@ucsd.edu' email addresses (assuming email addresses can only contain letters and numbers).
Note the \., which matches the . literal.
Use ^ at the start of a regex to require that the match string is at the start of the line.
Use $ at the end of a regex to require that the match string is at the end of the line.
[A-Za-z0-9]+@ucsd\.edu will match the valid UCSD email in any string.
^[A-Za-z0-9]+@ucsd\.edu$ will only match the valid UCSD email in a string if there is nothing else in the string.
operation | example | matches ✅ | does not match ❌ |
---|---|---|---|
escape character | ucsd\.edu | 'ucsd.edu' | 'ucsd!edu' |
beginning of line | ^ark | 'ark two', 'ark o ark' | 'dark' |
end of line | ark$ | 'dark', 'ark o ark' | 'ark two' |
zero or one | cat? | 'ca', 'cat' | 'cart' (matches 'ca' only) |
built-in character classes* | \w+, \d+ | 'billy', '231231' | 'this person', '858 people' |
character class negation | [^a-z]+ | 'KINGTRITON551', '1721$$' | 'porch', 'billy.edu' |
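The effect of the anchors and of escaping the period can be seen with re.search, which looks for a match anywhere in the string. This is a minimal sketch; the sample strings are made up for illustration.

```python
import re

unanchored = r'[A-Za-z0-9]+@ucsd\.edu'
anchored = r'^[A-Za-z0-9]+@ucsd\.edu$'
unescaped = r'^[A-Za-z0-9]+@ucsd.edu$'   # '.' left as a wildcard by mistake

text = 'contact: billy@ucsd.edu (preferred)'

# Without anchors, the email is found inside a longer string.
found_inside = re.search(unanchored, text).group()

# With anchors, the whole string must be the email address.
anchored_on_text = re.search(anchored, text)
anchored_on_email = re.search(anchored, 'billy@ucsd.edu')

# The unescaped '.' accepts any character in that position.
unescaped_hit = re.search(unescaped, 'billy@ucsdxedu')
escaped_miss = re.search(anchored, 'billy@ucsdxedu')

print(found_inside, anchored_on_text, anchored_on_email)
```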
*Note: \d refers to digits, \w refers to alphanumeric characters and underscores ([A-Za-z0-9_]), and \s refers to whitespace.
Write a regular expression that matches any string that:
has 5-10 characters, and
is made up of only vowels (either uppercase or lowercase, including 'Y' and 'y'), periods, and spaces.
Examples include 'yoo.ee.IOU' and 'AI.I oey'.
^[aeiouyAEIOUY. ]{5,10}$
Within a character class ([...]), special characters do not generally need to be escaped.
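A quick check of this answer, including the point about the unescaped period inside the character class. The test strings beyond the two given examples are illustrative additions.

```python
import re

pattern = r'^[aeiouyAEIOUY. ]{5,10}$'
tests = ['yoo.ee.IOU', 'AI.I oey', 'a.e.i', 'aei', 'aeiouaeioua', 'aeiox']

# 'aei' is too short, 'aeiouaeioua' is too long (11 characters), and
# 'aeiox' fails because '.' inside [...] is a literal period, not a wildcard.
results = {t: bool(re.search(pattern, t)) for t in tests}
print(results)
```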
re in Python
The re package is built into Python. It allows us to use regular expressions to find, extract, and replace strings.
import re
re.search takes in a string regex and a string text and returns a Match object describing the location and substring of the first match of regex in text (or None if there is no match).
re.search('AB*A', 'here is a string for you: ABBBA. here is another: ABBBBBBBA')
<re.Match object; span=(26, 31), match='ABBBA'>
re.findall takes in a string regex and a string text and returns a list of all matches of regex in text.
re.findall('AB*A', 'here is a string for you: ABBBA. here is another: ABBBBBBBA')
['ABBBA', 'ABBBBBBBA']
re.sub takes in a string regex, a string repl, and a string text, and replaces all matches of regex in text with repl.
re.sub('AB*A', 'billy', 'here is a string for you: ABBBA. here is another: ABBBBBBBA')
'here is a string for you: billy. here is another: billy'
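The object returned by re.search can be unpacked with its group and span methods. Because re.search returns None when nothing matches, it is worth guarding against that case. A small sketch using the same string as above; the second input is a made-up non-matching example.

```python
import re

text = 'here is a string for you: ABBBA. here is another: ABBBBBBBA'

match = re.search('AB*A', text)
first = match.group()        # the matched substring
start, end = match.span()    # where it sits in `text`

# re.search returns None when there is no match, so check before using it.
no_match = re.search('AB*A', 'no match here')
fallback = no_match.group() if no_match else None

print(first, (start, end), text[start:end], fallback)
```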
Use ( and ) to define a capturing group within a pattern.
re.findall(r'\w+@(\w+)\.edu', 'my old email was billy@notucsd.edu, my new email is notbilly@ucsd.edu')
['notucsd', 'ucsd']
Notice what happens if we remove the ( and )!
re.findall(r'\w+@\w+\.edu', 'my old email was billy@notucsd.edu, my new email is notbilly@ucsd.edu')
['billy@notucsd.edu', 'notbilly@ucsd.edu']
In re.findall, all groups are treated as capturing groups.
# A regex that matches strings with two of the same vowel followed by 3 digits
# We only want to capture the digits, but...
re.findall(r'(aa|ee|ii|oo|uu)(\d{3})', 'eeoo124')
[('oo', '124')]
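If only the digits should be captured, one option not covered above is a non-capturing group, written (?:...) in Python's re syntax. It groups for precedence without adding an entry to the findall result. A minimal sketch on the same input:

```python
import re

text = 'eeoo124'

# Both groups capture, so findall returns tuples.
both = re.findall(r'(aa|ee|ii|oo|uu)(\d{3})', text)

# (?:...) groups the vowels for the "or" but does not capture them.
digits_only = re.findall(r'(?:aa|ee|ii|oo|uu)(\d{3})', text)

print(both)
print(digits_only)
```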
Recall the log string from last lecture.
s = '''132.249.20.188 - - [05/May/2022:14:26:15 -0800] "GET /my/home/ HTTP/1.1" 200 2585'''
Let's use our new regex syntax (including capturing groups) to extract the day, month, year, and time from the log string s.
exp = r'\[(.+)\/(.+)\/(.+):(.+):(.+):(.+) .+\]'
re.findall(exp, s)
[('05', 'May', '2022', '14', '26', '15')]
While the above regex works, it is not very specific: it also matches incorrectly formatted log strings.
other_s = '[adr/jduy/wffsdffs:r4s4:4wsgdfd:asdf 7]'
re.findall(exp, other_s)
[('adr', 'jduy', 'wffsdffs', 'r4s4', '4wsgdfd', 'asdf')]
Be as specific in your pattern matching as possible – you don't want to match and extract strings that don't fit the pattern you care about.
.* matches every possible string, but we don't use it very often.
A better date extraction regex:
\[(\d{2})\/([A-Z]{1}[a-z]{2})\/(\d{4}):(\d{2}):(\d{2}):(\d{2}) -\d{4}\]
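\d{2} matches any two consecutive digits.
[A-Z]{1} matches any single occurrence of any uppercase letter.
[a-z]{2} matches any 2 consecutive occurrences of lowercase letters.
Certain special characters ([, ]) need to be escaped with \. The / is escaped here as well, although Python's re does not require it.
s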
'132.249.20.188 - - [05/May/2022:14:26:15 -0800] "GET /my/home/ HTTP/1.1" 200 2585'
new_exp = r'\[(\d{2})\/([A-Z]{1}[a-z]{2})\/(\d{4}):(\d{2}):(\d{2}):(\d{2}) -\d{4}\]'
re.findall(new_exp, s)
[('05', 'May', '2022', '14', '26', '15')]
A benefit of new_exp over exp is that it doesn't capture anything when the string doesn't follow the format we specified.
other_s
'[adr/jduy/wffsdffs:r4s4:4wsgdfd:asdf 7]'
re.findall(new_exp, other_s)
[]
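As an optional extension, the same pattern can be written with named groups, (?P<name>...), so the extracted pieces come back labeled. This is a sketch under assumptions: the names LOG_TIMESTAMP and extract_timestamp and the group names are hypothetical, introduced here for illustration only.

```python
import re

# Same structure as new_exp, but with named groups (an addition for illustration).
LOG_TIMESTAMP = re.compile(
    r'\[(?P<day>\d{2})/(?P<month>[A-Z][a-z]{2})/(?P<year>\d{4})'
    r':(?P<hour>\d{2}):(?P<minute>\d{2}):(?P<second>\d{2}) -\d{4}\]'
)

def extract_timestamp(line):
    """Return a dict of timestamp parts, or None if no timestamp is found."""
    match = LOG_TIMESTAMP.search(line)
    return match.groupdict() if match else None

s = '''132.249.20.188 - - [05/May/2022:14:26:15 -0800] "GET /my/home/ HTTP/1.1" 200 2585'''
other_s = '[adr/jduy/wffsdffs:r4s4:4wsgdfd:asdf 7]'

print(extract_timestamp(s))
print(extract_timestamp(other_s))
```

Returning None for the malformed string makes the failure explicit, rather than silently producing an empty list.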
Writing a regular expression is like writing a program.
Regular expressions are terrible at certain types of problems. Examples:
Below is a regular expression that validates email addresses in Perl. See this article for more details.
StackOverflow crashed due to regex! See this article for the details.
If pandas string methods work for your task, you can still use those.
Regular expressions in pandas (through .str). Describing text data quantitatively.