`BeautifulSoup` objects are mutable! See this post on Ed by Trey for more details.
Lots and lots of regular expressions! Good resources: the `re` library documentation and how-to.
contact = '''
Thank you for buying our expensive product!
If you have a complaint, please send it to complaints@compuserve.com or call (800) 867-5309.
If you are happy with your purchase, please call us at (800) 123-4567; we'd love to hear from you!
Due to high demand, please allow one-hundred (100) business days for a response.
'''
Goal: extract every phone number of the form `'(###) ###-####'` from this text.
print(contact)
Thank you for buying our expensive product!
If you have a complaint, please send it to complaints@compuserve.com or call (800) 867-5309.
If you are happy with your purchase, please call us at (800) 123-4567; we'd love to hear from you!
Due to high demand, please allow one-hundred (100) business days for a response.
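A phone number in this format has two parts separated by a space: an area code, like `'(678)'`, and the last 7 digits, like `'999-8212'`. Let's first write a function that takes in a string and returns whether it looks like an area code.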
def is_possibly_area_code(s):
    '''Does `s` look like (678)?'''
    return len(s) == 5 and s.startswith('(') and s.endswith(')') and s[1:4].isnumeric()
is_possibly_area_code('(123)')
True
is_possibly_area_code('(99)')
False
Let's also write a function that takes in a string and returns whether it looks like the last 7 digits of a phone number.
def is_last_7_phone_number(s):
    '''Does `s` look like 999-8212?'''
    return len(s) == 8 and s[0:3].isnumeric() and s[3] == '-' and s[4:].isnumeric()
is_last_7_phone_number('999-8212')
True
is_last_7_phone_number('534 1100')
False
Finally, let's split the entire text by spaces, and check whether there are any instances where `pieces[i]` looks like an area code and `pieces[i+1]` looks like the last 7 digits of a phone number.
# Removes punctuation from the end of each string.
pieces = [s.rstrip('.,?;"\'') for s in contact.split()]
for i in range(len(pieces) - 1):
    if is_possibly_area_code(pieces[i]):
        if is_last_7_phone_number(pieces[i+1]):
            print(pieces[i], pieces[i+1])
(800) 867-5309
(800) 123-4567
import re
re.findall(r'\(\d{3}\) \d{3}-\d{4}', contact)
['(800) 867-5309', '(800) 123-4567']
The regular expression `\(\d{3}\) \d{3}-\d{4}` describes a pattern that matches US phone numbers of the form `'(XXX) XXX-XXXX'`.
In Python, regular expressions are used through the `re` module. We will see how to do so shortly.
In a regex, some characters, such as `.`, `*`, `(`, and `)`, are special characters.
The regex `hey` matches the string `'hey'`. The regex `he.` also matches the string `'hey'`.
The four main building blocks for all regexes are shown below (table source, inspiration).
operation | order of op. | example | matches ✅ | does not match ❌ |
---|---|---|---|---|
concatenation | 3 | AABAAB | 'AABAAB' | every other string |
or | 4 | AA|BAAB | 'AA', 'BAAB' | every other string |
closure (zero or more) | 2 | AB*A | 'AA', 'ABBBBBBA' | 'AB', 'ABABA' |
parentheses | 1 | A(A|B)AAB | 'AAAAB', 'ABAAB' | every other string |
parentheses | 1 | (AB)*A | 'A', 'ABABABABA' | 'AA', 'ABBA' |
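
As a quick check of the table above, here is a small sketch that tests each building block with `re.fullmatch`, which succeeds only when the pattern matches the entire string. The `full` helper and the `examples` list are names made up for this illustration, and the extra non-matching strings (`'AABAA'`, `'AABAAB'` for the "or" row, `'AAAB'`) are assumptions standing in for "every other string".

```python
import re

def full(pattern, s):
    """Return True if `pattern` matches the whole string `s` (illustrative helper)."""
    return re.fullmatch(pattern, s) is not None

# (pattern, strings that should match, strings that should not match)
examples = [
    (r'AABAAB',    ['AABAAB'],         ['AABAA']),        # concatenation
    (r'AA|BAAB',   ['AA', 'BAAB'],     ['AABAAB']),       # or
    (r'AB*A',      ['AA', 'ABBBBBBA'], ['AB', 'ABABA']),  # closure
    (r'A(A|B)AAB', ['AAAAB', 'ABAAB'], ['AAAB']),         # parentheses
    (r'(AB)*A',    ['A', 'ABABABABA'], ['AA', 'ABBA']),   # parentheses + closure
]

for pattern, good, bad in examples:
    assert all(full(pattern, s) for s in good)
    assert not any(full(pattern, s) for s in bad)
    print(f'{pattern!r} behaves as in the table')
```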
Note that `|`, `(`, `)`, and `*` are special characters, not literals. They manipulate the characters around them.
Example (or, parentheses):
What does `DSC 30|80` match?
What does `DSC (30|80)` match?
Example (closure, parentheses):
What does `blah*` match?
What does `(blah)*` match?
Write a regular expression that matches `'billy'`, `'billlly'`, `'billlllly'`, etc.
First, think about how to match strings with any even number of `'l'`s, including zero `'l'`s (i.e. `'biy'`).
Then, think about how to match only strings with a positive even number of `'l'`s.
`bi(ll)*y` will match any even number of `'l'`s, including 0.
To match only a positive even number of `'l'`s, we'd need to first "fix into place" two `'l'`s, and then follow that up with zero or more pairs of `'l'`s. This gives the regular expression `bill(ll)*y`.
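
One way to check the examples and the exercise above is to run each pattern against a few candidate strings. This is only a sketch: `full` is a hypothetical helper built on `re.fullmatch`, and the candidate strings are chosen for illustration.

```python
import re

def full(pattern, s):
    """Return True if `pattern` matches the whole string `s` (illustrative helper)."""
    return re.fullmatch(pattern, s) is not None

# Without parentheses, "or" splits the whole pattern: (DSC 30) or (80).
no_parens = [s for s in ['DSC 30', 'DSC 80', '80'] if full(r'DSC 30|80', s)]
# With parentheses, "or" only chooses between 30 and 80.
with_parens = [s for s in ['DSC 30', 'DSC 80', '80'] if full(r'DSC (30|80)', s)]

# Without parentheses, * applies only to the final 'h'.
star_on_h = [s for s in ['bla', 'blahhh', 'blahblah'] if full(r'blah*', s)]
# With parentheses, * repeats the whole group 'blah' (zero times included).
star_on_group = [s for s in ['', 'blahhh', 'blahblah'] if full(r'(blah)*', s)]

words = ['biy', 'bily', 'billy', 'billly', 'billlly']
any_even = [w for w in words if full(r'bi(ll)*y', w)]
positive_even = [w for w in words if full(r'bill(ll)*y', w)]

print(no_parens)       # ['DSC 30', '80']
print(with_parens)     # ['DSC 30', 'DSC 80']
print(star_on_h)       # ['bla', 'blahhh']
print(star_on_group)   # ['', 'blahblah']
print(any_even)        # ['biy', 'billy', 'billlly']
print(positive_even)   # ['billy', 'billlly']
```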
Write a regular expression that matches `'billy'`, `'billlly'`, `'biggy'`, `'biggggy'`, etc.
Specifically, it should match any string with a positive even number of `'l'`s in the middle, or a positive even number of `'g'`s in the middle.
Possible answers: `bi(ll(ll)*|gg(gg)*)y` or `bill(ll)*y|bigg(gg)*y`.
Note that `bill(ll)*|gg(gg)*y` is not a valid answer! This is because "concatenation" comes before "or" in the order of operations. This regular expression would match strings that match `bill(ll)*`, like `'billll'`, OR strings that match `gg(gg)*y`, like `'ggy'`.
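
To see the order-of-operations issue concretely, the sketch below compares the grouped answer with the invalid one on a handful of strings. The variable names and the test strings are illustrative choices, not part of the original exercise.

```python
import re

grouped = re.compile(r'bi(ll(ll)*|gg(gg)*)y')    # a valid answer
ungrouped = re.compile(r'bill(ll)*|gg(gg)*y')    # the invalid answer

tests = ['billy', 'biggggy', 'billll', 'ggy']
results = {
    s: (bool(grouped.fullmatch(s)), bool(ungrouped.fullmatch(s)))
    for s in tests
}

for s, (ok_grouped, ok_ungrouped) in results.items():
    print(f'{s!r}: grouped={ok_grouped}, ungrouped={ok_ungrouped}')
# 'billy':   grouped=True,  ungrouped=False
# 'biggggy': grouped=True,  ungrouped=False
# 'billll':  grouped=False, ungrouped=True
# 'ggy':     grouped=False, ungrouped=True
```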
operation | example | matches ✅ | does not match ❌ |
---|---|---|---|
wildcard | .U.U.U. | 'CUMULUS', 'JUGULUM' | 'SUCCUBUS', 'TUMULTUOUS' |
character class | [A-Za-z][a-z]* | 'word', 'Capitalized' | 'camelCase', '4illegal' |
at least one | bi(ll)+y | 'billy', 'billlllly' | 'biy', 'bily' |
between $i$ and $j$ occurrences | m[aeiou]{1,2}m | 'mem', 'maam', 'miem' | 'mm', 'mooom', 'meme' |
`.`, `[`, `]`, `+`, `{`, and `}` are also special characters, in addition to `|`, `(`, `)`, and `*`.
Example (character classes, at least one): `[A-E]+` is just shorthand for `(A|B|C|D|E)(A|B|C|D|E)*`.
Example (wildcard):
What does `.` match? What does `he.` match? What does `...` match?
Example (at least one, closure):
What does `123+` match? What does `123*` match?
Example (number of occurrences): What does `tri{3, 5}` match? Does it match `'triiiii'`?
Example (character classes, number of occurrences): What does `[1-6a-f]{3}-[7-9E-S]{2}` match?
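
The questions above can be explored with a few calls to `re.fullmatch`. This is a sketch with made-up candidate strings and a hypothetical `full` helper. One detail is worth hedging: in the CPython versions I am aware of, a space inside the braces (as in `tri{3, 5}`) stops `{...}` from being read as a repetition count, so the braces are treated as literal characters.

```python
import re

def full(pattern, s):
    """Return True if `pattern` matches the whole string `s` (illustrative helper)."""
    return re.fullmatch(pattern, s) is not None

# + needs at least one '3'; * also allows zero '3's.
plus = [s for s in ['12', '123', '12333'] if full(r'123+', s)]
star = [s for s in ['12', '123', '12333'] if full(r'123*', s)]

# {3,5} with no space: between 3 and 5 copies of 'i'.
braces = [s for s in ['trii', 'triii', 'triiiii', 'tritritri'] if full(r'tri{3,5}', s)]
# {3, 5} with a space: not treated as a repetition count.
spaced_matches = full(r'tri{3, 5}', 'triiiii')

# Three characters from 1-6 or a-f, a hyphen, then two from 7-9 or E-S.
mixed = [s for s in ['1af-9S', '3b2-7E', '7af-9S', '1af-9s']
         if full(r'[1-6a-f]{3}-[7-9E-S]{2}', s)]

print(plus)            # ['123', '12333']
print(star)            # ['12', '123', '12333']
print(braces)          # ['triii', 'triiiii']
print(spaced_matches)  # False
print(mixed)           # ['1af-9S', '3b2-7E']
```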
Write a regular expression that matches any lowercase string that has a repeated vowel, such as `'noon'`, `'peel'`, `'festoon'`, or `'zeebraa'`.
One answer: `[a-z]*(aa|ee|ii|oo|uu)[a-z]*`
This regular expression matches strings of lowercase characters that have `'aa'`, `'ee'`, `'ii'`, `'oo'`, or `'uu'` in them anywhere. `[a-z]*` means "zero or more of any lowercase characters"; essentially we are saying it doesn't matter what letters come before or after the double vowels, as long as the double vowels exist somewhere.
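
A short sketch that tries this pattern on the examples from the exercise, plus a few strings that should be rejected (`'billy'`, `'Noon'`, and `'no on'` are added here purely for illustration):

```python
import re

repeated_vowel = re.compile(r'[a-z]*(aa|ee|ii|oo|uu)[a-z]*')

words = ['noon', 'peel', 'festoon', 'zeebraa', 'billy', 'Noon', 'no on']
# fullmatch requires the whole string to be lowercase letters
# with a doubled vowel somewhere inside.
matches = [w for w in words if repeated_vowel.fullmatch(w)]

print(matches)  # ['noon', 'peel', 'festoon', 'zeebraa']
```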
Write a regular expression that matches any string that contains both a lowercase letter and a number, in any order. Examples include `'billy80'`, `'80!!billy'`, and `'bil8ly0'`.
One answer: `(.*[a-z].*[0-9].*)|(.*[0-9].*[a-z].*)`
The first part, `.*[a-z].*[0-9].*`, matches strings in which there is at least one lowercase character and at least one digit, with the lowercase character coming first.
The second part, `.*[0-9].*[a-z].*`, matches strings in which there is at least one lowercase character and at least one digit, with the digit coming first.
Note that the `.*` between the digit and letter classes is needed in case the string has characters that are neither digits nor letters.
This is the kind of task that would be easier to accomplish with regular Python string methods.
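
To illustrate that last remark, here is a sketch that puts the regex next to a plain-Python version. Both function names are hypothetical, and the plain version uses `string.ascii_lowercase` and `string.digits` so that it accepts the same characters as `[a-z]` and `[0-9]`.

```python
import re
import string

BOTH = re.compile(r'(.*[a-z].*[0-9].*)|(.*[0-9].*[a-z].*)')

def has_both_regex(s):
    """Regex version: a lowercase letter and a digit, in either order."""
    return BOTH.fullmatch(s) is not None

def has_both_plain(s):
    """Plain-Python version: check each requirement separately."""
    has_lower = any(c in string.ascii_lowercase for c in s)
    has_digit = any(c in string.digits for c in s)
    return has_lower and has_digit

tests = ['billy80', '80!!billy', 'bil8ly0', 'billy', '8080', 'BILLY80']
for s in tests:
    print(s, has_both_regex(s), has_both_plain(s))
```

On these inputs the two versions agree, and the plain version states the two requirements directly instead of encoding both orderings.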
operation | example | matches ✅ | does not match ❌ |
---|---|---|---|
escape character | ucsd\.edu | 'ucsd.edu' | 'ucsd!edu' |
beginning of line | ^ark | 'ark two', 'ark o ark' | 'dark' |
end of line | ark$ | 'dark', 'ark o ark' | 'ark two' |
zero or one | cat? | 'ca', 'cat' | 'cart' (matches 'ca' only) |
built-in character classes* | \w+, \d+ | 'billy', '231231' | 'this person', '858 people' |
character class negation | [^a-z]+ | 'KINGTRITON551', '1721$$' | 'porch', 'billy.edu' |
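
The rows of this table can be checked with a short sketch. The rows about anchors are about finding the pattern somewhere in a string, so they use `re.search`; the other rows use `re.fullmatch`. Both helpers (`found`, `full`) are names invented for this illustration.

```python
import re

def found(pattern, s):
    """True if `pattern` occurs somewhere in `s` (illustrative helper)."""
    return re.search(pattern, s) is not None

def full(pattern, s):
    """True if `pattern` matches all of `s` (illustrative helper)."""
    return re.fullmatch(pattern, s) is not None

# Escaping: an unescaped . is a wildcard, while \. is a literal period.
wildcard_hit = found(r'ucsd.edu', 'ucsd!edu')    # True
escaped_hit = found(r'ucsd\.edu', 'ucsd!edu')    # False

# Anchors: ^ ties the match to the start, $ to the end.
starts = [s for s in ['ark two', 'ark o ark', 'dark'] if found(r'^ark', s)]
ends = [s for s in ['dark', 'ark o ark', 'ark two'] if found(r'ark$', s)]

# Zero or one: on 'cart' the pattern only matches the prefix 'ca'.
partial = re.match(r'cat?', 'cart').group()

# Built-in classes and negation.
word_like = [s for s in ['billy', 'this person'] if full(r'\w+', s)]
digit_like = [s for s in ['231231', '858 people'] if full(r'\d+', s)]
no_lowercase = [s for s in ['KINGTRITON551', '1721$$', 'porch', 'billy.edu']
                if full(r'[^a-z]+', s)]

print(starts)        # ['ark two', 'ark o ark']
print(ends)          # ['dark', 'ark o ark']
print(partial)       # ca
print(no_lowercase)  # ['KINGTRITON551', '1721$$']
```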
*Note: in Python's implementation of regex, `\d` refers to digits, `\w` refers to alphanumeric characters and underscores (`[A-Za-z0-9_]`), `\s` refers to whitespace, and `\b` is a word boundary.
Example (escaping):
What does `he.` match? What does `he\.` match? What does `(858)` match? What does `\(858\)` match?
Example (anchors):
What does `858-534` match? What does `^858-534` match? What does `858-534$` match?
Example (built-in character classes):
What does `\d{3} \d{3}-\d{4}` match?
What does `\bcat\b` match? Does it find a match in `'my cat is hungry'`? What about `'concatenate'`?
Write a regular expression that matches any string that:
is between 5 and 10 characters long, and
is made up of only vowels (either uppercase or lowercase, including `'Y'` and `'y'`), periods, and spaces.
Examples include `'yoo.ee.IOU'` and `'AI.I oey'`.
One answer: `^[aeiouyAEIOUY. ]{5,10}$`
Note that inside a character class (`[...]`), special characters do not generally need to be escaped.
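
As a sketch, the word-boundary question and the last exercise can be checked as follows. The phone-style sentence and the rejected candidates (`'aeio'`, `'aeiouaeioua'`, `'hello'`) are made up for illustration.

```python
import re

# \b only matches at the edge of a word, so the 'cat' inside
# 'concatenate' is not reported.
cat_in_sentence = re.findall(r'\bcat\b', 'my cat is hungry')
cat_in_word = re.findall(r'\bcat\b', 'concatenate')

# \d{3} \d{3}-\d{4}: three digits, a space, three digits, a hyphen, four digits.
numbers = re.findall(r'\d{3} \d{3}-\d{4}', 'call 858 534-2230 today')

# Inside [...], the period is a literal period, not a wildcard.
vowelish = re.compile(r'^[aeiouyAEIOUY. ]{5,10}$')
candidates = ['yoo.ee.IOU', 'AI.I oey', 'aeio', 'aeiouaeioua', 'hello']
accepted = [s for s in candidates if vowelish.search(s)]

print(cat_in_sentence)  # ['cat']
print(cat_in_word)      # []
print(numbers)          # ['858 534-2230']
print(accepted)         # ['yoo.ee.IOU', 'AI.I oey']
```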
`re` in Python
The `re` module is built into Python. It allows us to use regular expressions to find, extract, and replace strings.
import re
`re.search` takes in a string `regex` and a string `text` and returns the location and substring corresponding to the first match of `regex` in `text`.
re.search('AB*A',
'here is a string for you: ABBBA. here is another: ABBBBBBBA')
<re.Match object; span=(26, 31), match='ABBBA'>
`re.findall` takes in a string `regex` and a string `text` and returns a list of all matches of `regex` in `text`. You'll use this most often.
re.findall('AB*A',
'here is a string for you: ABBBA. here is another: ABBBBBBBA')
['ABBBA', 'ABBBBBBBA']
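
To make the difference between the two functions concrete, the sketch below pulls the pieces out of the `Match` object that `re.search` returns and compares them with the plain list from `re.findall`. The pattern `'XYZ'` is an arbitrary example of something that does not occur in the text.

```python
import re

text = 'here is a string for you: ABBBA. here is another: ABBBBBBBA'

first = re.search('AB*A', text)
print(first.group())                      # ABBBA (the matched substring)
print(first.span())                       # (26, 31) (start and end positions)
print(text[first.start():first.end()])    # ABBBA

# When nothing matches, re.search returns None rather than a Match object.
missing = re.search('XYZ', text)
print(missing)                            # None

# re.findall returns only the matched strings, with no positions.
all_matches = re.findall('AB*A', text)
print(all_matches)                        # ['ABBBA', 'ABBBBBBBA']
```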
`re.sub` takes in a string `regex`, a string `repl`, and a string `text`, and replaces all matches of `regex` in `text` with `repl`.
re.sub('AB*A',
'billy',
'here is a string for you: ABBBA. here is another: ABBBBBBBA')
'here is a string for you: billy. here is another: billy'
When using regular expressions in Python, it's a good idea to use raw strings, denoted by an `r` before the quotes, e.g. `r'exp'`.
re.findall('\bcat\b', 'my cat is hungry')
[]
re.findall(r'\bcat\b', 'my cat is hungry')
['cat']
# Huh?
print('\bcat\b')
cat
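
The reason for the empty result is that, in an ordinary string literal, Python itself interprets `\b` as a backspace character before `re` ever sees the pattern. A small sketch (the variable names are illustrative):

```python
import re

plain = '\bcat\b'    # Python turns each \b into one backspace character
raw = r'\bcat\b'     # backslashes are kept, so re sees the \b word boundary

print(len(plain), len(raw))     # 5 7
print(plain[0] == '\x08')       # True: a backspace, not a backslash

no_match = re.findall(plain, 'my cat is hungry')
match = re.findall(raw, 'my cat is hungry')
# Doubling the backslashes works too, but is harder to read.
doubled = re.findall('\\bcat\\b', 'my cat is hungry')

print(no_match, match, doubled)  # [] ['cat'] ['cat']
```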
Surround part of a regex with `(` and `)` to define a capture group within a pattern.
re.findall(r'\w+@(\w+)\.edu',
'my old email was billy@notucsd.edu, my new email is notbilly@ucsd.edu')
['notucsd', 'ucsd']
Notice what happens if we remove the `(` and `)`!
re.findall(r'\w+@\w+\.edu',
'my old email was billy@notucsd.edu, my new email is notbilly@ucsd.edu')
['billy@notucsd.edu', 'notbilly@ucsd.edu']
When using `re.findall`, all parenthesized groups are treated as capturing groups.
# A regex that matches strings with two of the same vowel followed by 3 digits
# We only want to capture the digits, but...
re.findall(r'(aa|ee|ii|oo|uu)(\d{3})', 'eeoo124')
[('oo', '124')]
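
One possible way around this, not covered in the text above, is the non-capturing group syntax `(?:...)` from the `re` module. It groups for the purposes of "or" without adding an entry to the results. A minimal sketch:

```python
import re

# Plain parentheses: both groups are captured, so findall returns tuples.
both = re.findall(r'(aa|ee|ii|oo|uu)(\d{3})', 'eeoo124')

# (?:...) groups the vowels without capturing them,
# so only the digits are returned.
digits_only = re.findall(r'(?:aa|ee|ii|oo|uu)(\d{3})', 'eeoo124')

print(both)         # [('oo', '124')]
print(digits_only)  # ['124']
```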
If `pandas` string methods work for your task, you can still use those.
Regular expressions can also be used in `pandas` (through `.str`).