Lecture 16 – Parsing, Regular Expressions

DSC 80, Spring 2022

Announcements

Agenda

Example: Scraping the HDSI Faculty page

HDSI Faculty page

Let's try to extract a list of HDSI Faculty from https://datascience.ucsd.edu/about/faculty/faculty/.

A good first step is to use the "inspect element" tool in our web browser.

It seems like the relevant <div>s for faculty are the ones where the data-entry-type attribute is equal to 'individual'. Let's find all of those using find_all.

Within here, we need to extract each faculty member's name. It seems like names are stored in the title attribute within an <a> tag.
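As a minimal sketch of these two steps, the snippet below runs find_all on a small hand-written HTML string rather than on the live page. The markup, URLs, and names are invented stand-ins; only the data-entry-type='individual' filter and the title attribute on the <a> tag come from the observations above.

```python
from bs4 import BeautifulSoup

# Invented stand-in for the faculty page; the real markup is more complex.
html = """
<div data-entry-type="individual">
  <a href="/people/ada" title="Ada Example">Ada Example</a>
</div>
<div data-entry-type="organization">
  <a href="/org" title="Not a person">Some Org</a>
</div>
<div data-entry-type="individual">
  <a href="/people/bob" title="Bob Sample">Bob Sample</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only the <div>s whose data-entry-type attribute equals 'individual'.
divs = soup.find_all("div", attrs={"data-entry-type": "individual"})

# Each name is read from the title attribute of the first <a> tag in the <div>.
names = [div.find("a").get("title") for div in divs]
print(names)
```

Passing attrs={...} is used here because data-entry-type contains hyphens and so cannot be written as a Python keyword argument.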

We can also extract job titles:

And bios:

Let's create a DataFrame consisting of names and bios for each faculty member.
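One way this step might look, assuming the names and bios have already been collected into two equal-length lists. The values below are hypothetical placeholders, not real faculty data.

```python
import pandas as pd

# Hypothetical values standing in for the lists extracted from the page.
names = ["Ada Example", "Bob Sample"]
bios = ["Studies parsing.", "Studies regular expressions."]

# Each list becomes one column, so the DataFrame is built column by column.
faculty = pd.DataFrame({"name": names, "bio": bios})
print(faculty)
```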

Now we have a DataFrame!

What if we want to get faculty members' pictures? It seems like we should look at the attributes of an <img> tag.

Example: Scraping quotes


Let's scrape quotes from https://quotes.toscrape.com/.

Specifically, let's try to make a DataFrame that looks like the one below:

quote author author_url tags
0 “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” Albert Einstein https://quotes.toscrape.com/author/Albert-Einstein change,deep-thoughts,thinking,world
1 “It is our choices, Harry, that show what we truly are, far more than our abilities.” J.K. Rowling https://quotes.toscrape.com/author/J-K-Rowling abilities,choices
2 “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.” Albert Einstein https://quotes.toscrape.com/author/Albert-Einstein inspirational,life,live,miracle,miracles

The plan

Eventually, we will create a single function – quote_df – which takes in an integer n and returns a DataFrame with the quotes on the first n pages of https://quotes.toscrape.com/.

To do this, we will define several helper functions:

Key principle: some of our helper functions will make requests, and others will parse, but none will do both!

Aside: f-strings in Python
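A short illustration of f-strings, which are convenient for building page URLs. The /page/<number>/ URL pattern and the variable names are assumptions made for this sketch.

```python
page = 5
n_pages = 10

# Expressions inside {} are evaluated and inserted into the string.
url = f"https://quotes.toscrape.com/page/{page}/"
print(url)

# Any expression works, and a format spec can follow a colon.
progress = f"Page {page} of {n_pages} ({page / n_pages:.0%} done)"
print(progress)
```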

Downloading a single page

In quote_df, we will call download_page repeatedly: once for i = 1, once for i = 2, ..., and once for i = n. For now, we will work with just page 5 (chosen arbitrarily).
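A minimal sketch of what download_page could look like. It assumes pages live at /page/<i>/, and it uses the standard library's urllib.request only to stay dependency-free; the lecture's own version may differ (for example, by using requests). The helper page_url is a hypothetical name introduced here.

```python
from urllib.request import urlopen

BASE_URL = "https://quotes.toscrape.com"

def page_url(i):
    """Return the URL of page i (assumed pattern: /page/<i>/)."""
    return f"{BASE_URL}/page/{i}/"

def download_page(i):
    """Request page i and return its HTML as a string.

    This function makes a request and does no parsing.
    """
    with urlopen(page_url(i)) as response:
        return response.read().decode("utf-8")

# Building the URL needs no network access; download_page(5) would fetch it.
print(page_url(5))
```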

Parsing a single page

Let's look at the page's source code (via "inspect element") to find where the quotes in the page are located.

From this <div>, we can extract the quote, author name, author's URL, and tags.

Let's write an intermediate function, process_quote, which takes in a <div> corresponding to a single quote and returns a Series containing the quote's information.

Note that this approach differs from the one taken in the HDSI Faculty page example. There, we created each column of our final DataFrame separately; here, we are creating one row of our final DataFrame at a time.

Next, we can write a function that takes in a list of <div>s, calls the above function on each <div> in the list, and returns a DataFrame.
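The two parsing helpers might be sketched as follows. The HTML string is a hand-written stand-in that is assumed to mirror the structure of a quote <div> on the site (class names quote, text, author, and keywords); the quotes and authors in it are invented. The name process_page is also an assumption, since only process_quote is named above.

```python
import pandas as pd
from bs4 import BeautifulSoup

BASE_URL = "https://quotes.toscrape.com"

# Hand-written stand-in for one page of quotes (assumed structure).
html = """
<div class="quote">
  <span class="text">“A sample quote.”</span>
  <span>by <small class="author">Ada Example</small>
    <a href="/author/Ada-Example">(about)</a></span>
  <div class="tags">
    <meta class="keywords" content="sample,short">
    <a class="tag" href="/tag/sample/">sample</a>
    <a class="tag" href="/tag/short/">short</a>
  </div>
</div>
<div class="quote">
  <span class="text">“Another sample quote.”</span>
  <span>by <small class="author">Bob Sample</small>
    <a href="/author/Bob-Sample">(about)</a></span>
  <div class="tags">
    <meta class="keywords" content="regex">
    <a class="tag" href="/tag/regex/">regex</a>
  </div>
</div>
"""

def process_quote(div):
    """Turn one quote <div> into a Series (one row of the final DataFrame)."""
    quote = div.find("span", class_="text").text
    author = div.find("small", class_="author").text
    # The first <a> in the <div> is assumed to be the author link.
    author_url = BASE_URL + div.find("a").get("href")
    tags = div.find("meta", class_="keywords").get("content")
    return pd.Series({"quote": quote, "author": author,
                      "author_url": author_url, "tags": tags})

def process_page(divs):
    """Call process_quote on each <div> and stack the rows into a DataFrame."""
    return pd.DataFrame([process_quote(div) for div in divs])

soup = BeautifulSoup(html, "html.parser")
quotes = process_page(soup.find_all("div", class_="quote"))
print(quotes)
```

Neither helper makes a request: both only parse HTML that has already been downloaded, in keeping with the key principle above.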

Putting it all together

The elements in the 'tags' column are all strings, but they look like lists. This is not ideal, as we will see shortly.

An extension

We could:

Key takeaways

Nested vs. flat data formats


Example: Scraping quotes, again

Note that for a single quote, we have keys for 'auth_url', 'quote_auth', 'quote_text', 'bio', 'dob', and 'tags'.

Since each line is a separate JSON object, let's read in each line one at a time.
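A sketch of line-by-line reading, using an in-memory buffer with two invented records in place of the real file. Only a few of the keys listed above are included to keep the example short.

```python
import io
import json

# Two invented records in JSON Lines form; a real file would be opened with open(...).
raw = io.StringIO(
    '{"quote_auth": "Ada Example", "quote_text": "A sample quote.", "tags": ["sample", "short"]}\n'
    '{"quote_auth": "Bob Sample", "quote_text": "Another one.", "tags": ["regex"]}\n'
)

# Parse each non-empty line as its own JSON object.
records = [json.loads(line) for line in raw if line.strip()]
print(records[0]["tags"])
```

Calling json.load on the whole file would fail here, because the file as a whole is not a single valid JSON document.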

Let's convert the result to a DataFrame.

What data type is the 'tags' column?

Let's save df to a CSV and read it back in.

What data type is the 'tags' column now?
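The round trip can be reproduced on a tiny invented DataFrame, with an in-memory buffer standing in for the CSV file. The use of ast.literal_eval to recover the lists is one possible fix, not necessarily the one used in lecture.

```python
import ast
import io

import pandas as pd

df = pd.DataFrame({
    "quote_auth": ["Ada Example", "Bob Sample"],
    "tags": [["sample", "short"], ["regex"]],
})
print(type(df.loc[0, "tags"]))        # each element is a list

# Round-trip through CSV text.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
df_back = pd.read_csv(buffer)
print(type(df_back.loc[0, "tags"]))   # each element is now a str

# One way to recover the lists: evaluate each string as a Python literal.
restored = df_back["tags"].apply(ast.literal_eval)
```

CSV is a flat text format, so nested values such as lists are written out as their string representations and are not converted back on reading.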

One-hot encoding

Let's write a function that takes in the list of tags (taglist) for a given quote and returns the one-hot-encoded sequence of 1s and 0s for that quote.
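A possible version of this function, shown on invented data. The name flatten_tags and the choice to return a Series indexed by tag are assumptions made for this sketch.

```python
import pandas as pd

df = pd.DataFrame({
    "quote_text": ["A sample quote.", "Another one.", "A third."],
    "tags": [["sample", "short"], ["regex"], ["short", "regex"]],
})

# Every distinct tag becomes one column of the encoding.
all_tags = sorted({tag for taglist in df["tags"] for tag in taglist})

def flatten_tags(taglist):
    """Return a Series with 1 for each tag in taglist and 0 for every other tag."""
    return pd.Series({tag: int(tag in taglist) for tag in all_tags})

# Applying a Series-returning function to a column yields a DataFrame.
tag_df = df["tags"].apply(flatten_tags)
print(tag_df)
```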

Let's combine this one-hot-encoded DataFrame with df.

If we want all quotes tagged 'inspiration', we can simply query:

Note that this DataFrame representation of the response JSON takes up much more space than the original JSON. Why is that?

String methods, revisited

Transitioning

Joining on text

Consider the following two DataFrames (see this presentation for inspiration).

What would happen if we try to merge the two DataFrames on 'department'?

String canonicalization

Now, we can join codes with programs on 'department_clean'.
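A sketch of the whole canonicalize-then-join workflow. The contents of codes and programs are invented for illustration, and the rules inside canonicalize (lowercasing, spelling out '&', and replacing punctuation with spaces) are one hypothetical choice among many.

```python
import pandas as pd

# Invented data: the same departments are spelled differently in each table.
codes = pd.DataFrame({
    "department": ["Data Science", "Computer Sci. & Eng."],
    "code": ["DSC", "CSE"],
})
programs = pd.DataFrame({
    "department": ["data-science", "COMPUTER SCI AND ENG"],
    "program": ["BS", "MS"],
})

# Merging on the raw strings finds no matching rows.
raw_merge = codes.merge(programs, on="department")

def canonicalize(dept):
    """Lowercase, spell out '&', turn punctuation into spaces, collapse whitespace."""
    dept = dept.lower().replace("&", "and")
    dept = "".join(ch if ch.isalnum() else " " for ch in dept)
    return " ".join(dept.split())

codes["department_clean"] = codes["department"].apply(canonicalize)
programs["department_clean"] = programs["department"].apply(canonicalize)

# After canonicalization, both rows line up.
clean_merge = codes.merge(programs, on="department_clean")
print(clean_merge[["department_clean", "code", "program"]])
```

Every rule in canonicalize was written with these particular spellings in mind, so a new variant (say, an abbreviation) would silently fail to match.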

Reflection

The process of string canonicalization is very brittle.

The limitations of string methods

How can we extract the date and time from the following log string, using just Python string methods?

132.249.20.188 - - [05/May/2022:14:26:15 -0800] "GET /my/home/ HTTP/1.1" 200 2585

Parsing log strings
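One way to do this with repeated calls to split, shown as a sketch that relies on the exact layout of this particular log string.

```python
s = '132.249.20.188 - - [05/May/2022:14:26:15 -0800] "GET /my/home/ HTTP/1.1" 200 2585'

# The timestamp sits between '[' and the space before the UTC offset.
timestamp = s.split("[")[1].split(" ")[0]

# The date and the time are separated by the first ':'.
date, time = timestamp.split(":", 1)
day, month, year = date.split("/")
hour, minute, second = time.split(":")

print(day, month, year, hour, minute, second)
```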

Alternatively:
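One such alternative, sketched here, locates the square brackets with find and slices between them. It is not necessarily the version shown in lecture.

```python
s = '132.249.20.188 - - [05/May/2022:14:26:15 -0800] "GET /my/home/ HTTP/1.1" 200 2585'

# Slice out everything between the square brackets.
start = s.find("[") + 1
end = s.find("]")
timestamp, offset = s[start:end].split()

# Make every separator a '/' so that a single split yields all six pieces.
day, month, year, hour, minute, second = timestamp.replace(":", "/").split("/")

print(day, month, year, hour, minute, second)
```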

That was annoying! Let's see if there's a better way to extract the same information.

Regular expressions
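For comparison, a single regular expression can pull out all six pieces at once. The pattern below is one possible choice and uses syntax (\d, {n}, character classes, capture groups) that has not been introduced yet.

```python
import re

s = '132.249.20.188 - - [05/May/2022:14:26:15 -0800] "GET /my/home/ HTTP/1.1" 200 2585'

# Each pair of parentheses captures one piece of the timestamp.
pattern = r"\[(\d{2})/([A-Za-z]{3})/(\d{4}):(\d{2}):(\d{2}):(\d{2})"

day, month, year, hour, minute, second = re.search(pattern, s).groups()
print(day, month, year, hour, minute, second)
```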

This works...?

🤔🤯

Regular expressions

regex101.com

Regex building blocks 🧱

The four main building blocks for all regexes are shown below (table source, inspiration).

operation                 order of op.   example      matches ✅        does not match ❌
concatenation             3              AABAAB       AABAAB            every other string
or                        4              AA|BAAB      AA, BAAB          every other string
closure (zero or more)    2              AB*A         AA, ABBBBBBA      AB, ABABA
parentheses               1              A(A|B)AAB    AAAAB, ABAAB      every other string
                                         (AB)*A       A, ABABABABA      AA, ABBA

Note that |, (, ), and * are special characters, not literals. They manipulate the characters around them.

AB*A matches strings with an 'A', followed by zero or more 'B's, and then an 'A'.

Matches ✅: 'AA', 'ABA', 'ABBBBBBBBBBBBBBA'
Does not match ❌: 'AB', 'ABAB'

(AB)*A matches strings with zero or more 'AB's, followed by an 'A'.

Matches ✅: 'A', 'ABA', 'ABABABABA'
Does not match ❌: 'AA', 'ABBBBBBBA', 'ABAB'
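These claims can be checked in Python with re.fullmatch, which requires the entire string to match the pattern. The helper name matches is introduced only for this sketch.

```python
import re

def matches(pattern, s):
    """Return True if the whole string s matches pattern, not just a substring."""
    return re.fullmatch(pattern, s) is not None

# Closure applies to the single character before it: zero or more 'B's.
closure = {s: matches("AB*A", s) for s in ["AA", "ABA", "AB", "ABAB"]}

# With parentheses, closure applies to the whole group 'AB'.
grouped = {s: matches("(AB)*A", s) for s in ["A", "ABA", "ABABABABA", "AA", "ABAB"]}

print(closure)
print(grouped)
```

re.fullmatch is used instead of re.search because re.search would report a match whenever any substring fits the pattern.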

Example 1

Write a regular expression that matches 'billy', 'billlly', 'billlllly', etc.



✅ Answer (try it yourself at regex101.com first): bi(ll)*y matches any even number of 'l's, including zero. To match only a positive even number of 'l's, we first "fix into place" two 'l's, and then follow them with zero or more pairs of 'l's. This gives the regular expression bill(ll)*y.
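A quick check of the two patterns against strings with zero through six 'l's, sketched with re.fullmatch. The test strings are built programmatically to avoid miscounting letters.

```python
import re

# Strings 'biy', 'bily', 'billy', ... with 0 through 6 'l's.
counts = range(7)

def l_counts_matched(pattern):
    """Return the numbers of 'l's for which 'bi' + 'l' * n + 'y' fully matches."""
    return [n for n in counts if re.fullmatch(pattern, "bi" + "l" * n + "y")]

loose = l_counts_matched(r"bi(ll)*y")     # also accepts zero 'l's
strict = l_counts_matched(r"bill(ll)*y")  # requires at least two 'l's

print(loose)
print(strict)
```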

Example 2

Write a regular expression that matches 'billy', 'billlly', 'biggy', 'biggggy', etc.

Specifically, it should match any string with a positive even number of 'l's in the middle, or a positive even number of 'g's in the middle.



✅ Answer (try it yourself at regex101.com first): possible answers are bi(ll(ll)*|gg(gg)*)y and bill(ll)*y|bigg(gg)*y.
Note that bill(ll)*|gg(gg)*y is not a valid answer! "Concatenation" comes before "or" in the order of operations, so this regular expression matches strings that match bill(ll)*, like 'billll', or strings that match gg(gg)*y, like 'ggy'.
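The difference between the valid and invalid answers can be confirmed with re.fullmatch. The variable names below are chosen only for this sketch.

```python
import re

grouped = r"bi(ll(ll)*|gg(gg)*)y"
spelled_out = r"bill(ll)*y|bigg(gg)*y"
invalid = r"bill(ll)*|gg(gg)*y"

def matches(pattern, s):
    """Return True if the whole string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

should_match = ["billy", "bi" + "l" * 4 + "y", "biggy", "bi" + "g" * 4 + "y"]
should_not_match = ["biy", "bily", "billggy"]

for s in should_match + should_not_match:
    print(s, matches(grouped, s), matches(spelled_out, s), matches(invalid, s))

# The invalid pattern splits at '|' into 'bill(ll)*' and 'gg(gg)*y'.
print(matches(invalid, "billll"), matches(invalid, "ggy"))
```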

Summary, next time

Summary