Lecture 15 – Requests and Parsing HTML

DSC 80, Spring 2022

Announcements

Agenda

APIs and web scraping

Programmatic requests

APIs

API terminology

API requests

First, let's make a GET request for 'squirtle'.
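A minimal sketch, assuming we're querying the public PokéAPI (the exact endpoint used in lecture is an assumption):

import requests

r = requests.get('https://pokeapi.co/api/v2/pokemon/squirtle')
r.status_code   # 200 if the request succeeded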

Remember, the 200 status code is good! Let's take a look at the content:
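For instance, assuming r is the response object from the sketch above:

r.text[:100]   # the first 100 characters of the response body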

Looks like JSON. We can extract the JSON from this request with the json method (or by passing r.text to json.loads).
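A sketch, continuing with the same response:

import json

r.json()            # parse the response body as JSON (returns a dict)
json.loads(r.text)  # equivalent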

Let's try a GET request for 'billy'.
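Again a sketch against the same (assumed) PokéAPI endpoint:

r = requests.get('https://pokeapi.co/api/v2/pokemon/billy')
r.status_code   # 404 – there is no Pokémon named 'billy'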

Uh oh...

Scraping

Accessing HTML

Let's make a GET request to the HDSI Faculty page and see what the resulting HTML looks like.
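A sketch, using the faculty page URL that appears later in the lecture:

import requests

r = requests.get('https://datascience.ucsd.edu/about/faculty/faculty/')
r.text[:500]   # the first 500 characters of the page's HTML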

Wow, that is gross looking! 😰

Best practices for scraping

  1. Send requests slowly and be upfront about what you are doing!
  2. Respect the policy published in the page's robots.txt file.
    • Many sites have a robots.txt file in their root directory, which contains a policy that allows or disallows automatic access to their site.
    • See Lab 5, Question 5 for more details; a quick way to peek at a site's robots.txt is sketched after this list.
  3. Don't spoof your User-Agent (i.e. don't try to trick the server into thinking you are a person).
  4. Read the site's Terms of Service and follow them.
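A robots.txt file lives at the site's root; a quick way to look at one (the domain here is just an illustrative example):

import requests

# Example domain only – swap in the site you intend to scrape.
print(requests.get('https://en.wikipedia.org/robots.txt').text[:300])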

Consequences of irresponsible scraping

If you make too many requests:

The anatomy of HTML documents

What is HTML?

The anatomy of HTML documents


Useful tags to know

Element          Description
<html>           the document
<head>           the header
<body>           the body
<div>            a logical division of the document
<span>           an in-line logical division
<p>              a paragraph
<a>              an anchor (hyperlink)
<h1>, <h2>, ...  headings
<img>            an image

There are many, many more. See this article for examples.

Tags can have attributes, which further specify how to display information on a webpage.

For instance, <img> tags have src and alt attributes (among others):

<img src="billy-selfie.png" alt="A photograph of Billy." width="500">

Hyperlinks have href attributes:

Click <a href="https://dsc80.com/project3">this link</a> to access Project 3.

What do you think this webpage looks like?

The <div> tag

<div style="background-color:lightblue">
  <h3>This is a heading</h3>
  <p>This is a paragraph.</p>
</div>

Document trees

Under the document object model (DOM), HTML documents are trees. In DOM trees, child nodes are ordered.

What does the DOM tree look like for this document?

Example: Quote scraping

Consider the following webpage.

Parsing HTML via Beautiful Soup

Beautiful Soup 🍜

Example HTML document

To start, let's instantiate a BeautifulSoup object, using the source code for an HTML page with the DOM tree shown below:

The string html_string contains an HTML "document".
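The original source string isn't reproduced in these notes; a small hypothetical stand-in, consistent with the later examples (which look for <div> elements, including one with an id of 'nav'), might look like this:

html_string = '''
<html>
  <body>
    <div id="content">
      <h1>Heading here</h1>
      <p>My first paragraph</p>
    </div>
    <div id="nav">
      <ul>
        <li>item 1</li>
        <li>item 2</li>
      </ul>
    </div>
  </body>
</html>
'''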

Using the HTML function in the IPython.display module, we can render an HTML document from within our Jupyter Notebook:
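For example:

from IPython.display import HTML

HTML(html_string)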

BeautifulSoup objects

bs4.BeautifulSoup takes in a string or file-like object representing HTML (markup) and returns a parsed document.

Normally, we pass the result of a GET request to bs4.BeautifulSoup, but here we will pass our hand-crafted html_string.
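A minimal sketch:

import bs4

soup = bs4.BeautifulSoup(html_string)   # optionally pass features='html.parser' to silence the parser warning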

BeautifulSoup objects have several useful attributes, e.g. text:
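For instance:

soup.text   # all of the document's text, with the tags stripped away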

Child nodes

Aside: iterators

On the previous slide, we saw that soup.children isn't another BeautifulSoup object, but rather something of the form <list_iterator at 0x7f7b0ab8c370>.

What are iterators, again?
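A quick refresher; an iterator produces its values one at a time, on demand:

nums = iter([1, 2, 3])
next(nums)   # 1
next(nums)   # 2
next(nums)   # 3
next(nums)   # raises StopIteration – the iterator is exhausted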

Child nodes

The children attribute returns an iterator rather than a list, so the child nodes are produced one at a time instead of all being materialized at once.
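To inspect the children, we can loop over the iterator (or materialize it with list); a sketch:

# The names of soup's direct children; plain text nodes show up with a name of None.
[child.name for child in soup.children]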

Depth-first traversal through descendants
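The descendants attribute traverses the entire subtree depth-first; a sketch:

for node in soup.descendants:
    if node.name is not None:   # skip the plain text nodes
        print(node.name)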

Finding elements in a tree

Practically speaking, you will not use the children or descendants attributes directly very often. Instead, you will use the find and find_all methods:

Using find

Let's try to extract the first <div> subtree.
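A sketch:

soup.find('div')   # the subtree rooted at the first <div> in the document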

Let's try to find the <div> element whose id attribute is equal to 'nav'.
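For instance:

soup.find('div', attrs={'id': 'nav'})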

find will return the first occurrence of a tag, regardless of what depth it is in the tree.

Using find_all

find_all returns a list of all matches.
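For example:

soup.find_all('div')   # a list of all <div> subtrees in the document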

text is a node attribute.
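So, for example, we can grab the text of every <div>:

[div.text for div in soup.find_all('div')]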

Node attributes

You can access tags using attribute notation, too.
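A sketch:

soup.div            # equivalent to soup.find('div')
soup.div.get('id')  # the value of that <div>'s id attribute (None if it has no id)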

Example: Scraping the HDSI Faculty page

Example

Let's try to extract a list of HDSI faculty members from https://datascience.ucsd.edu/about/faculty/faculty/.

A good first step is to use the "inspect element" tool in our web browser.

It seems like the relevant <div>s for faculty are the ones where the data-entry-type attribute is equal to 'individual'. Let's find all of those.
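A sketch, assuming soup now holds the parsed faculty page:

divs = soup.find_all('div', attrs={'data-entry-type': 'individual'})
len(divs)   # the number of faculty entries found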

Within each of these <div>s, we need to extract the faculty member's name. It seems like names are stored in the title attribute of an <a> tag.
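If so, something like the following sketch may work:

names = [div.find('a').get('title') for div in divs]
names[:5]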

We can also extract job titles:

And bios:

Let's create a DataFrame consisting of names and bios for each faculty member.
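A sketch, assuming names and bios are equal-length lists (bios is hypothetical here, built similarly to names):

import pandas as pd

faculty = pd.DataFrame({'name': names, 'bio': bios})
faculty.head()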

Now we have a DataFrame!

What if we want to get faculty members' pictures? It seems like we should look at the attributes of an <img> tag.
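For instance, the photo URL is likely in the src attribute of each entry's <img> tag:

divs[0].find('img').get('src')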

Summary, next time

Summary