Lecture 15 – Requests and Parsing HTML

DSC 80, Winter 2023

📣 Announcements

Agenda

Data formats

The data formats of the internet

Responses typically come in one of two formats: HTML or JSON.

JSON

JSON data types

| Type | Description |
| --- | --- |
| String | Anything inside double quotes. |
| Number | Any number (no difference between ints and floats). |
| Boolean | Either true or false. |
| Null | JSON's empty value, denoted by null. |
| Array | Like Python lists. |
| Object | A collection of key-value pairs, like dictionaries. Keys must be strings; values can be anything, even other objects. |

See json-schema.org for more details.

Example JSON object

See data/family.json.

Aside: eval

eval gone wrong

Observe what happens when we use eval on a string representation of a JSON object:
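The notebook cell isn't reproduced in these notes; a minimal sketch of the failure, using a hypothetical JSON string, looks like this. JSON's literals true, false, and null are not valid Python names, so eval raises a NameError (and, more importantly, eval will happily execute arbitrary code, which is a security risk):

```python
# A string representation of a JSON object. The values true and null are
# valid JSON, but they are not defined names in Python.
json_string = '{"name": "Sam", "plays_tennis": true, "pet": null}'

try:
    eval(json_string)
except NameError as e:
    print(e)   # name 'true' is not defined
```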

Using the json module

Let's process the same file using the json module. Recall:
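A sketch of the safe approach, using the same hypothetical string as above — json.loads translates JSON's literals into their Python equivalents, and never executes code:

```python
import json

json_string = '{"name": "Sam", "plays_tennis": true, "pet": null}'

data = json.loads(json_string)   # true -> True, null -> None
print(data['plays_tennis'])      # True

# To parse a file like data/family.json, use json.load on an open file:
# with open('data/family.json') as f:
#     family = json.load(f)
```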

Handling unfamiliar data

APIs and scraping

Programmatic requests

APIs

An API (application programming interface) is a service that makes data directly available to the user in a convenient, structured fashion.

Advantages:

Disadvantages:

API terminology

API requests

First, let's make a GET request for 'squirtle'.

Remember, the 200 status code is good! Let's take a look at the content:

Looks like JSON. We can extract the JSON from this request with the json method (or by passing r.text to json.loads).

Let's try a GET request for 'billy'.

Uh oh...
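The request cells themselves aren't shown here. A sketch of the workflow, assuming the public PokéAPI (pokeapi.co) as the endpoint — the names 'squirtle' and 'billy' illustrate a successful (200) and an unsuccessful (404) request:

```python
import requests

BASE_URL = 'https://pokeapi.co/api/v2/pokemon/'   # assumed endpoint

def get_pokemon(name):
    """Return the parsed JSON for a Pokémon, or None if the request fails."""
    try:
        r = requests.get(BASE_URL + name, timeout=10)
    except requests.RequestException:
        return None
    if r.status_code == 200:   # 200 means OK
        return r.json()        # equivalent to json.loads(r.text)
    return None                # e.g. a 404 Not Found for 'billy'

print(get_pokemon('billy'))    # None: there is no Pokémon named 'billy'
```

Checking r.status_code (or r.ok) before calling r.json() avoids trying to parse an error page as JSON.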

Scraping

Scraping is the act of programmatically "browsing" the web, downloading the source code (HTML) of pages that you're interested in extracting data from.

Advantages:

Disadvantages:

Accessing HTML

Goal: Access information about HDSI faculty members from the HDSI Faculty page.

Let's start by making a GET request to the HDSI Faculty page and see what the resulting HTML looks like.

Wow, that is gross looking! 😰

Best practices for scraping

  1. Send requests slowly and be upfront about what you are doing!
  2. Respect the policy published in the page's robots.txt file.
    • Many sites have a robots.txt file in their root directory, which contains a policy that allows or disallows automatic access to their site.
    • See here or Lab 5, Question 7 for more details.
  3. Don't spoof your User-Agent header (i.e. don't try to trick the server into thinking you are a person).
  4. Read the site's Terms of Service and follow them.
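Python's standard library can check a robots.txt policy for you. A minimal sketch, using a hypothetical policy (in practice you'd call set_url and read to download a site's actual robots.txt):

```python
import urllib.robotparser

# A hypothetical robots.txt policy: all crawlers are barred from /private/.
policy = [
    'User-agent: *',
    'Disallow: /private/',
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(policy)

print(rp.can_fetch('*', '/private/data.html'))   # False: disallowed
print(rp.can_fetch('*', '/faculty.html'))        # True: allowed
```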

Consequences of irresponsible scraping

If you make too many requests:

The anatomy of HTML documents

What is HTML?

For instance, here's the content of a very basic webpage.

Using IPython.display.HTML, we can render it directly in our notebook.
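The page's source isn't reproduced in these notes; a hypothetical minimal webpage might look like:

```html
<html>
  <head>
    <title>Page title</title>
  </head>
  <body>
    <h1>A heading</h1>
    <p>Hello, world!</p>
  </body>
</html>
```

In a notebook, passing this string to IPython.display.HTML renders the page inline.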

The anatomy of HTML documents

(source)

Useful tags to know

| Element | Description |
| --- | --- |
| <html> | the document |
| <head> | the header |
| <body> | the body |
| <div> | a logical division of the document |
| <span> | an inline logical division |
| <p> | a paragraph |
| <a> | an anchor (hyperlink) |
| <h1>, <h2>, ... | headings |
| <img> | an image |

There are many, many more. See this article for examples.

Tags can have attributes, which further specify how to display information on a webpage.

For instance, <img> tags have src and alt attributes (among others):

<img src="king-selfie.png" alt="A photograph of King Triton." width=500>

Hyperlinks have href attributes:

Click <a href="https://dsc80.com/project3">this link</a> to access Project 3.

What do you think this webpage looks like?

The <div> tag

<div style="background-color:lightblue">
  <h3>This is a heading</h3>
  <p>This is a paragraph.</p>
</div>

Document trees

Under the document object model (DOM), HTML documents are trees. In DOM trees, child nodes are ordered.

What does the DOM tree look like for this document?

Example: Quote scraping

Consider the following webpage.

Parsing HTML using Beautiful Soup

Beautiful Soup 🍜

Example HTML document

To start, we'll work with the source code for an HTML page with the DOM tree shown below:

The string html_string contains an HTML "document".

BeautifulSoup objects

bs4.BeautifulSoup takes in a string or file-like object representing HTML (markup) and returns a parsed document.

Normally, we pass the result of a GET request to bs4.BeautifulSoup, but here we will pass our hand-crafted html_string.

BeautifulSoup objects have several useful attributes, e.g. text:
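A sketch of the construction, using a hand-crafted document standing in for the lecture's html_string (the content here is hypothetical):

```python
import bs4

html_string = '''
<html>
  <body>
    <div id="content">
      <h1>Heading here</h1>
      <p>My First paragraph</p>
    </div>
    <div id="nav">
      <ul>
        <li>item 1</li>
        <li>item 2</li>
      </ul>
    </div>
  </body>
</html>
'''

# 'html.parser' is Python's built-in parser; bs4 also supports others.
soup = bs4.BeautifulSoup(html_string, 'html.parser')
print(soup.text)   # all of the text on the page, with the tags stripped
```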

Traversing through descendants

The descendants attribute yields every node in a BeautifulSoup tree, in depth-first order.

Why depth-first? Elements closer to one another on a page are more likely to be related than elements further away.
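A sketch on a small, hypothetical document — note that the entire <p> subtree (including the text inside <b>) is visited before <span>:

```python
import bs4

soup = bs4.BeautifulSoup('<div><p><b>bold</b></p><span>side</span></div>',
                         'html.parser')

# Depth-first order: div, p, b, 'bold', span, 'side'.
for node in soup.descendants:
    print(getattr(node, 'name', None))   # text nodes have no tag name
```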

Finding elements in a tree

Practically speaking, you will not use the descendants attribute (or the related children attribute) directly very often. Instead, you will use the following methods:

Using find

Let's try and extract the first <div> subtree.

Let's try and find the <div> element that has an id attribute equal to 'nav'.

find will return the first occurrence of a tag, regardless of its depth in the tree.
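A sketch of both lookups, on a small hypothetical document with two <div> elements:

```python
import bs4

html_string = '''
<div id="content"><p>First paragraph</p></div>
<div id="nav"><p>Nav links</p></div>
'''
soup = bs4.BeautifulSoup(html_string, 'html.parser')

first_div = soup.find('div')                        # the first <div> subtree
nav_div = soup.find('div', attrs={'id': 'nav'})     # the <div> with id 'nav'
print(nav_div.text)
```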

Using find_all

find_all returns a list of all matches.
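For instance, on a hypothetical list:

```python
import bs4

soup = bs4.BeautifulSoup(
    '<ul><li>one</li><li>two</li><li>three</li></ul>', 'html.parser')

# find_all returns every matching tag as a list, in depth-first order.
items = soup.find_all('li')
print([li.text for li in items])   # ['one', 'two', 'three']
```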

Node attributes

The get method must be called directly on the node that contains the attribute you're looking for.
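A sketch, reusing the <img> example from earlier in the lecture:

```python
import bs4

soup = bs4.BeautifulSoup(
    '<img src="king-selfie.png" alt="A photograph of King Triton.">',
    'html.parser')

img = soup.find('img')
print(img.get('src'))    # 'king-selfie.png'
print(soup.get('src'))   # None: get wasn't called on the <img> node itself
```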

Summary, next time

Summary

Next time