In [ ]:

from lec_utils import *

# For the JSON evaluation example.
def err():
    raise ValueError('i just deleted all your files lol 😂')

Lecture 9 – HTTP¶

Introduction to HTTP¶


Data sources¶

Often, the data you need doesn't exist in "clean" .csv files.
Solution: Collect your own data!
- Design and administer your own survey or run an experiment.
- Find related data on the internet.

The internet contains massive amounts of historical record; for most questions you can think of, the answer exists somewhere on the internet.

Collecting data from the internet¶

There are two ways to programmatically access data on the internet:
- through an API.
- by scraping.
We will discuss the differences between the two approaches, but for now, the important part is that they both use HTTP.

HTTP¶

HTTP stands for Hypertext Transfer Protocol.
- It was developed in 1989 by Tim Berners-Lee (and friends).
It is a request-response protocol.
- Protocol = set of rules.
HTTP allows...
- computers to talk to each other over a network.
- devices to fetch data from "web servers."
The "S" in HTTPS stands for "secure".

UCSD was a node in ARPANET, the predecessor to the modern internet (source).

The request-response model¶

HTTP follows the request-response model.

A request is made by the client.
A response is returned by the server.
Example: YouTube search 🎥.
- Consider the following URL: https://www.youtube.com/results?search_query=apple+vision+pro.
- Your web browser, a client, makes an HTTP request with a search query.
- The server, YouTube, is a computer that is sitting somewhere else.
- The server returns a response that contains the search results.
- Note: ?search_query=apple+vision+pro is called a "query string."
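
To see how the URL above breaks into pieces, here is a minimal sketch using Python's built-in urllib.parse module. It only takes the URL string apart and puts it back together; no request is made.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

url = 'https://www.youtube.com/results?search_query=apple+vision+pro'

parts = urlsplit(url)

# Everything after the "?" is the query string.
query_string = parts.query

# parse_qs decodes the query string into a dict of lists; "+" decodes to a space.
params = parse_qs(query_string)

# urlencode goes the other way, building a query string from a dict.
rebuilt = urlencode({'search_query': 'apple vision pro'})

print(parts.netloc, parts.path)
print(params)
print(rebuilt)
```

Here, parts.netloc identifies the server and parts.path identifies the resource on that server, while the query string carries the search terms.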

Request methods¶

The request methods you will use most often are GET and POST; see Mozilla's web docs for a detailed list of request methods.

GET is used to request data from a specified resource.
POST is used to send data to the server.
- For example, uploading a photo to Instagram or entering credit card information on Amazon.

Example `GET` request¶

Below is an example GET HTTP request made by a browser when accessing datascience.ucsd.edu.

GET / HTTP/1.1
Connection: keep-alive
Host: datascience.ucsd.edu
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36
sec-ch-ua: "Chromium";v="121", "Not A(Brand";v="99"
sec-ch-ua-platform: "macOS"

The first line (GET / HTTP/1.1) is called the "request line", and the lines afterwards are called "header fields". Header fields contain metadata.
We could also provide a "body" after the header fields.
To see HTTP requests in Google Chrome, follow these steps.
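
As a rough illustration of the request line / header fields / body structure, the sketch below takes apart a shortened copy of the request above using plain string operations. It assumes the standard layout (lines separated by "\r\n", with a blank line between the header fields and the body) and is not a full HTTP parser.

```python
# A shortened copy of the example request; the blank line marks the end of the headers.
raw_request = (
    'GET / HTTP/1.1\r\n'
    'Connection: keep-alive\r\n'
    'Host: datascience.ucsd.edu\r\n'
    '\r\n'
)

# Split the header section from the (here, empty) body.
head, _, body = raw_request.partition('\r\n\r\n')

# The first line is the request line; the rest are header fields.
request_line, *header_lines = head.split('\r\n')
method, target, version = request_line.split(' ')

# Each header field has the form "Name: value".
headers = dict(line.split(': ', 1) for line in header_lines)

print(method, target, version)
print(headers)
```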

Example `GET` response¶

The response below was generated by executing the request on the previous slide.

HTTP/1.1 200 OK
Date: Sun, 04 Feb 2024 17:35:01 GMT
Server: Apache
X-Powered-By: PHP/7.4.33
Link: <https://datascience.ucsd.edu/wp-json/>; rel="https://api.w.org/"
Link: <https://datascience.ucsd.edu/wp-json/wp/v2/pages/113>; rel="alternate"; type="application/json"
...

<html lang="en-US">
    <head>
        <meta charset="UTF-8">
        <meta name="viewport" content="width=device-width, initial-scale=1.0"/>
        <meta http-equiv="X-UA-Compatible" content="IE=edge">
        <link rel="profile" href="https://gmpg.org/xfn/11"/>
        <title>Halıcıoğlu Data Science Institute &#8211;UC San Diego</title>
        <script>
...

Consequences of the request-response model¶

When a request is sent to view content on a webpage, the server must:
- process your request (i.e. prepare data for the response).
- send content back to the client in its response.
Remember, servers are computers.
- Someone has to pay to keep these computers running.
- This means that every time you access a website, someone has to pay.
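
To make "servers are computers" concrete, here is a small sketch that runs a toy server and a client on the same machine, using only Python's standard library. The handler class HelloHandler and its response text are made up for illustration; real web servers do far more work per request.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer

class HelloHandler(BaseHTTPRequestHandler):
    """A toy server that answers every GET request with the same small page."""

    def do_GET(self):
        # "Process the request": prepare the data for the response.
        page = b'<html><body>Hello, client!</body></html>'
        # "Send content back": status line, header fields, then the body.
        self.send_response(200)
        self.send_header('Content-Type', 'text/html')
        self.send_header('Content-Length', str(len(page)))
        self.end_headers()
        self.wfile.write(page)

    def log_message(self, format, *args):
        pass  # Keep the output quiet.

# Port 0 asks the operating system for any free port.
server = HTTPServer(('127.0.0.1', 0), HelloHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# The client makes a request and reads the server's response.
conn = HTTPConnection('127.0.0.1', port)
conn.request('GET', '/')
response = conn.getresponse()
status, reason = response.status, response.reason
content_type = response.getheader('Content-Type')
body = response.read().decode()
conn.close()

server.shutdown()
server.server_close()

print(status, reason, content_type)
print(body)
```

Every call to do_GET uses the server's CPU time and bandwidth, which is the cost described above.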

Making HTTP requests¶

There are (at least) two ways to make HTTP requests outside of a browser:

From the command line, with curl.
From Python, with the requests package.

Making HTTP requests using `requests`¶

requests is a Python module that allows you to use Python to interact with the internet!
There are other packages that work similarly (e.g. urllib), but requests is arguably the easiest to use.

In [ ]:

import requests

Example: `GET` requests via `requests`¶

For instance, let's access the source code of the UCSD homepage, https://ucsd.edu.

In [ ]:

res = requests.get('https://ucsd.edu')

res is now a Response object.

In [ ]:

res

The text attribute of res is a string containing the entire body of the response.

In [ ]:

type(res.text)

In [ ]:

len(res.text)

In [ ]:

print(res.text[:1000])

Example: `POST` requests via `requests`¶

The following call to requests.post makes a POST request to https://httpbin.org/post, with a 'name' parameter of 'King Triton'.

In [ ]:

post_res = requests.post('https://httpbin.org/post',
                         data={'name': 'King Triton'})
post_res

In [ ]:

post_res.text

In [ ]:

# More on this shortly!
post_res.json()
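
When data is a dictionary, requests sends it as a form-encoded request body, the same encoding used for query strings. The sketch below approximates that encoding with the standard library; it is an illustration of the format, not the code that requests itself runs.

```python
from urllib.parse import parse_qs, urlencode

form = {'name': 'King Triton'}

# Form-encode the dict: keys and values are joined with "=", and spaces become "+".
encoded_body = urlencode(form)

# A server decodes the body back into fields on its end.
decoded = parse_qs(encoded_body)

print(encoded_body)
print(decoded)
```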

What happens when we try to make a POST request somewhere where we're unable to?

In [ ]:

yt_res = requests.post('https://youtube.com',
                       data={'name': 'King Triton'})
yt_res

In [ ]:

yt_res.text

yt_res.text is a string containing HTML – we can render this in-line using IPython.display.HTML.

In [ ]:

from IPython.display import HTML

In [ ]:

HTML(yt_res.text)

HTTP status codes¶

When we request data from a website, the server includes an HTTP status code in the response.

The most common status code is 200, which means there were no issues.

Other times, you will see a different status code, describing some sort of event or error.
- Common examples: 400 – bad request, 404 – page not found, 500 – internal server error.
- The first digit of a status describes its general "category".

See https://httpstat.us for a list of all HTTP status codes.
- It also has example sites for each status code; for example, https://httpstat.us/404 returns a 404.
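
Python's standard library knows the standard status codes, so we can look up their meanings without making any requests. The sketch below is one way to do that; the CATEGORIES dictionary and the describe function are names made up for this example.

```python
from http import HTTPStatus

# The first digit of a status code determines its general category.
CATEGORIES = {
    1: 'informational',
    2: 'success',
    3: 'redirection',
    4: 'client error',
    5: 'server error',
}

def describe(status_code):
    """Return the standard phrase and general category for a status code."""
    phrase = HTTPStatus(status_code).phrase
    category = CATEGORIES[status_code // 100]
    return f'{status_code} {phrase} ({category})'

descriptions = [describe(code) for code in [200, 400, 404, 500]]
for description in descriptions:
    print(description)
```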

In [ ]:

yt_res.status_code

In [ ]:

# ok checks if the result was successful.
yt_res.ok

Handling unsuccessful requests¶

Unsuccessful requests can be re-tried, depending on the issue.
- A good first step is to wait a little, then try again.
A common issue is that you're making too many requests to a particular server at a time – if this is the case, increase the time between each request. You can even do this programmatically, say, using time.sleep.
See the textbook for more examples.
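
Here is one possible sketch of "wait a little, then try again." The function name get_with_retries and its parameters are made up for this example, and doubling the wait after each failure is just one common choice. It accepts any zero-argument function that returns an object with ok and status_code attributes, so it can wrap a call to requests.get.

```python
import time

def get_with_retries(fetch, max_tries=3, wait=1.0):
    """Call fetch() until it returns a response whose .ok is True.

    fetch is any zero-argument function returning an object with
    .ok and .status_code attributes, e.g. lambda: requests.get(url).
    """
    for attempt in range(1, max_tries + 1):
        response = fetch()
        if response.ok:
            return response
        if attempt < max_tries:
            # Wait longer after each failure: wait, 2 * wait, 4 * wait, ...
            time.sleep(wait * 2 ** (attempt - 1))
    raise RuntimeError(
        f'Gave up after {max_tries} tries (last status: {response.status_code})'
    )
```

For instance, get_with_retries(lambda: requests.get('https://ucsd.edu')) would try up to three times, sleeping 1 second and then 2 seconds between attempts.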

Data formats¶

The data formats of the internet¶

Responses typically come in one of two formats: HTML or JSON.

The response body of a GET request is usually either JSON (when using an API) or HTML (when accessing a webpage).
The response body of a POST request is usually JSON.
XML is also a common format, but not as popular as it once was.

JSON¶

JSON stands for JavaScript Object Notation. It is a lightweight format for storing and transferring data.
It is:
- very easy for computers to read and write.
- moderately easy for programmers to read and write by hand.
- meant to be generated and parsed.
Most modern languages have an interface for working with JSON objects.
- JSON objects resemble Python dictionaries (but are not the same!).

JSON data types¶

Type	Description
String	Anything inside double quotes.
Number	Any number (no difference between ints and floats).
Boolean	`true` and `false`.
Null	JSON's empty value, denoted by `null`.
Array	Like Python lists.
Object	A collection of key-value pairs, like dictionaries. Keys must be strings, values can be anything (even other objects).

See json-schema.org for more details.
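
To see how these types map onto Python, here is a small sketch using the built-in json module. The object below is made up for illustration. Note that although JSON has a single Number type, Python's parser returns an int or a float depending on how the number is written.

```python
import json

text = '''
{
    "name": "King Triton",
    "age": 65,
    "height": 1.9,
    "is_mascot": true,
    "nickname": null,
    "hobbies": ["swimming", "ruling"],
    "home": {"city": "La Jolla"}
}
'''

parsed = json.loads(text)

# Record the Python type that each JSON value was converted to.
python_types = {key: type(value).__name__ for key, value in parsed.items()}
print(python_types)

# JSON object keys must be strings, so a non-string dict key is converted
# when writing JSON and comes back as a string.
round_trip = json.loads(json.dumps({1: 'one'}))
print(round_trip)
```

The last two lines show one way in which JSON objects differ from Python dictionaries: a dictionary can have an integer key, but a JSON object cannot.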

Example JSON object¶

See data/family.json.

In [ ]:

import json
from pathlib import Path

f = Path('data') / 'family.json'
family_tree = json.loads(f.read_text())

In [ ]:

family_tree

In [ ]:

family_tree['children'][1]['children'][0]['age']

Aside: `eval`¶

eval, which stands for "evaluate", is a function built into Python.
It takes in a string containing a Python expression and evaluates it in the current context.

In [ ]:

x = 4
eval('x + 5')

It seems like eval can do the same thing that json.loads does...

In [ ]:

eval(f.read_text())

But you should almost never use eval. The next slide demonstrates why.

`eval` gone wrong¶

Observe what happens when we use eval on a string representation of a JSON object:

In [ ]:

f_other = Path('data') / 'evil_family.json'
eval(f_other.read_text())

Oh no! Since evil_family.json, which could have been downloaded from the internet, contained malicious code, we have now lost all of our files.
This happened because eval evaluates all parts of the input string as if it were Python code.
You never need to do this – instead, use the .json() method of a response object, or use the json library.

Using the `json` module¶

Let's process the same file using the json module. Note:

json.load(f) parses JSON from a file object, f.
json.loads(s) parses JSON from a string, s (the "s" in loads stands for "string").
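
A quick sketch of the difference, using a made-up JSON string. io.StringIO stands in for an open file here, so the example doesn't depend on any file existing on disk.

```python
import io
import json

s = '{"name": "King Triton", "children": []}'

# json.loads parses JSON text that is already in a string.
from_string = json.loads(s)

# json.load reads the JSON text out of a file object first.
file_like = io.StringIO(s)
from_file = json.load(file_like)

print(from_string == from_file)
```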

In [ ]:

f_other = Path('data') / 'evil_family.json'
s = f_other.read_text()
s

In [ ]:

json.loads(s)

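Since util.err() is not a valid JSON value (there are no quotes around it, so it is not a string, and it is not a number, true, false, null, an array, or an object either), json.loads is not able to parse the file and raises an error instead of running any code.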
This "safety check" is intentional.
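
The same behavior can be reproduced without any files. In the sketch below, the string suspicious is a made-up stand-in for the contents of a file like evil_family.json.

```python
import json

# Hypothetical file contents: the "age" value is a Python function call, not valid JSON.
suspicious = '{"name": "Grandma", "age": util.err()}'

try:
    json.loads(suspicious)
    outcome = 'parsed'
except json.JSONDecodeError as e:
    # The parser rejects the text without executing any of it.
    outcome = f'rejected: {e.msg}'

print(outcome)
```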

Handling unfamiliar data¶

Never trust data from an unfamiliar site.
Never use eval on "raw" data that you didn't create!
The JSON data format needs to be parsed, not evaluated as a dictionary.
- It was designed with safety in mind!

Aside: `pd.read_json`¶

pandas also has a built-in read_json function.

In [ ]:

pd.read_json(f)

It only makes sense to use it, though, when you have a JSON file that has some sort of tabular structure. Our family tree example does not.

Question 🤔

Use ChatGPT to give you examples of inputs to pd.read_json() where the JSON reads successfully and unsuccessfully. What do you learn about pd.read_json()?

APIs and scraping¶

Programmatic requests¶

We learned how to use the Python requests package to exchange data via HTTP.
- GET requests are used to request data from a server.
- POST requests are used to send data to a server.
There are two ways of collecting data through a request:
- By using a published API (application programming interface).
- By scraping a webpage to collect its HTML source code.

APIs¶

An application programming interface (API) is a service that makes data directly available to the user in a convenient fashion.

Advantages:

The data are usually clean, up-to-date, and ready to use.
The presence of an API signals that the data provider is okay with you using their data.
The data provider can plan and regulate data usage.
- Some APIs require you to create an API "key", which is like an account for using the API.
- APIs can also give you access to data that isn't publicly available on a webpage.

Big disadvantage: APIs don't always exist for the data you want!

API terminology¶

A URL, or uniform resource locator, describes the location of a website or resource.
An API endpoint is a URL of the data source that the user wants to make requests to.
For example, on the Reddit API:
- the /comments endpoint retrieves information about comments.
- the /hot endpoint retrieves data about posts labeled "hot" right now.
- To access these endpoints, you add the endpoint name to the base URL of the API.
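
As a minimal sketch of adding an endpoint name to a base URL: the base URL below is a placeholder rather than any real API's address, and endpoint_url is a helper made up for this example.

```python
def endpoint_url(base_url, endpoint):
    """Join an API's base URL and an endpoint name with exactly one slash."""
    return f"{base_url.rstrip('/')}/{endpoint.lstrip('/')}"

# Hypothetical base URL, for illustration only.
BASE_URL = 'https://api.example.com/v1'

comments_url = endpoint_url(BASE_URL, '/comments')
hot_url = endpoint_url(BASE_URL + '/', 'hot')

print(comments_url)
print(hot_url)
```

Stripping the slashes on both sides avoids accidentally producing URLs with a doubled or missing slash.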

API requests¶

API requests are just GET/POST requests to a specially maintained URL.
Let's test out the Pokémon API.

First, let's make a GET request for 'squirtle'. To do this, we need to make a request to the correct URL.

In [ ]:

def create_url(pokemon):
    return f'https://pokeapi.co/api/v2/pokemon/{pokemon}'

create_url('squirtle')

In [ ]:

r = requests.get(create_url('squirtle'))
r

Remember, the 200 status code is good! Let's take a look at the content:

In [ ]:

r.content[:1000]

Looks like JSON. We can extract the JSON from this request with the json method (or by passing r.text to json.loads).

In [ ]:

rr = r.json()
rr.keys()

In [ ]:

rr['weight']

In [ ]:

rr['abilities'][1]['ability']['name']

Let's try a GET request for 'billy'.

In [ ]:

r = requests.get(create_url('billy'))
r

We receive a 404 error, since there is no Pokémon named 'billy'!

Scraping¶

Scraping is the act of programmatically "browsing" the web, downloading the source code (HTML) of pages that you're interested in extracting data from.

Big advantage: You can always do it! For example, Google scrapes webpages in order to make them searchable.

Disadvantages:

It is often difficult to parse and clean scraped data.
- Source code often includes a lot of content unrelated to the data you're trying to find (e.g. formatting, advertisements, other text).
Websites can change often, so scraping code can get outdated quickly.
Websites may not want you to scrape their data!
In general, we prefer APIs, but scraping is a useful skill to learn.
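
To get a feel for why parsing scraped data takes work, here is a small sketch that pulls two names out of a made-up page. It uses the standard library's html.parser module rather than a dedicated scraping library, and the page contents and the NameParser class are invented for this example.

```python
from html.parser import HTMLParser

# Made-up page source: the data we want (two names) is surrounded by content we don't.
page = """
<html><body>
  <div class="ad">Buy one, get one free!</div>
  <div class="person"><h4>King Triton</h4><p>Lecturer</p></div>
  <div class="person"><h4>Sun God</h4><p>Professor</p></div>
</body></html>
"""

class NameParser(HTMLParser):
    """Collects the text inside every <h4> tag."""

    def __init__(self):
        super().__init__()
        self.in_h4 = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == 'h4':
            self.in_h4 = True

    def handle_endtag(self, tag):
        if tag == 'h4':
            self.in_h4 = False

    def handle_data(self, data):
        # Only keep text that appears while we're inside an <h4> tag.
        if self.in_h4:
            self.names.append(data.strip())

parser = NameParser()
parser.feed(page)
names = parser.names
print(names)
```

If the site later switched from h4 tags to some other tag for names, this code would silently return an empty list, which is the "scraping code can get outdated quickly" problem in miniature.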

Example: Scraping the HDSI faculty page¶

To fully understand how to scrape, we need to understand how HTML documents are structured and how to extract information out of them.

But as a preview of what's to come next week, let's start by making a request to the HDSI Faculty page, https://datascience.ucsd.edu/faculty.

In [ ]:

# Sometimes, the requests library gets weirdly strict about the HDSI webpage,
# so we'll skip its security checks using verify=False.
fac_response = requests.get('https://datascience.ucsd.edu/faculty/', verify=False)
fac_response

The response is a long HTML document.

In [ ]:

len(fac_response.text)

In [ ]:

print(fac_response.text[:1000])

Question 🤔

Try asking ChatGPT to get the faculty names from this webpage by pasting the HTML of fac_response into the prompt. What happens? Can you figure out a way to get around this?

To parse HTML, we'll use the BeautifulSoup library.

In [ ]:

from bs4 import BeautifulSoup
# Naming the parser explicitly avoids a warning about BeautifulSoup guessing one.
soup = BeautifulSoup(fac_response.text, 'html.parser')

Now, soup is a representation of the faculty page's HTML code that Python knows how to extract information from.

In [ ]:

# Magic that we'll learn how to create together next Tuesday.
divs = soup.find_all('div', class_='vc_grid-item')
names = [div.find('h4').text for div in divs]
titles = [div.find(class_='pendari_people_title').text for div in divs]

faculty = pd.DataFrame({
    'name': names, 
    'title': titles, 
})
faculty.head()

Now we have a DataFrame!

In [ ]:

faculty[faculty['title'].str.contains('Lecturer') | faculty['title'].str.contains('Teaching')]

What if we want to get faculty members' pictures?

In [ ]:

from IPython.display import Image, display

def show_picture(name):
    idx = faculty[faculty['name'].str.lower().str.contains(name.lower())].index[0]
    display(Image(url=divs[idx].find('img')['src'], width=200, height=200))

In [ ]:

show_picture('sam')
