Lecture 14 – HTTP Basics

DSC 80, Winter 2023

📣 Announcements

Agenda

Recap: Imputation

Example: Heights 🧍📏

Mean imputation

Suppose the 'child' column has missing values.
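As a minimal sketch of single mean imputation, the cell below fills every missing 'child' value with the mean of the observed values. The small `heights` DataFrame is hypothetical toy data, not the lecture's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (assumption): the lecture's heights dataset is not reproduced here.
heights = pd.DataFrame({
    'gender': ['male', 'male', 'female', 'female', 'male'],
    'child': [70.0, np.nan, 64.0, np.nan, 72.0],
})

# The mean of the observed values; missing values are skipped by default.
observed_mean = heights['child'].mean()

# Replace every missing value with that single number.
mean_imputed = heights['child'].fillna(observed_mean)
```

The imputed column keeps the observed mean, but every filled-in value is identical, so the spread of the column shrinks.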

Conditional mean imputation of MAR data

The pink distribution (conditional mean imputation) does a better job of approximating the turquoise distribution (the full dataset with no missing values) than the purple distribution (single mean imputation).
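One way to sketch conditional mean imputation is to fill each missing 'child' value with the mean of its own 'gender' group. The data and the helper name `mean_impute` below are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (assumption), with one missing height per group.
heights = pd.DataFrame({
    'gender': ['male', 'male', 'male', 'female', 'female', 'female'],
    'child': [70.0, np.nan, 72.0, 64.0, 66.0, np.nan],
})

def mean_impute(ser):
    # Fill missing values with the mean of the observed values in this group.
    return ser.fillna(ser.mean())

# transform applies mean_impute to each group and keeps the original row order.
cond_mean_imputed = heights.groupby('gender')['child'].transform(mean_impute)
```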

Probabilistic imputation

Suppose the 'child' column has missing values.

Conditional probabilistic imputation of MAR data

Let's use transform to call create_imputed separately on each 'gender'.
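The definition of create_imputed is not shown on this slide, so the version below is an assumption: it fills each missing value with a random draw from the observed values of the same column. The toy data and the seed are also hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility only

# Hypothetical toy data (assumption), with one missing height per group.
heights = pd.DataFrame({
    'gender': ['male'] * 4 + ['female'] * 4,
    'child': [70.0, np.nan, 72.0, 68.0, 64.0, 66.0, np.nan, 62.0],
})

def create_imputed(col):
    # Assumed behavior: sample replacements from the observed values of col.
    col = col.copy()
    missing = col.isna()
    fill_values = rng.choice(col.dropna().to_numpy(), size=int(missing.sum()))
    col.loc[missing] = fill_values
    return col

# Calling create_imputed once per 'gender' means each missing value is
# drawn only from observed heights within the same group.
prob_imputed = heights.groupby('gender')['child'].transform(create_imputed)
```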

The green distribution (conditional probabilistic imputation) does the best job of approximating the turquoise distribution (the full dataset with no missing values)!

Remember that the graph above is interactive – you can hide/show lines by clicking them in the legend.

Randomness

Multiple imputation of MCAR data

Steps:

  1. Start with observed and incomplete data.
  2. Create $m$ imputed versions of the data through a probabilistic procedure.
    • The imputed datasets are identical for the observed data entries.
    • They differ in the imputed values.
    • The differences reflect our uncertainty about what value to impute.
  3. Then, compute parameter estimates on each imputed dataset.
    • For instance, the mean, standard deviation, median, etc.
  4. Finally, pool the $m$ parameter estimates into one estimate.
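The steps above can be sketched as follows. This is an illustration under stated assumptions, not the lecture's implementation: the data is a hypothetical toy column, `impute_once` is a hypothetical helper that samples from the observed values, and pooling is done by averaging the $m$ estimates, which is one common choice when the parameter is a mean.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(80)  # arbitrary seed

# Step 1: observed but incomplete data (hypothetical values).
child = pd.Series([70.0, np.nan, 64.0, 68.0, np.nan, 72.0, 66.0])

def impute_once(col, rng):
    # Probabilistic imputation: fill missing entries with random draws
    # from the observed entries. Observed entries are left unchanged.
    col = col.copy()
    missing = col.isna()
    col.loc[missing] = rng.choice(col.dropna().to_numpy(), size=int(missing.sum()))
    return col

# Steps 2 and 3: create m imputed versions and compute the mean of each.
m = 100
estimates = np.array([impute_once(child, rng).mean() for _ in range(m)])

# Step 4: pool the m estimates into one (here, by averaging them).
pooled_mean = estimates.mean()
```

The spread of `estimates` reflects how much the parameter estimate varies from one imputed dataset to the next.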

Multiple imputation of MCAR data

Let's try this procedure out on the heights_mcar dataset.

Each time we run the following cell, it generates a new imputed version of the 'child' column.

Let's run the above procedure 100 times.

Let's plot some of the imputed columns on the previous slide.

Let's look at the distribution of means across the imputed columns.

Summary of imputation techniques

See the end of Lecture 13 for a detailed summary of all imputation techniques that we've seen so far.

Introduction to HTTP

The material we're covering now is not on the Midterm Exam.

Data sources

Collecting data from the internet

HTTP

UCSD was a node in ARPANET, the predecessor to the modern internet.

The request-response model

HTTP follows the request-response model.

Request methods

The request methods you will use most often are GET and POST; see Mozilla's web docs for a detailed list of request methods.

Example GET request

Below is an example HTTP GET request made by a browser when accessing datascience.ucsd.edu.

GET / HTTP/1.1
Host: datascience.ucsd.edu
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36
Connection: keep-alive
Accept-Language: en-US,en;q=0.9

Example GET response

The response below was generated by executing the request on the previous slide.

HTTP/1.1 200 OK
Date: Fri, 29 Apr 2022 02:54:41 GMT
Server: Apache
Link: <https://datascience.ucsd.edu/wp-json/>; rel="https://api.w.org/"
Link: <https://datascience.ucsd.edu/wp-json/wp/v2/pages/2427>; rel="alternate"; type="application/json"
Link: <https://datascience.ucsd.edu/>; rel=shortlink
Content-Type: text/html; charset=UTF-8

<!DOCTYPE html>
<html lang="en-US">
<head>
    <meta charset="UTF-8">
    <link rel="profile" href="https://gmpg.org/xfn/11">
    <style media="all">img.wp-smiley,img.emoji{display:inline !important;border:none
...

Consequences of the request-response model

Example: istheshipstuck.com

Read "Inside a viral website", an account of what it's like to run a site that gained 50 million+ views in 5 days.

Making HTTP requests

Making HTTP requests

We'll see two ways to make HTTP requests outside of a browser: the command-line tool curl and the Python package requests.

Making HTTP requests using curl

curl is a command-line tool that sends HTTP requests, like a browser.

  1. The client, curl, sends an HTTP request.
  2. The request contains a method (e.g. GET or POST).
  3. The HTTP server responds with:
    • a status line, indicating if things went well,
    • response headers, and
    • (usually) a response body, containing the requested data.
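To make the three parts of a response concrete, here is a sketch that splits a raw HTTP/1.1 response string into its status line, headers, and body. The string is a shortened, hand-written stand-in for a real response, and real clients do this parsing for you.

```python
# A shortened, hand-written HTTP/1.1 response (assumption: not captured from a server).
raw_response = (
    "HTTP/1.1 200 OK\r\n"
    "Server: Apache\r\n"
    "Content-Type: text/html; charset=UTF-8\r\n"
    "\r\n"
    "<!DOCTYPE html>\n<html lang=\"en-US\">..."
)

# A blank line separates the status line and headers from the body.
head, _, body = raw_response.partition("\r\n\r\n")

# The first line of the head is the status line; the rest are headers.
status_line, *header_lines = head.split("\r\n")
version, status_code, reason = status_line.split(" ", 2)

# Note: a dict keeps only the last value if a header name repeats.
headers = dict(line.split(": ", 1) for line in header_lines)
```

Here `status_code` is the string "200" and `reason` is "OK", which is how the status line indicates that things went well.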

Example: GET requests via curl

# `-v` is short for verbose
curl -v https://httpbin.org/html

Queries in a GET request

https://www.google.com/search?q=ucsd+dsc+80+hard&client=safari
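Everything after the ? in the URL above is the query string, a sequence of key=value pairs separated by &. As a sketch, Python's built-in urllib.parse module can pull the pieces apart:

```python
from urllib.parse import urlsplit, parse_qs

url = 'https://www.google.com/search?q=ucsd+dsc+80+hard&client=safari'

# Split the URL into its components (scheme, host, path, query, ...).
parts = urlsplit(url)

# Decode the query string into a dictionary; '+' is decoded as a space.
# Each value is a list, since a key may appear more than once.
params = parse_qs(parts.query)
# params == {'q': ['ucsd dsc 80 hard'], 'client': ['safari']}
```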

Making HTTP requests using requests

Example: GET requests via requests

To access the source code of the UCSD home page, all we need to run is the following:

requests.get('https://ucsd.edu').text

If we instead store the result in a variable, e.g. res = requests.get('https://ucsd.edu'), then res is a Response object.

The text attribute of res is a string containing the body of the response.

Example: POST requests via requests

The following call to requests.post makes a POST request to https://httpbin.org/post, with a 'name' parameter of 'King Triton'.
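To see what such a POST request would contain without sending anything over the network, we can build it and inspect it. This sketch assumes the 'name' parameter is sent as form data via the data argument; the slide itself does not show which argument is used.

```python
import requests

# Build (but do not send) a POST request with one form field.
# Assumption: the parameter is passed as form data via `data=`.
req = requests.Request(
    'POST',
    'https://httpbin.org/post',
    data={'name': 'King Triton'},
)
prepared = req.prepare()

# The form data ends up in the request body, not in the URL.
method = prepared.method                         # 'POST'
body = prepared.body                             # 'name=King+Triton'
content_type = prepared.headers['Content-Type']  # form encoding

# To actually send the request, one could instead run:
# res = requests.post('https://httpbin.org/post', data={'name': 'King Triton'})
```

Unlike the query in a GET request, the data in a POST request travels in the request body.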

What happens when we try to make a POST request somewhere where we're unable to?

yt_res.text is a string containing HTML – we can render this in-line using IPython.display.HTML.

HTTP status codes
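The first digit of a status code tells us its general category. As a rough sketch, the hypothetical helper `describe` below uses Python's built-in http module to look up a code's standard phrase and category.

```python
from http import HTTPStatus

# Status code categories, keyed by the first digit of the code.
CATEGORIES = {
    1: 'informational',
    2: 'success',
    3: 'redirection',
    4: 'client error',
    5: 'server error',
}

def describe(code):
    # Hypothetical helper: returns (standard phrase, category) for a code.
    return HTTPStatus(code).phrase, CATEGORIES[code // 100]

ok = describe(200)           # ('OK', 'success')
not_found = describe(404)    # ('Not Found', 'client error')
not_allowed = describe(405)  # ('Method Not Allowed', 'client error')
```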

Successful requests ✅

Data formats

The data formats of the internet

Responses typically come in one of two formats: HTML or JSON.

JSON

JSON data types

| Type | Description |
| --- | --- |
| String | Anything inside double quotes. |
| Number | Any number (no difference between ints and floats). |
| Boolean | true and false. |
| Null | JSON's empty value, denoted by null. |
| Array | Like Python lists. |
| Object | A collection of key-value pairs, like dictionaries. Keys must be strings, values can be anything (even other objects). |

See json-schema.org for more details.
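As a quick illustration of how these types map to Python, the sketch below parses a small JSON string with the json module. The string's contents are made up for this example.

```python
import json

# A made-up JSON object that uses every JSON data type.
json_text = '''
{
    "name": "King Triton",
    "age": 65,
    "height": 6.5,
    "is_king": true,
    "nickname": null,
    "daughters": ["Ariel", "Attina"],
    "home": {"city": "Atlantica"}
}
'''

parsed = json.loads(json_text)

# Each JSON type becomes a Python type:
# string -> str, number -> int or float, boolean -> bool,
# null -> None, array -> list, object -> dict.
python_types = {key: type(value).__name__ for key, value in parsed.items()}
```

JSON itself has a single number type; it is Python's json module that chooses int or float, depending on how the number is written.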

Example JSON object

See data/family.json.

Aside: eval

eval gone wrong

Observe what happens when we use eval on a string representation of a JSON object:
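The sketch below shows two ways eval can go wrong, using made-up strings. First, valid JSON is not always valid Python: true and null are not Python names. Second, eval runs whatever code is in the string, which is risky when the string comes from the internet. The "malicious" string here is deliberately harmless.

```python
import json

# Valid JSON, but `true` and `null` are not defined in Python.
json_text = '{"name": "King Triton", "is_king": true, "nickname": null}'
try:
    eval(json_text)
    eval_error = None
except NameError as err:
    eval_error = err  # name 'true' is not defined

# Not JSON at all: this "data" contains a function call, and eval runs it.
# (Harmless here, since it only looks up the current working directory.)
sneaky_text = '{"cwd": __import__("os").getcwd()}'
sneaky_result = eval(sneaky_text)

# json.loads refuses to run code: it raises an error instead.
try:
    json.loads(sneaky_text)
    json_error = None
except json.JSONDecodeError as err:
    json_error = err
```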

Using the json module

Let's process the same file using the json module. Recall that json.load(f) reads JSON from a file object f, while json.loads(s) reads JSON from a string s.
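As a sketch of the json module's two loading functions, the cell below parses the same JSON text once from a string and once from a file-like object. The contents of data/family.json are not shown here, so the text is a made-up stand-in, and io.StringIO plays the role of an open file.

```python
import io
import json

# Made-up stand-in (assumption) for the contents of a JSON file.
json_text = '{"name": "Grandma", "children": [{"name": "Dad", "children": []}]}'

# json.loads ("load string") parses JSON from a string.
from_string = json.loads(json_text)

# json.load parses JSON from a file object; with a real file, this
# would be `with open(path) as f:`.
with io.StringIO(json_text) as f:
    from_file = json.load(f)

# Nested objects become nested dictionaries and lists.
first_child = from_file['children'][0]['name']
```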

Handling unfamiliar data

Summary, next time

Summary

Next time