import pandas as pd
import numpy as np
import os
import util
import plotly.express as px
import plotly.figure_factory as ff
pd.options.plotting.backend = 'plotly'
heights = pd.read_csv(os.path.join('data', 'midparent.csv'))
heights = (
heights
.rename(columns={'childHeight': 'child', 'childNum': 'number'})
.drop('midparentHeight', axis=1)
)
heights.head()
family | father | mother | children | number | gender | child | |
---|---|---|---|---|---|---|---|
0 | 1 | 78.5 | 67.0 | 4 | 1 | male | 73.2 |
1 | 1 | 78.5 | 67.0 | 4 | 2 | female | 69.2 |
2 | 1 | 78.5 | 67.0 | 4 | 3 | female | 69.0 |
3 | 1 | 78.5 | 67.0 | 4 | 4 | female | 69.0 |
4 | 2 | 75.5 | 66.5 | 4 | 1 | male | 73.5 |
np.random.seed(42) # So that we get the same results each time (for lecture).
heights_mcar = util.make_mcar(heights, 'child', pct=0.5)
heights_mar = util.make_mar_on_cat(heights, 'child', 'gender', pct=0.5)
Suppose the 'child'
column has missing values.
'child'
is MCAR, then fill in each of the missing values using the mean of the observed values.'child'
is MAR dependent on a categorical column, then fill in each of the missing values using the mean of the observed values in each category. For instance, if 'child'
is MAR dependent on 'gender'
, we can fill in:'child'
heights with the observed mean for female children, and'child'
heights with the observed mean for male children.'child'
is MAR dependent on a numerical column, then bin the numerical column to make it categorical, then follow the procedure above. See Lab 5, Question 5!def mean_impute(ser):
return ser.fillna(ser.mean())
heights_mar_cond = heights_mar.groupby('gender')['child'].transform(mean_impute).to_frame() # Conditional mean imputation (good, since MAR).
heights_mar_mfilled = heights_mar.fillna(heights_mar['child'].mean()) # Single mean imputation (bad, since MAR).
df_map = {'Original': heights, 'MAR, Unfilled': heights_mar,
'MAR, Mean Imputed': heights_mar_mfilled, 'MAR, Conditional Mean Imputed': heights_mar_cond}
util.multiple_kdes(df_map)
The pink distribution (conditional mean imputation) does a better job of approximating the turquoise distribution (the full dataset with no missing values) than the purple distribution (single mean imputation).
Suppose the 'child'
column has missing values.
'child'
is MCAR, then fill in each of the missing values with randomly selected observed 'child'
heights.'child'
values, pick 5 of the not-missing 'child'
values.'child'
is MAR dependent on a categorical column, sample from the observed values separately for each category.def create_imputed(col):
col = col.copy()
# Find the number of missing child heights for that gender.
num_null = col.isna().sum()
# Sample num_null observed child heights for that gender.
fill_values = np.random.choice(col.dropna(), num_null)
# Fill in missing values and return ser.
col[col.isna()] = fill_values
return col
Let's use transform
to call create_imputed
separately on each 'gender'
.
heights_mar_pfilled = heights_mar.copy()
heights_mar_pfilled['child'] = heights_mar.groupby('gender')['child'].transform(create_imputed)
heights_mar_pfilled['child'].head()
0 73.2 1 69.2 2 62.0 3 62.5 4 73.5 Name: child, dtype: float64
df_map['MAR, Conditionally Probabilistically Imputed'] = heights_mar_pfilled
util.multiple_kdes(df_map)
The green distribution (conditional probabilistic imputation) does the best job of approximating the turquoise distribution (the full dataset with no missing values)!
Remember that the graph above is interactive – you can hide/show lines by clicking them in the legend.
Steps:
Let's try this procedure out on the heights_mcar
dataset.
heights_mcar.head()
family | father | mother | children | number | gender | child | |
---|---|---|---|---|---|---|---|
0 | 1 | 78.5 | 67.0 | 4 | 1 | male | 73.2 |
1 | 1 | 78.5 | 67.0 | 4 | 2 | female | 69.2 |
2 | 1 | 78.5 | 67.0 | 4 | 3 | female | NaN |
3 | 1 | 78.5 | 67.0 | 4 | 4 | female | NaN |
4 | 2 | 75.5 | 66.5 | 4 | 1 | male | 73.5 |
Each time we run the following cell, it generates a new imputed version of the 'child'
column.
create_imputed(heights_mcar['child']).head()
0 73.2 1 69.2 2 60.0 3 63.0 4 73.5 Name: child, dtype: float64
Let's run the above procedure 100 times.
mult_imp = pd.concat([create_imputed(heights_mcar['child']).rename(k) for k in range(100)], axis=1)
mult_imp.head()
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 | ... | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 |
1 | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 | ... | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 |
2 | 62.0 | 68.0 | 65.7 | 65.5 | 70.0 | 66.0 | 69.0 | 70.0 | 70.0 | 65.5 | ... | 64.5 | 67.5 | 75.0 | 67.5 | 73.0 | 71.0 | 71.0 | 70.5 | 69.5 | 65.0 |
3 | 70.0 | 62.5 | 63.0 | 67.0 | 69.0 | 72.0 | 64.5 | 69.7 | 68.5 | 70.5 | ... | 64.0 | 65.0 | 63.0 | 65.0 | 62.5 | 67.5 | 61.7 | 63.0 | 62.0 | 65.5 |
4 | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 | ... | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 |
5 rows × 100 columns
Let's plot some of the imputed columns on the previous slide.
# Random sample of 15 imputed columns.
mult_imp_sample = mult_imp.sample(15, axis=1)
fig = ff.create_distplot(mult_imp_sample.to_numpy().T, list(mult_imp_sample.columns), show_hist=False, show_rug=False)
fig.update_xaxes(title='child')
Let's look at the distribution of means across the imputed columns.
px.histogram(pd.DataFrame(mult_imp.mean()), nbins=15, histnorm='probability',
title='Distribution of Imputed Sample Means')
See the end of Lecture 13 for a detailed summary of all imputation techniques that we've seen so far.
The material we're covering now is not on the Midterm Exam.
.csv
files.UCSD was a node in ARPANET, the predecessor to the modern internet (source).
The request methods you will use most often are GET
and POST
; see Mozilla's web docs for a detailed list of request methods.
GET
is used to request data from a specified resource.POST
is used to send data to the server. GET
request¶Below is an example GET
HTTP request made by a browser when accessing datascience.ucsd.edu.
GET / HTTP/1.1
Host: datascience.ucsd.edu
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36
Connection: keep-alive
Accept-Language: en-US,en;q=0.9
GET / HTTP/1.1
) is called the "request line", and the lines afterwards are called "header fields". Header fields contain metadata. GET
response¶The response below was generated by executing the request on the previous slide.
HTTP/1.1 200 OK
Date: Fri, 29 Apr 2022 02:54:41 GMT
Server: Apache
Link: <https://datascience.ucsd.edu/wp-json/>; rel="https://api.w.org/"
Link: <https://datascience.ucsd.edu/wp-json/wp/v2/pages/2427>; rel="alternate"; type="application/json"
Link: <https://datascience.ucsd.edu/>; rel=shortlink
Content-Type: text/html; charset=UTF-8
<!DOCTYPE html>
<html lang="en-US">
<head>
<meta charset="UTF-8">
<link rel="profile" href="https://gmpg.org/xfn/11">
<style media="all">img.wp-smiley,img.emoji{display:inline !important;border:none
...
Read Inside a viral website, an account of what it's like to run a site that gained 50 million+ views in 5 days.
We'll see two ways to make HTTP requests outside of a browser:
curl
.requests
package.curl
¶curl
is a command-line tool that sends HTTP requests, like a browser.
curl
, sends a HTTP request. GET
or POST
).GET
requests via curl
¶curl
issues a GET
request.# `-v` is short for verbose
curl -v https://httpbin.org/html
!
before them. Let's try that here.# Compare the output to what you see when you go to https://httpbin.org/html in your browser!
!curl -v https://httpbin.org/html
* Trying 52.1.93.201:443... * Connected to httpbin.org (52.1.93.201) port 443 (#0) * ALPN, offering h2 * ALPN, offering http/1.1 * successfully set certificate verify locations: * CAfile: /Users/surajrampure/opt/anaconda3/ssl/cacert.pem * CApath: none * TLSv1.3 (OUT), TLS handshake, Client hello (1): * TLSv1.3 (IN), TLS handshake, Server hello (2): * TLSv1.2 (IN), TLS handshake, Certificate (11): * TLSv1.2 (IN), TLS handshake, Server key exchange (12): * TLSv1.2 (IN), TLS handshake, Server finished (14): * TLSv1.2 (OUT), TLS handshake, Client key exchange (16): * TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1): * TLSv1.2 (OUT), TLS handshake, Finished (20): * TLSv1.2 (IN), TLS handshake, Finished (20): * SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256 * ALPN, server accepted to use h2 * Server certificate: * subject: CN=httpbin.org * start date: Oct 21 00:00:00 2022 GMT * expire date: Nov 19 23:59:59 2023 GMT * subjectAltName: host "httpbin.org" matched cert's "httpbin.org" * issuer: C=US; O=Amazon; OU=Server CA 1B; CN=Amazon * SSL certificate verify ok. * Using HTTP2, server supports multiplexing * Connection state changed (HTTP/2 confirmed) * Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0 * Using Stream ID: 1 (easy handle 0x7fe989812c00) > GET /html HTTP/2 > Host: httpbin.org > user-agent: curl/7.78.0 > accept: */* > * Connection state changed (MAX_CONCURRENT_STREAMS == 128)! < HTTP/2 200 < date: Fri, 10 Feb 2023 07:06:06 GMT < content-type: text/html; charset=utf-8 < content-length: 3741 < server: gunicorn/19.9.0 < access-control-allow-origin: * < access-control-allow-credentials: true < <!DOCTYPE html> <html> <head> </head> <body> <h1>Herman Melville - Moby-Dick</h1> <div> <p> Availing himself of the mild, summer-cool weather that now reigned in these latitudes, and in preparation for the peculiarly active pursuits shortly to be anticipated, Perth, the begrimed, blistered old blacksmith, had not removed his portable forge to the hold again, after concluding his contributory work for Ahab's leg, but still retained it on deck, fast lashed to ringbolts by the foremast; being now almost incessantly invoked by the headsmen, and harpooneers, and bowsmen to do some little job for them; altering, or repairing, or new shaping their various weapons and boat furniture. Often he would be surrounded by an eager circle, all waiting to be served; holding boat-spades, pike-heads, harpoons, and lances, and jealously watching his every sooty movement, as he toiled. Nevertheless, this old man's was a patient hammer wielded by a patient arm. No murmur, no impatience, no petulance did come from him. Silent, slow, and solemn; bowing over still further his chronically broken back, he toiled away, as if toil were life itself, and the heavy beating of his hammer the heavy beating of his heart. And so it was.—Most miserable! A peculiar walk in this old man, a certain slight but painful appearing yawing in his gait, had at an early period of the voyage excited the curiosity of the mariners. And to the importunity of their persisted questionings he had finally given in; and so it came to pass that every one now knew the shameful story of his wretched fate. Belated, and not innocently, one bitter winter's midnight, on the road running between two country towns, the blacksmith half-stupidly felt the deadly numbness stealing over him, and sought refuge in a leaning, dilapidated barn. The issue was, the loss of the extremities of both feet. Out of this revelation, part by part, at last came out the four acts of the gladness, and the one long, and as yet uncatastrophied fifth act of the grief of his life's drama. He was an old man, who, at the age of nearly sixty, had postponedly encountered that thing in sorrow's technicals called ruin. He had been an artisan of famed excellence, and with plenty to do; owned a house and garden; embraced a youthful, daughter-like, loving wife, and three blithe, ruddy children; every Sunday went to a cheerful-looking church, planted in a grove. But one night, under cover of darkness, and further concealed in a most cunning disguisement, a desperate burglar slid into his happy home, and robbed them all of everything. And darker yet to tell, the blacksmith himself did ignorantly conduct this burglar into his family's heart. It was the Bottle Conjuror! Upon the opening of that fatal cork, forth flew the fiend, and shrivelled up his home. Now, for prudent, most wise, and economic reasons, the blacksmith's shop was in the basement of his dwelling, but with a separate entrance to it; so that always had the young and loving healthy wife listened with no unhappy nervousness, but with vigorous pleasure, to the stout ringing of her young-armed old husband's hammer; whose reverberations, muffled by passing through the floors and walls, came up to her, not unsweetly, in her nursery; and so, to stout Labor's iron lullaby, the blacksmith's infants were rocked to slumber. Oh, woe on woe! Oh, Death, why canst thou not sometimes be timely? Hadst thou taken this old blacksmith to thyself ere his full ruin came upon him, then had the young widow had a delicious grief, and her orphans a truly venerable, legendary sire to dream of in their after years; and all of them a care-killing competency. </p> </div> </body> * Connection #0 to host httpbin.org left intact </html>
GET
request¶?
begins a query.requests
¶requests
is a Python module that allows you to use Python to interact with the internet! urllib
), but requests
is arguably the easiest to use.import requests
GET
requests via requests
¶To access the source code of the UCSD home page, all we need to run is the following:
requests.get('https://ucsd.edu').text
res = requests.get('https://ucsd.edu')
res
is now a Response
object.
res
<Response [200]>
The text
attribute of res
is a string that containing the entire response.
type(res.text)
str
len(res.text)
42422
print(res.text[:1000])
<!DOCTYPE html> <html lang="en"> <head> <meta charset="utf-8"/> <meta content="IE=edge" http-equiv="X-UA-Compatible"/> <meta content="width=device-width, initial-scale=1" name="viewport"/> <title>University of California San Diego</title> <meta content="University of California, San Diego" name="ORGANIZATION"/> <meta content="index,follow,noarchive" name="robots"/> <meta content="UCSD" name="SITE"/> <meta content="University of California San Diego" name="PAGETITLE"/> <meta content="The University California San Diego is one of the world's leading public research universities, located in beautiful La Jolla, California" name="DESCRIPTION"/> <link href="favicon.ico" rel="icon"/> <!-- Site-specific CSS files --> <link href="https://www.ucsd.edu/_resources/css/vendor/brix_sans.css" rel="stylesheet" type="text/css"/> <link href="https://www.ucsd.edu/_resources/css/vendor/refrigerator_deluxe.css" rel="stylesheet"
POST
requests via requests
¶The following call to requests.post
makes a post request to https://httpbin.org/post, with a 'name'
parameter of 'King Triton'
.
post_res = requests.post('https://httpbin.org/post',
data={'name': 'King Triton'})
post_res
<Response [200]>
post_res.text
'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "name": "King Triton"\n }, \n "headers": {\n "Accept": "*/*", \n "Accept-Encoding": "gzip, deflate, br", \n "Content-Length": "16", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "httpbin.org", \n "User-Agent": "python-requests/2.26.0", \n "X-Amzn-Trace-Id": "Root=1-63e5ecdf-1b936f781221b1444ab4a41f"\n }, \n "json": null, \n "origin": "70.95.172.151", \n "url": "https://httpbin.org/post"\n}\n'
# More on this shortly!
post_res.json()
{'args': {}, 'data': '', 'files': {}, 'form': {'name': 'King Triton'}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate, br', 'Content-Length': '16', 'Content-Type': 'application/x-www-form-urlencoded', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.26.0', 'X-Amzn-Trace-Id': 'Root=1-63e5ecdf-1b936f781221b1444ab4a41f'}, 'json': None, 'origin': '70.95.172.151', 'url': 'https://httpbin.org/post'}
What happens when we try and make a POST
request somewhere where we're unable to?
yt_res = requests.post('https://youtube.com',
data={'name': 'King Triton'})
yt_res
<Response [400]>
yt_res.text
is a string containing HTML – we can render this in-line using IPython.display.HTML
.
from IPython.display import HTML
200
, which means there were no issues. 400
– bad request, 404
– page not found, 500
– internal server error.404
.yt_res.status_code
400
ok
attribute, which returns a bool.yt_res.status_code, yt_res.ok
(400, False)
post_res.status_code, post_res.ok
(200, True)
time.sleep
).Responses typically come in one of two formats: HTML or JSON.
GET
request is usually either JSON (when using an API) or HTML (when accessing a webpage).POST
request is usually JSON.Type | Description |
---|---|
String | Anything inside double quotes. |
Number | Any number (no difference between ints and floats). |
Boolean | true and false . |
Null | JSON's empty value, denoted by null . |
Array | Like Python lists. |
Object | A collection of key-value pairs, like dictionaries. Keys must be strings, values can be anything (even other objects). |
See json-schema.org for more details.
import json
f = open(os.path.join('data', 'family.json'), 'r')
family_tree = json.load(f)
family_tree
{'name': 'Grandma', 'age': 94, 'children': [{'name': 'Dad', 'age': 60, 'children': [{'name': 'Me', 'age': 24}, {'name': 'Brother', 'age': 22}]}, {'name': 'My Aunt', 'children': [{'name': 'Cousin 1', 'age': 34}, {'name': 'Cousin 2', 'age': 36, 'children': [{'name': 'Cousin 2 Jr.', 'age': 2}]}]}]}
family_tree['children'][0]['children'][0]['age']
24
eval
¶eval
, which stands for "evaluate", is a function built into Python.x = 4
eval('x + 5')
9
eval
can do the same thing that json.load
does...f = open(os.path.join('data', 'family.json'), 'r')
eval(f.read())
{'name': 'Grandma', 'age': 94, 'children': [{'name': 'Dad', 'age': 60, 'children': [{'name': 'Me', 'age': 24}, {'name': 'Brother', 'age': 22}]}, {'name': 'My Aunt', 'children': [{'name': 'Cousin 1', 'age': 34}, {'name': 'Cousin 2', 'age': 36, 'children': [{'name': 'Cousin 2 Jr.', 'age': 2}]}]}]}
eval
. The next slide demonstrates why.eval
gone wrong¶Observe what happens when we use eval
on a string representation of a JSON object:
f_other = open(os.path.join('data', 'evil_family.json'))
eval(f_other.read())
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) /var/folders/pd/w73mdrsj2836_7gp0brr2q7r0000gn/T/ipykernel_33104/3392341705.py in <module> 1 f_other = open(os.path.join('data', 'evil_family.json')) ----> 2 eval(f_other.read()) <string> in <module> ~/Desktop/80/wi23/private/lectures/schedules/mwf/lec14/src/util.py in err() 171 # For JSON evaluation example 172 def err(): --> 173 raise ValueError('i just deleted all your files lol 😂') ValueError: i just deleted all your files lol 😂
evil_family.json
, which could have been downloaded from the internet, contained malicious code, we now lost all of our files.eval
evaluates all parts of the input string as if it were Python code..json()
method of a response object, or use the json
library.json
module¶Let's process the same file using the json
module. Recall:
json.load(f)
loads a JSON file from a file object.json.loads(f)
loads a JSON file from a string.f_other = open(os.path.join('data', 'evil_family.json'))
s = f_other.read()
s
'{\n "name": "Grandma",\n "age": 94,\n "children": [\n {\n "name": util.err(),\n "age": 60,\n "children": [{"name": "Me", "age": 24}, \n {"name": "Brother", "age": 22}]\n },\n {\n "name": "My Aunt",\n "children": [{"name": "Cousin 1", "age": 34}, \n {"name": "Cousin 2", "age": 36, "children": \n [{"name": "Cousin 2 Jr.", "age": 2}]\n }\n ]\n }\n ]\n}'
json.loads(s)
--------------------------------------------------------------------------- JSONDecodeError Traceback (most recent call last) /var/folders/pd/w73mdrsj2836_7gp0brr2q7r0000gn/T/ipykernel_33104/1938830664.py in <module> ----> 1 json.loads(s) ~/opt/anaconda3/lib/python3.9/json/__init__.py in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw) 344 parse_int is None and parse_float is None and 345 parse_constant is None and object_pairs_hook is None and not kw): --> 346 return _default_decoder.decode(s) 347 if cls is None: 348 cls = JSONDecoder ~/opt/anaconda3/lib/python3.9/json/decoder.py in decode(self, s, _w) 335 336 """ --> 337 obj, end = self.raw_decode(s, idx=_w(s, 0).end()) 338 end = _w(s, end).end() 339 if end != len(s): ~/opt/anaconda3/lib/python3.9/json/decoder.py in raw_decode(self, s, idx) 353 obj, end = self.scan_once(s, idx) 354 except StopIteration as err: --> 355 raise JSONDecodeError("Expecting value", s, err.value) from None 356 return obj, end JSONDecodeError: Expecting value: line 6 column 17 (char 84)
util.err()
is not a string in JSON (there are no quotes around it), json.loads
is not able to parse it as a JSON object.eval
on "raw" data that you didn't create!GET
HTTP requests to ask for information and POST
HTTP requests to send information.curl
in the command-line or the requests
Python module to make HTTP requests..json()
method of a response object or the json
package to parse them, not eval
.