import pandas as pd
import numpy as np
import os
import util
import matplotlib.pyplot as plt
plt.style.use('seaborn-white')
plt.rc('figure', dpi=100, figsize=(10, 5))
plt.rc('font', size=12)
heights = pd.read_csv(os.path.join('data', 'heights.csv'))
heights = (
heights
.rename(columns={'childHeight': 'child', 'childNum': 'number'})
.drop('midparentHeight', axis=1)
)
heights.head()
|   | family | father | mother | children | number | gender | child |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 78.5 | 67.0 | 4 | 1 | male | 73.2 |
| 1 | 1 | 78.5 | 67.0 | 4 | 2 | female | 69.2 |
| 2 | 1 | 78.5 | 67.0 | 4 | 3 | female | 69.0 |
| 3 | 1 | 78.5 | 67.0 | 4 | 4 | female | 69.0 |
| 4 | 2 | 75.5 | 66.5 | 4 | 1 | male | 73.5 |
np.random.seed(42) # So that we get the same results each time (for lecture)
heights_mcar = util.make_mcar(heights, 'child', pct=0.50)
heights_mcar.head()
|   | family | father | mother | children | number | gender | child |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 78.5 | 67.0 | 4 | 1 | male | 73.2 |
| 1 | 1 | 78.5 | 67.0 | 4 | 2 | female | 69.2 |
| 2 | 1 | 78.5 | 67.0 | 4 | 3 | female | NaN |
| 3 | 1 | 78.5 | 67.0 | 4 | 4 | female | NaN |
| 4 | 2 | 75.5 | 66.5 | 4 | 1 | male | 73.5 |
The 'child' column has missing values.
- If 'child' is MCAR, then fill in each of the missing values using the mean of the observed values.
- If 'child' is MAR dependent on a categorical column, then fill in each of the missing values using the mean of the observed values in each category. For instance, if 'child' is MAR dependent on 'gender', we can fill in:
  - missing 'child' heights with the observed mean for female children, and
  - missing 'child' heights with the observed mean for male children (a sketch of this group-wise approach appears after the plot below).
- If 'child' is MAR dependent on a numerical column, then bin the numerical column to make it categorical, then follow the procedure above. See Lab 5, Question 3!
heights_mcar_mfilled = heights_mcar.fillna(heights_mcar['child'].mean())
heights_mcar_mfilled['child'].head()
0 73.200000 1 69.200000 2 66.640685 3 66.640685 4 73.500000 Name: child, dtype: float64
plt.hist([heights['child'], heights_mcar['child'].dropna(), heights_mcar_mfilled['child']])
plt.legend(['full data', 'missing (mcar)', 'imputed']);
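For instance, if we believe 'child' is MAR dependent on 'gender', group-wise mean imputation can be done with a groupby. The sketch below is for illustration only (heights_mcar_cond is a name chosen here, not from the lecture):
# A sketch of conditional (group-wise) mean imputation, assuming 'child' is MAR dependent on 'gender'.
# Within each gender group, missing 'child' values are replaced by that group's observed mean.
heights_mcar_cond = heights_mcar.assign(
    child=heights_mcar.groupby('gender')['child'].transform(lambda s: s.fillna(s.mean()))
)
heights_mcar_cond['child'].head()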
The 'child' column has missing values.
- If 'child' is MCAR, then fill in each of the missing values with randomly selected observed 'child' heights. For example, to fill in 5 missing 'child' values, pick 5 of the not-missing 'child' values.
- If 'child' is MAR dependent on a categorical column, sample from the observed values separately for each category.
# Figure out the number of missing values
num_null = heights_mcar['child'].isna().sum()
# Sample that number of values from the observed dataset
fill_values = heights_mcar['child'].dropna().sample(num_null, replace=True)
# Find the positions where values in heights_mcar are missing
fill_values.index = heights_mcar.loc[heights_mcar['child'].isna()].index
# Fill in the missing values
heights_mcar_dfilled = heights_mcar.fillna({'child': fill_values.to_dict()}) # fill the vals
plt.hist([heights['child'], heights_mcar['child'], heights_mcar_dfilled['child']], density=True);
plt.legend(['full data','missing (mcar)', 'distr imputed']);
No spikes!
If a value was never observed in the dataset, it will never be used to fill in a missing value.
Solution? Create a histogram (with np.histogram) to bin the data, then sample from the histogram. See Lab 5, Question 4.
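A rough sketch of that idea (not Lab 5's exact solution; the choice of 10 bins is arbitrary):
# Bin the observed 'child' heights with np.histogram, then sample a bin in proportion
# to its count and draw a value uniformly from within the chosen bin.
observed = heights_mcar['child'].dropna()
counts, bin_edges = np.histogram(observed, bins=10)
def sample_from_hist(n):
    chosen = np.random.choice(len(counts), size=n, p=counts / counts.sum())  # pick bins by frequency
    return np.random.uniform(bin_edges[chosen], bin_edges[chosen + 1])       # draw within each bin
sample_from_hist(5)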
Steps:
1. Create several imputed versions of the data through a probabilistic procedure.
2. Then, estimate the parameters of interest for each imputed dataset.
Let's try this procedure out on the heights_mcar dataset.
heights_mcar.head()
|   | family | father | mother | children | number | gender | child |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 78.5 | 67.0 | 4 | 1 | male | 73.2 |
| 1 | 1 | 78.5 | 67.0 | 4 | 2 | female | 69.2 |
| 2 | 1 | 78.5 | 67.0 | 4 | 3 | female | NaN |
| 3 | 1 | 78.5 | 67.0 | 4 | 4 | female | NaN |
| 4 | 2 | 75.5 | 66.5 | 4 | 1 | male | 73.5 |
# This function implements the 3-step process we studied earlier
def create_imputed(col):
num_null = col.isna().sum()
fill_values = col.dropna().sample(num_null, replace=True)
fill_values.index = col.loc[col.isna()].index
return col.fillna(fill_values.to_dict())
Each time we run the following cell, it generates a new imputed version of the 'child' column.
create_imputed(heights_mcar['child']).head()
0 73.2 1 69.2 2 67.7 3 66.0 4 73.5 Name: child, dtype: float64
Let's run the above procedure 100 times.
mult_imp = pd.concat([create_imputed(heights_mcar['child']).rename(k) for k in range(100)], axis=1)
mult_imp.head()
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 | ... | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 | 73.2 |
| 1 | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 | ... | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 | 69.2 |
| 2 | 64.5 | 72.0 | 67.0 | 69.0 | 69.0 | 70.0 | 69.0 | 67.0 | 67.0 | 64.0 | ... | 65.7 | 65.0 | 65.7 | 70.7 | 73.0 | 65.0 | 66.0 | 63.0 | 71.0 | 69.0 |
| 3 | 63.5 | 64.5 | 61.7 | 67.2 | 61.0 | 64.0 | 66.0 | 56.0 | 60.0 | 62.0 | ... | 67.2 | 64.5 | 68.0 | 62.0 | 64.0 | 62.0 | 66.0 | 65.5 | 69.0 | 64.5 |
| 4 | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 | ... | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 | 73.5 |
5 rows × 100 columns
Let's plot some of the imputed columns above.
# Random sample of 15 imputed columns
mult_imp.sample(15, axis=1).plot(kind='kde', alpha=0.5, legend=False);
Let's look at the distribution of means across the imputed columns.
mult_imp.mean().plot(kind='hist', bins=20, ec='w', density=True);
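A common final step, not shown above, is to pool the per-imputation estimates into a single number and use their spread as a rough measure of the extra uncertainty that imputation introduces. A minimal sketch:
# Pool the 100 imputed means into one estimate; their standard deviation reflects
# the variability introduced by the random imputation procedure.
imputed_means = mult_imp.mean()
imputed_means.mean(), imputed_means.std()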
See the end of Lecture 13 for a detailed summary of all imputation techniques that we've seen so far.
Data doesn't always come in clean .csv files; often, we need to collect it from the internet. UCSD was a node in ARPANET, the predecessor to the modern internet (source).
HTTP follows the request-response model.
The request methods you will use most often are GET and POST.
- GET is used to request data from a specified resource.
- POST is used to send data to the server.
See Mozilla's web docs for a detailed list of request methods.
GET request
Below is an example GET HTTP request made by a browser when accessing datascience.ucsd.edu.
GET / HTTP/1.1
Host: datascience.ucsd.edu
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36
Connection: keep-alive
Accept-Language: en-US,en;q=0.9
The first line (GET / HTTP/1.1) is called the "request line", and the lines afterwards are called "header fields". We could also provide a "body" after the header fields.
GET response
The response below was generated by executing the request on the previous slide.
HTTP/1.1 200 OK
Date: Fri, 29 Apr 2022 02:54:41 GMT
Server: Apache
Link: <https://datascience.ucsd.edu/wp-json/>; rel="https://api.w.org/"
Link: <https://datascience.ucsd.edu/wp-json/wp/v2/pages/2427>; rel="alternate"; type="application/json"
Link: <https://datascience.ucsd.edu/>; rel=shortlink
Content-Type: text/html; charset=UTF-8
<!DOCTYPE html>
<html lang="en-US">
<head>
<meta charset="UTF-8">
<link rel="profile" href="https://gmpg.org/xfn/11">
<style media="all">img.wp-smiley,img.emoji{display:inline !important;border:none
...
Read Inside a viral website, an account of what it's like to run a site that gained 50 million+ views in 5 days.
There are (at least) two ways to make HTTP requests:
- From the command line, with curl.
- From Python, with the requests package.
curl
curl is a command-line tool that sends HTTP requests, like a browser. When you provide curl a URL, it sends an HTTP request (you can specify the request method, e.g. GET or POST).
GET requests via curl
The following curl command issues a GET request. (To run shell commands in a Jupyter notebook, we put a ! before them.)
curl -v https://httpbin.org/html
# (`-v` is short for verbose)
!curl -v https://httpbin.org/html
In a GET request, a ? in the URL begins a query. For instance:
https://www.google.com/search?q=ucsd+dsc+80+hard&client=safari
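As a sketch, the request below sends a query string to httpbin.org/get, an endpoint (used here only for illustration) that echoes the query parameters back in its response:
!curl 'https://httpbin.org/get?name=King+Triton'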
POST requests via curl
In curl, the -d flag specifies data to send; including it makes curl issue a POST request. Below is an example curl POST request that sends 'King Triton' as the parameter 'name'.
curl -d 'name=King Triton' https://httpbin.org/post
!curl -d 'name=King Triton' https://httpbin.org/post
!curl -d 'name=King Triton' https://youtube.com
requests
requests is a Python package that allows you to use Python to interact with the internet! There are other packages that work similarly (e.g. urllib), but requests is arguably the easiest to use.
import requests
GET requests via requests
To access the source code of the UCSD home page, all we need to run is the following:
text = requests.get('https://ucsd.edu').text
url = 'https://ucsd.edu'
resp = requests.get(url)
resp is now a Response object.
resp
<Response [200]>
The text attribute of resp is a string containing the entire response.
type(resp.text)
str
len(resp.text)
43692
print(resp.text[:1000])
<!DOCTYPE html> <html lang="en"> <head> <meta charset="utf-8"/> <meta content="IE=edge" http-equiv="X-UA-Compatible"/> <meta content="width=device-width, initial-scale=1" name="viewport"/> <title>University of California San Diego</title> <meta content="University of California, San Diego" name="ORGANIZATION"/> <meta content="index,follow,noarchive" name="robots"/> <meta content="UCSD" name="SITE"/> <meta content="University of California San Diego" name="PAGETITLE"/> <meta content="The University California San Diego is one of the world's leading public research universities, located in beautiful La Jolla, California" name="DESCRIPTION"/> <link href="favicon.ico" rel="icon"/> <!-- Site-specific CSS files --> <link href="https://www.ucsd.edu/_resources/css/vendor/brix_sans.css" rel="stylesheet" type="text/css"/> <!-- CSS complied from style overrides --> <link href="https://www.ucsd.edu/_resources/css/s
The request's url attribute contains the URL that we accessed.
resp.request.url
'https://ucsd.edu/'
POST requests via requests
post_response = requests.post('https://httpbin.org/post',
                              data={'name': 'King Triton'})
post_response
<Response [200]>
print(post_response.text)
{ "args": {}, "data": "", "files": {}, "form": { "name": "King Triton" }, "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate, br", "Content-Length": "16", "Content-Type": "application/x-www-form-urlencoded", "Host": "httpbin.org", "User-Agent": "python-requests/2.26.0", "X-Amzn-Trace-Id": "Root=1-626b9412-7d96f8fc5ad980b61f34ae79" }, "json": null, "origin": "70.95.172.151", "url": "https://httpbin.org/post" }
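Since the body of this response is JSON, we can parse it into a Python dictionary with the Response object's json method. A quick sketch (not from the original notebook):
# Parse the JSON response body; the 'form' field contains the data we sent
post_response.json()['form']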
A status code of 200 means there were no issues. Other status codes indicate errors or other events, e.g. 404: page not found; 500: internal server error. You have likely run into a 404 before.
r = requests.get('https://httpstat.us/503')
print(r.status_code)
503
r.text
'503 Service Unavailable'
Response objects have an ok attribute, which returns a bool indicating whether the status code was successful. (When making many requests, or retrying failed ones, it can help to pause between them, e.g. with time.sleep.)
status_codes = [200, 201, 403, 404, 503]
for code in status_codes:
    r = requests.get(f'https://httpstat.us/{code}')
    print(f'{code} ok: {r.ok}')
200 ok: True 201 ok: True 403 ok: False 404 ok: False 503 ok: False
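As a sketch of the time.sleep idea above (the retry count and delay are arbitrary choices, not from the lecture):
import time
# Retry a request a few times, pausing between attempts; returns the last response received
def get_with_retries(url, retries=3, delay=1):
    for _ in range(retries):
        r = requests.get(url)
        if r.ok:
            break
        time.sleep(delay)  # wait before trying again
    return r
get_with_retries('https://httpstat.us/200').ok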
The raise_for_status method of a Response raises an exception when the status code is not ok.
requests.get('https://httpstat.us/400').raise_for_status()
--------------------------------------------------------------------------- HTTPError Traceback (most recent call last) /var/folders/pd/w73mdrsj2836_7gp0brr2q7r0000gn/T/ipykernel_8477/2094305732.py in <module> ----> 1 requests.get('https://httpstat.us/400').raise_for_status() ~/opt/anaconda3/lib/python3.9/site-packages/requests/models.py in raise_for_status(self) 951 952 if http_error_msg: --> 953 raise HTTPError(http_error_msg, response=self) 954 955 def close(self): HTTPError: 400 Client Error: Bad Request for url: https://httpstat.us/400
The internet currently relies on two key data formats – HTML and JSON.
- The response body of a GET request is usually either JSON (when using an API) or HTML (when accessing a webpage).
- The response body of a POST request is usually JSON.
JSON data types include strings, numbers, Booleans (written true and false), arrays (written with []), objects, and null.
.See json-schema.org for more details.
import json
f = open(os.path.join('data', 'family.json'), 'r')
family_tree = json.load(f)
family_tree
{'name': 'Grandma', 'age': 94, 'children': [{'name': 'Dad', 'age': 60, 'children': [{'name': 'Me', 'age': 23}, {'name': 'Brother', 'age': 21}]}, {'name': 'My Aunt', 'children': [{'name': 'Cousin 1', 'age': 34}, {'name': 'Cousin 2', 'age': 36, 'children': [{'name': 'Cousin 2 Jr.', 'age': 2}]}]}]}
family_tree['children'][0]['children'][0]['age']
23
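As a small, made-up illustration of the JSON data types listed earlier, here is a JSON string that uses each of them, parsed with json.loads:
# Strings, numbers, Booleans, arrays, objects, and null, all in one document
json.loads('{"name": "Triton", "age": 2, "is_mascot": true, "colors": ["blue", "gold"], "advisor": null}')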
eval
eval, which stands for "evaluate", is a function built into Python.
x = 4
eval('x + 5')
9
It appears that eval can do the same thing that json.load does...
f = open(os.path.join('data', 'family.json'), 'r')
eval(f.read())
{'name': 'Grandma', 'age': 94, 'children': [{'name': 'Dad', 'age': 60, 'children': [{'name': 'Me', 'age': 23}, {'name': 'Brother', 'age': 21}]}, {'name': 'My Aunt', 'children': [{'name': 'Cousin 1', 'age': 34}, {'name': 'Cousin 2', 'age': 36, 'children': [{'name': 'Cousin 2 Jr.', 'age': 2}]}]}]}
...but you should never use eval like this. The next slide demonstrates why.
eval gone wrong
Below, we run eval on a string representation of a JSON object:
f_other = open(os.path.join('data', 'evil_family.json'))
eval(f_other.read())
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) /var/folders/pd/w73mdrsj2836_7gp0brr2q7r0000gn/T/ipykernel_8477/3392341705.py in <module> 1 f_other = open(os.path.join('data', 'evil_family.json')) ----> 2 eval(f_other.read()) <string> in <module> ~/Desktop/80/private/lectures/sp22/lec14/util.py in err() 132 # For JSON evaluation example 133 def err(): --> 134 raise ValueError('i just deleted all your files lol 😂') ValueError: i just deleted all your files lol 😂
Since evil_family.json, which could have been downloaded from the internet, contained malicious code, we now lost all of our files.
eval evaluates all parts of the input string as if it were Python code. You never need to do this – instead, use the json library.
- json.load loads a JSON object from a file.
- json.loads loads a JSON object from a string.
loads a JSON file from a string.f_other = open(os.path.join('data', 'evil_family.json'))
s = f_other.read()
s
'{\n "name": "Grandma",\n "age": 94,\n "children": [\n {\n "name": util.err(),\n "age": 60,\n "children": [{"name": "Me", "age": 23}, \n {"name": "Brother", "age": 21}]\n },\n {\n "name": "My Aunt",\n "children": [{"name": "Cousin 1", "age": 34}, \n {"name": "Cousin 2", "age": 36, "children": \n [{"name": "Cousin 2 Jr.", "age": 2}]\n }\n ]\n }\n ]\n}'
json.loads(s)
--------------------------------------------------------------------------- JSONDecodeError Traceback (most recent call last) /var/folders/pd/w73mdrsj2836_7gp0brr2q7r0000gn/T/ipykernel_8477/1938830664.py in <module> ----> 1 json.loads(s) ~/opt/anaconda3/lib/python3.9/json/__init__.py in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw) 344 parse_int is None and parse_float is None and 345 parse_constant is None and object_pairs_hook is None and not kw): --> 346 return _default_decoder.decode(s) 347 if cls is None: 348 cls = JSONDecoder ~/opt/anaconda3/lib/python3.9/json/decoder.py in decode(self, s, _w) 335 336 """ --> 337 obj, end = self.raw_decode(s, idx=_w(s, 0).end()) 338 end = _w(s, end).end() 339 if end != len(s): ~/opt/anaconda3/lib/python3.9/json/decoder.py in raw_decode(self, s, idx) 353 obj, end = self.scan_once(s, idx) 354 except StopIteration as err: --> 355 raise JSONDecodeError("Expecting value", s, err.value) from None 356 return obj, end JSONDecodeError: Expecting value: line 6 column 17 (char 84)
Since util.err() is not a string in JSON (there are no quotes around it), json.loads is not able to parse it as a JSON object. Never use eval on "raw" data that you didn't create!
To summarize:
- We use GET HTTP requests to ask for information and POST HTTP requests to send information.
- We can use curl in the command line or the requests Python package to make HTTP requests.
- When working with JSON files, use the json package to parse them, not eval.