from dsc80_utils import *
Lecture 6 – EDA Part 2, Hypothesis Testing
DSC 80, Fall 2024
In case you need a review from DSC 10, I've made a Pre-Lecture Review for this lecture.
Announcements 📣
- Project 1 is due today, Oct 15.
- Lab 3 is due on Fri, Oct 18.
- When submitting answers for attendance, copy-paste what you have. Answers that are some variant of "I don't know" will be treated as no submission. Take your best guess!
Agenda 📆
- Finish up our EDA example from last lecture.
- Data scope.
- Overview of hypothesis testing.
- Example: Total variation distance.
- Permutation testing.
- Example: Birth weight and smoking 🚬.
- Example (that you'll read on your own): Permutation testing meets TVD.
SD Food Safety Data
First, let's retrace our steps from last lecture.
rest_path = Path('data') / 'restaurants.csv'
insp_path = Path('data') / 'inspections.csv'
viol_path = Path('data') / 'violations.csv'
def subset_rest(rest):
    return rest[['business_id', 'name', 'address', 'zip', 'opened_date']]

def subset_insp(insp):
    return (
        insp[['business_id', 'inspection_id', 'score', 'grade', 'completed_date', 'status']]
        .rename(columns={'completed_date': 'date'})
    )

def subset_viol(viol):
    return (
        viol[['inspection_id', 'violation', 'major_violation', 'violation_accela']]
        .rename(columns={'violation': 'kind',
                         'major_violation': 'is_major',
                         'violation_accela': 'violation'})
    )
rest = (pd.read_csv(rest_path)
        .pipe(subset_rest))
insp = (pd.read_csv(insp_path)
        .pipe(subset_insp))
viol = (pd.read_csv(viol_path)
        .pipe(subset_viol))
Combining the restaurant data
Let's join all three DataFrames together so that we have all the data in a single DataFrame.
def merge_all_restaurant_data():
    return (
        rest
        .merge(insp, on='business_id', how='left')
        .merge(viol, on='inspection_id', how='left')
    )
df = merge_all_restaurant_data()
df.head(2)
|   | business_id | name | address | zip | ... | status | kind | is_major | violation |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 211898487641 | MOBIL MART LA JOLLA VILLAGE | 3233 LA JOLLA VILLAGE DR, LA JOLLA, CA 92037 | 92037 | ... | Complete | Hot and Cold Water | Y | 21. Hot & cold water available |
| 1 | 211898487641 | MOBIL MART LA JOLLA VILLAGE | 3233 LA JOLLA VILLAGE DR, LA JOLLA, CA 92037 | 92037 | ... | Complete | Hot and Cold Water | N | 21. Hot & cold water available |
2 rows × 13 columns
Question 🤔 (Answer at dsc80.com/q)
Code: lefts
Why should the function above use two left joins? What would go wrong if we used other kinds of joins?
Data cleaning: Missing values
Missing values
Next, it's important to check for and handle missing values, as they can have a big effect on your analysis.
insp[['score', 'grade']]
|   | score | grade |
|---|---|---|
| 0 | 96 | NaN |
| 1 | 98 | NaN |
| 2 | 98 | NaN |
| ... | ... | ... |
| 5176 | 0 | NaN |
| 5177 | 0 | NaN |
| 5178 | 90 | A |
5179 rows × 2 columns
# The proportion of values in each column that are missing.
insp.isna().mean()
business_id      0.00
inspection_id    0.00
score            0.00
grade            0.42
date             0.00
status           0.00
dtype: float64
# Why are there null values here?
# insp['inspection_id'] and viol['inspection_id'] don't have any null values...
df[df['inspection_id'].isna()]
|   | business_id | name | address | zip | ... | status | kind | is_major | violation |
|---|---|---|---|---|---|---|---|---|---|
| 759 | 211941133403 | TASTY CHAI | 8878 REGENTS RD 105, SAN DIEGO, CA 92122-5853 | 92122-5853 | ... | NaN | NaN | NaN | NaN |
| 1498 | 211915545446 | EMBASSY SUITES SAN DIEGO LA JOLLA | 4550 LA JOLLA VILLAGE DR, SAN DIEGO, CA 92122-... | 92122-1248 | ... | NaN | NaN | NaN | NaN |
| 1672 | 211937443689 | SERVICENOW | 4770 EASTGATE MALL, SAN DIEGO, CA 92121-1970 | 92121-1970 | ... | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8094 | 211997340975 | COOKIE SCOOP | 7759 GASTON DR, SAN DIEGO, CA 92126-3036 | 92126-3036 | ... | NaN | NaN | NaN | NaN |
| 8450 | 211900595220 | I LOVE BANANA BREAD CO | 4068 DALLES AVE, SAN DIEGO, CA 92117-5518 | 92117-5518 | ... | NaN | NaN | NaN | NaN |
| 8545 | 211963768842 | PETRA KITCHEN | 5252 BALBOA ARMS DR 175, SAN DIEGO, CA 92117-4949 | 92117-4949 | ... | NaN | NaN | NaN | NaN |
29 rows × 13 columns
There are many ways of handling missing values, which we'll cover in an entire lecture next week. But a good first step is to check how many there are!
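For example, a couple of quick first steps might look like the sketch below, using the df from above. The dropna call is just one provisional option among the many we'll discuss, not necessarily the right choice here.

# Count the number of missing values in each column.
df.isna().sum()

# One possible (provisional) fix: keep only the rows that actually have an
# inspection attached. This is only a sketch of one option!
df_with_inspections = df.dropna(subset=['inspection_id'])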
Data cleaning: Transformations and timestamps
Transformations and timestamps
From last class:
A transformation results from performing some operation on every element in a sequence, e.g. a Series.
It's often useful to look at ways of transforming your data to make it easier to work with.
- Type conversions (e.g. changing the string "$2.99" to the number 2.99).
- Unit conversion (e.g. feet to meters).
- Extraction (getting 'vermin' out of 'Vermin Violation Recorded on 10/10/2023').
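As a small illustration of the first kind of transformation, here's how a string-to-number conversion might look. The prices Series below is made up for the example; it's not part of our dataset.

# Hypothetical Series of price strings.
prices = pd.Series(['$2.99', '$10.50', '$7.00'])

# Strip the '$' and convert the remaining strings to floats.
prices.str.strip('$').astype(float)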
Creating timestamps
Most commonly, we'll parse dates into pd.Timestamp objects.
# Look at the dtype!
insp['date']
0       2023-02-16
1       2022-01-03
2       2020-12-03
           ...
5176    2023-03-06
5177    2022-12-09
5178    2022-11-30
Name: date, Length: 5179, dtype: object
# This magical string tells Python what format the date is in.
# For more info: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior
date_format = '%Y-%m-%d'
pd.to_datetime(insp['date'], format=date_format)
0       2023-02-16
1       2022-01-03
2       2020-12-03
           ...
5176    2023-03-06
5177    2022-12-09
5178    2022-11-30
Name: date, Length: 5179, dtype: datetime64[ns]
# Another advantage of defining functions is that we can reuse this function
# for the 'opened_date' column in `rest` if we wanted to.
def parse_dates(insp, col):
    date_format = '%Y-%m-%d'
    dates = pd.to_datetime(insp[col], format=date_format)
    return insp.assign(**{col: dates})

insp = (
    pd.read_csv(insp_path)
    .pipe(subset_insp)
    .pipe(parse_dates, 'date')
)
# We should also remake df, since it depends on insp.
# Note that the new insp is used to create df!
df = merge_all_restaurant_data()
# Look at the dtype now!
df['date']
0       2023-02-16
1       2022-01-03
2       2020-12-03
           ...
8728    2022-11-30
8729    2022-11-30
8730    2022-11-30
Name: date, Length: 8731, dtype: datetime64[ns]
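As the comment above suggests, we could reuse parse_dates on the 'opened_date' column in rest. A sketch of what that might look like, assuming 'opened_date' is also stored as 'YYYY-MM-DD' strings:

# Hypothetical reuse of parse_dates on rest (assumes the same date format).
rest = (
    pd.read_csv(rest_path)
    .pipe(subset_rest)
    .pipe(parse_dates, 'opened_date')
)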
Working with timestamps
- We often want to adjust the granularity of timestamps to see overall trends, or seasonality.
- Use the resample method in pandas (documentation).
    - Think of it like a version of groupby, but for timestamps.
    - For instance, insp.resample('2W', on='date') separates every two weeks of data into a different group.
insp.resample('2W', on='date')['score'].mean()
date
2020-01-05    42.67
2020-01-19    59.33
2020-02-02    56.34
              ...
2023-09-24    66.60
2023-10-08    59.58
2023-10-22    66.81
Freq: 2W-SUN, Name: score, Length: 100, dtype: float64
# Where are those numbers coming from?
insp[
    (insp['date'] >= pd.Timestamp('2020-01-05')) &
    (insp['date'] < pd.Timestamp('2020-01-19'))
]['score']
10        0
11       92
12        0
        ...
4709      0
4988    100
5107     96
Name: score, Length: 86, dtype: int64
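To check, we can average the scores in this two-week slice and compare against the resampled means above. (Exactly which bin it lines up with depends on resample's closed/label conventions for weekly frequencies, so treat this as a sanity check rather than an exact reproduction.)

# Mean score for inspections in this two-week window.
insp[
    (insp['date'] >= pd.Timestamp('2020-01-05')) &
    (insp['date'] < pd.Timestamp('2020-01-19'))
]['score'].mean()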
(insp.resample('2W', on='date')
     .size()
     .plot(title='Number of Inspections Over Time')
)
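To look at longer-term trends, we could resample at a coarser granularity, e.g. monthly. A sketch (older versions of pandas use 'M' for month-end; newer versions prefer 'ME'):

# Average inspection score per month (swap 'M' for 'ME' on newer pandas).
insp.resample('M', on='date')['score'].mean()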