import pandas as pd
import numpy as np
import os
import seaborn as sns
import plotly.express as px
pd.options.plotting.backend = 'plotly'
"Standard" hypothesis testing helps us answer questions of the form:
I have a population distribution, and I have one sample. Does this sample look like it was drawn from the population?
Permutation testing helps us answer questions of the form:
I have two samples, but no information about any population distributions. Do these samples look like they were drawn from the same population?
Let's load in a cleaned version of the couples dataset from the last lecture.
couples_fp = os.path.join('data', 'married_couples_cleaned.csv')
couples = pd.read_csv(couples_fp)
couples.head()
 | mar_status | empl_status | gender | age |
---|---|---|---|---|
0 | married | Working as paid employee | M | 51 |
1 | married | Working as paid employee | F | 53 |
2 | married | Working as paid employee | M | 57 |
3 | married | Working as paid employee | F | 57 |
4 | married | Working as paid employee | M | 60 |
couples.sample(5)
 | mar_status | empl_status | gender | age |
---|---|---|---|---|
774 | married | Not working - looking for work | M | 35 |
668 | unmarried | Working as paid employee | M | 50 |
490 | married | Working as paid employee | M | 52 |
367 | married | Not working - disabled | F | 55 |
208 | married | Working as paid employee | M | 49 |
To compare the two groups, let's compute the distribution of employment status conditional on household type (married vs. unmarried).
# Note that this is a shortcut to picking a column for values and using aggfunc='count'.
empl_cnts = couples.pivot_table(index='empl_status', columns='mar_status', aggfunc='size')
cond_distr = empl_cnts / empl_cnts.sum()
cond_distr
mar_status | married | unmarried |
---|---|---|
empl_status | | |
Not working - disabled | 0.048518 | 0.077055 |
Not working - looking for work | 0.047844 | 0.118151 |
Not working - on a temporary layoff from a job | 0.014151 | 0.022260 |
Not working - other | 0.122642 | 0.056507 |
Not working - retired | 0.063342 | 0.018836 |
Working as paid employee | 0.610512 | 0.594178 |
Working, self-employed | 0.092992 | 0.113014 |
Are the distributions of employment status for married people and for unmarried people who live with their partners different?
Is this difference just due to noise?
cond_distr.plot(kind='barh', title='Distribution of Employment Status, Conditional on Household Type', barmode='group')
Null Hypothesis: In the US, the distribution of employment status among those who are married is the same as among those who are unmarried and live with their partners. The difference between the two observed samples is due to chance.
Alternative Hypothesis: In the US, the distributions of employment status of the two groups are different.
What is a good test statistic in this case?
Hint: What kind of distributions are we comparing?
cond_distr
mar_status | married | unmarried |
---|---|---|
empl_status | | |
Not working - disabled | 0.048518 | 0.077055 |
Not working - looking for work | 0.047844 | 0.118151 |
Not working - on a temporary layoff from a job | 0.014151 | 0.022260 |
Not working - other | 0.122642 | 0.056507 |
Not working - retired | 0.063342 | 0.018836 |
Working as paid employee | 0.610512 | 0.594178 |
Working, self-employed | 0.092992 | 0.113014 |
Since we're comparing two categorical distributions, a natural test statistic is the total variation distance (TVD). Let's first compute the observed TVD, using our new knowledge of the diff method.
cond_distr.diff(axis=1).iloc[:, -1].abs().sum() / 2
0.1269754089281099
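As a sanity check, we can compute the same quantity without diff, straight from the definition of the TVD applied to the two columns of cond_distr.

# Equivalent computation: half the sum of the absolute differences in proportions.
(cond_distr['married'] - cond_distr['unmarried']).abs().sum() / 2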
Since we'll need to calculate the TVD repeatedly, let's define a function that computes it.
def tvd_of_groups(df, groups, cats):
    '''groups: the binary column (e.g. married vs. unmarried).
       cats: the categorical column (e.g. employment status).
    '''
    cnts = df.pivot_table(index=cats, columns=groups, aggfunc='size')
    # Normalize each column.
    distr = cnts / cnts.sum()
    # Compute and return the TVD.
    return distr.diff(axis=1).iloc[:, -1].abs().sum() / 2
# Same result as above.
observed_tvd = tvd_of_groups(couples, groups='mar_status', cats='empl_status')
observed_tvd
0.1269754089281099
couples.head()
 | mar_status | empl_status | gender | age |
---|---|---|---|---|
0 | married | Working as paid employee | M | 51 |
1 | married | Working as paid employee | F | 53 |
2 | married | Working as paid employee | M | 57 |
3 | married | Working as paid employee | F | 57 |
4 | married | Working as paid employee | M | 60 |
Here, we'll shuffle marital statuses, though remember, we could shuffle employment statuses too.
couples.assign(shuffled_mar=np.random.permutation(couples['mar_status']))
 | mar_status | empl_status | gender | age | shuffled_mar |
---|---|---|---|---|---|
0 | married | Working as paid employee | M | 51 | married |
1 | married | Working as paid employee | F | 53 | married |
2 | married | Working as paid employee | M | 57 | unmarried |
3 | married | Working as paid employee | F | 57 | unmarried |
4 | married | Working as paid employee | M | 60 | unmarried |
... | ... | ... | ... | ... | ... |
2063 | unmarried | Working as paid employee | F | 42 | married |
2064 | unmarried | Working as paid employee | M | 60 | married |
2065 | unmarried | Working as paid employee | F | 53 | married |
2066 | unmarried | Working as paid employee | M | 44 | unmarried |
2067 | unmarried | Working as paid employee | F | 42 | married |
2068 rows × 5 columns
Let's do this repeatedly.
N = 1000
tvds = []
for _ in range(N):
    # Shuffle marital statuses.
    with_shuffled = couples.assign(shuffled_mar=np.random.permutation(couples['mar_status']))

    # Compute and store the TVD.
    tvd = tvd_of_groups(with_shuffled, groups='shuffled_mar', cats='empl_status')
    tvds.append(tvd)
Notice that by defining a function that computes our test statistic, our simulation code is much cleaner.
fig = px.histogram(pd.DataFrame(tvds), x=0, nbins=50, histnorm='probability',
title='Empirical Distribution of the TVD')
fig.add_vline(x=observed_tvd, line_color='red')
fig.add_annotation(text=f'<span style="color:red">Observed TVD = {round(observed_tvd, 2)}</span>',
x=1.15 * observed_tvd, showarrow=False, y=0.055)
fig.update_layout(xaxis_range=[0, 0.2])
p_95 = np.percentile(tvds, 95)
fig.add_vline(x=p_95, line_color='purple')
annot_text = f'<span style="color:purple">The 95th percentile of our<br>empirical distribution is {round(p_95, 2)}.<br><br>'
annot_text += 'If our observed statistic is to the<br>right of this point, we will reject the null<br>at a 5% <b>significance level</b>.</span>'
fig.add_annotation(text=annot_text, x=1.5 * np.percentile(tvds, 95), showarrow=False, y=0.05)
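The histogram already suggests the conclusion visually, but we can also compute the p-value directly: it's the proportion of simulated TVDs that are at least as large as the observed TVD.

# The p-value: the proportion of simulated TVDs at least as large as the observed TVD.
p_value = (np.array(tvds) >= observed_tvd).mean()
p_value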
We reject the null hypothesis that married/unmarried households have similar employment makeups.
We can't say anything about why the employment makeups are different, though!
In the definition of the TVD, we divide the sum of the absolute differences in proportions between the two distributions by 2.
def tvd(a, b):
    # Half the sum of the absolute differences between two distributions.
    return np.sum(np.abs(a - b)) / 2
Question: If we divided by 200 instead of 2, would we still reject the null hypothesis?
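One way to convince yourself: rescaling the test statistic rescales the observed statistic and every simulated statistic by the same factor, so the comparisons between them, and hence the p-value, are unchanged. A quick empirical check, reusing tvds and observed_tvd from above:

# Dividing by 200 instead of 2 scales every TVD by 1/100, so the
# p-value, and our decision, stays the same.
p_original = (np.array(tvds) >= observed_tvd).mean()
p_scaled = (np.array(tvds) / 100 >= observed_tvd / 100).mean()
p_original, p_scaled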
Let's now switch gears and discuss missing values.
There are four key ways in which values can be missing: missing by design (MD), not missing at random (NMAR), missing at random (MAR), and missing completely at random (MCAR). It is important to distinguish between these types so that we can correctly impute (fill in) the missing data.
Values that are missing by design (MD) are missing because the design of the data collection process determines exactly which values are missing:

- Example: 'Age4' is missing if and only if 'Number of People' is less than 4.
- Example: 'Car Type?' and 'Car Colour?' are missing if and only if 'Own a car?' is 'No'.
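A nice property of MD values is that we can verify the design rule directly from the data. A minimal sketch, assuming a hypothetical DataFrame survey with the columns from the first example above:

# If 'Age4' is missing exactly when 'Number of People' is less than 4,
# the missingness is by design, and this check returns True.
(survey['Age4'].isna() == (survey['Number of People'] < 4)).all()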
Consider the following (contrived) example:
We are now missing birth months for the first 10 people we surveyed. What is the missingness mechanism for birth months if:
Remember: a good strategy is to assess missingness in the following order: first check whether values are missing by design (MD), then consider NMAR, then MAR, and finally MCAR.
In each of the following examples, decide whether the missing data are likely to be MD, NMAR, MAR, or MCAR:

- A dataset with columns 'gender' and 'age', in which 'age' has missing values.
- A column of 'self-reported education level', which contains missing values.
- A table with columns 'Version 1', 'Version 2', and 'Version 3', in which $\frac{2}{3}$ of the entries in the table are NaN.

We won't spend much time on the following formal definitions in lecture, but you may find them helpful.
Suppose we have:

- A dataset $Y$ with observed values $Y_{obs}$ and missing values $Y_{mis}$.
- A parameter $\psi$ that governs the missingness mechanism.

Data is missing completely at random (MCAR) if

$$\text{P}(\text{data is present} \: | \: Y_{obs}, Y_{mis}, \psi) = \text{P}(\text{data is present} \: | \: \psi)$$

That is, adding information about the dataset doesn't change the likelihood that data is missing!
With the same setup as above, data is missing at random (MAR) if

$$\text{P}(\text{data is present} \: | \: Y_{obs}, Y_{mis}, \psi) = \text{P}(\text{data is present} \: | \: Y_{obs}, \psi)$$

That is, MAR data is actually MCAR, conditional on $Y_{obs}$.
With the same setup as above, data is not missing at random (NMAR) if

$$\text{P}(\text{data is present} \: | \: Y_{obs}, Y_{mis}, \psi)$$

cannot be simplified. That is, in NMAR data, missingness is dependent on the missing value itself.
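To make these three definitions concrete, here's a minimal simulation sketch; the toy dataset, probabilities, and thresholds below are all made up for illustration.

# Toy illustration of the three mechanisms (values and thresholds are made up).
rng = np.random.default_rng(42)
toy = pd.DataFrame({'age': rng.integers(18, 80, size=1000),
                    'income': rng.normal(50, 15, size=1000)})

# MCAR: each 'income' value is missing with the same probability,
# regardless of 'age' or of the income value itself.
mcar = toy['income'].mask(rng.random(1000) < 0.2)

# MAR: 'income' is more likely to be missing for older respondents;
# missingness depends only on the observed column 'age'.
mar = toy['income'].mask((toy['age'] > 60) & (rng.random(1000) < 0.5))

# NMAR: high incomes are more likely to be missing; missingness
# depends on the (unobserved) value itself.
nmar = toy['income'].mask((toy['income'] > 65) & (rng.random(1000) < 0.5))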
Consider the following table of phones, in which some prices are missing:

Phone | Screen Size (inches) | Price (USD) |
---|---|---|
iPhone 14 | 6.06 | 999 |
Galaxy Z Fold 4 | 7.6 | NaN |
OnePlus 9 Pro | 6.7 | 799 |
iPhone 13 Pro Max | 6.68 | NaN |
Missing completely at random (MCAR): The chance that a value is missing is completely independent of other columns and the actual missing value.
Important: Refer to the Flowchart when deciding between missingness types.
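In practice, deciding between MCAR and MAR can reuse the permutation test from the first half of this lecture: compare the distribution of another column for rows where the value is missing versus rows where it isn't. Below is a minimal sketch, assuming a hypothetical DataFrame phones with the same columns as the table above but many more rows; the helper name and column names are illustrative.

# Hypothetical check: is 'Price' MCAR, or does its missingness depend on
# 'Screen Size' (MAR)? Permutation-test the mean screen size of rows with
# missing prices against rows with observed prices.
def mean_screen_diff(df):
    means = df.groupby('price_missing')['Screen Size'].mean()
    return abs(means.diff().iloc[-1])

phones = phones.assign(price_missing=phones['Price'].isna())
observed_diff = mean_screen_diff(phones)

diffs = []
for _ in range(1000):
    shuffled = phones.assign(price_missing=np.random.permutation(phones['price_missing']))
    diffs.append(mean_screen_diff(shuffled))

# A small p-value suggests missingness depends on 'Screen Size' (MAR, not MCAR).
p_value = (np.array(diffs) >= observed_diff).mean()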