Lecture 5 – Unfaithful Data, Hypothesis Testing

DSC 80, Spring 2022



Unfaithful data

Is the data "faithful" to the data generating process (DGP)?

Example: Police vehicle stops 🚔

The dataset we're working with contains all of the vehicle stops that the San Diego Police Department made in 2016.

General questions

  1. Check the data types. Notice any issues?
  2. Do string fields have consistent values?
  3. Are there missing values that we don't understand?
  4. Are all values within a reasonable range?
  5. How do we deal with the messiness we find?
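As a rough sketch of the first question, here is one way to spot a numeric column that was read in as text. The tiny DataFrame and the 'No Age' placeholder are made up for illustration; they are not taken from the actual stops data.

```python
import pandas as pd

# Hypothetical stand-in for the stops data; these values are invented.
raw = pd.DataFrame({'subject_age': ['25', '40', 'No Age', '31']})

# Ages stored as strings are a hint that something non-numeric is in the column.
column_types = raw.dtypes

# errors='coerce' turns anything that can't be parsed as a number into NaN,
# so the problematic entries become easy to count and inspect.
ages_numeric = pd.to_numeric(raw['subject_age'], errors='coerce')
n_unparseable = ages_numeric.isna().sum()
```

Coercing is only a first look. Whether the unparseable entries should become nulls is a judgment call that depends on how the data was recorded.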

Data types


Recorded ages range from 0 to 220. Was a 220-year-old really pulled over?
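A minimal sketch of a range check, assuming the ages have already been converted to numbers. The DataFrame `stops_demo` and its values are hypothetical, and the 15 to 85 bounds are simply the "reasonable range" used later in this lecture.

```python
import pandas as pd

# Hypothetical ages; only the extremes (0 and 220) mirror what the lecture describes.
stops_demo = pd.DataFrame({'subject_age': [0, 17, 25, 42, 220, 63]})

# Summary statistics reveal the implausible minimum and maximum.
age_summary = stops_demo['subject_age'].describe()

# between is inclusive on both ends by default; ~ flips the mask
# so that we keep only the rows outside the plausible range.
implausible = stops_demo[~stops_demo['subject_age'].between(15, 85)]
```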

Unfaithful 'subject_age'

Human-entered data

Let's look at all unique stop causes. Notice that there are three different causes related to bicycles, which should probably all fall under the same cause.
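One possible way to merge such categories is sketched below. The cause strings are invented examples of human-entered variants, not the dataset's actual values, and matching on the substring "bicycle" is an assumption about what should count as the same cause.

```python
import pandas as pd

# Hypothetical stop causes, including three bicycle-related variants.
causes = pd.Series([
    'Moving Violation', 'Bicycle', 'bicycle',
    'Bicycle Bicycle', 'Equipment Violation',
])

# Inspect the distinct values first.
unique_before = causes.unique()

# Flag every cause that mentions "bicycle", ignoring case.
is_bike = causes.str.contains('bicycle', case=False)

# where keeps values where the condition is True and replaces the rest,
# so every bicycle-related cause becomes the single label 'Bicycle'.
cleaned = causes.where(~is_bike, 'Bicycle')
```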

Let's plot the distribution of ages, within a reasonable range (15 to 85). What do you notice?

Now let's look at the first few and last few rows of stops.

Do you think '-0:81' is a time that a computer would record?
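Impossible times like this one can be surfaced by parsing with a strict format. This is a sketch only: the Series name `stop_time` and the 'HH:MM' format are assumptions about how the times are stored.

```python
import pandas as pd

# Hypothetical times; '-0:81' is the value mentioned above.
stop_time = pd.Series(['00:10', '13:45', '-0:81'])

# With an explicit format and errors='coerce', strings that are not
# valid HH:MM times become NaT instead of raising an error.
parsed = pd.to_datetime(stop_time, format='%H:%M', errors='coerce')

# The entries that failed to parse are the ones worth investigating.
bad_times = stop_time[parsed.isna()]
```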

Unfaithful data vs. outliers


Reminder: tools 🛠

You'll use the following methods regularly when initially exploring a dataset.
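As an illustration, here are a few pandas methods and attributes that are commonly used for a first look at a DataFrame. This particular selection is an assumption, and the small DataFrame is made up.

```python
import pandas as pd

# Hypothetical data for demonstration purposes only.
df = pd.DataFrame({
    'stop_cause': ['Moving Violation', 'Equipment Violation', 'Moving Violation'],
    'subject_age': [25, 40, 31],
})

first_rows = df.head(2)                          # first few rows
column_types = df.dtypes                         # data type of each column
n_rows, n_cols = df.shape                        # number of rows and columns
cause_counts = df['stop_cause'].value_counts()   # frequency of each distinct value
age_stats = df['subject_age'].describe()         # count, mean, min, max, quartiles
```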

Missing values

Where'd you go?

Common representations of "null"

Missing values in the stops dataset

Which null values in the stops dataset are represented by something other than np.nan?
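Once such placeholders are identified, one option is to convert them to true nulls so that pandas treats them as missing. In this sketch, the placeholder values 'No Age' and '0' are hypothetical and stand in for whatever the dataset actually uses.

```python
import numpy as np
import pandas as pd

# Hypothetical ages stored as strings, with two made-up placeholders for "missing".
ages = pd.Series(['25', 'No Age', '0', '40', np.nan])
placeholders = ['No Age', '0']

# mask replaces entries where the condition is True with NaN (its default),
# so every placeholder becomes a proper null.
ages_clean = ages.mask(ages.isin(placeholders))

n_missing_before = ages.isna().sum()
n_missing_after = ages_clean.isna().sum()
```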

Finding null values in pandas
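A short sketch of `isna` on a DataFrame, using invented data. The column names are illustrative and the values are not from the stops dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one null in each column.
df = pd.DataFrame({
    'subject_age': [25, np.nan, 40],
    'subject_sex': ['M', 'F', None],
})

# isna returns a DataFrame of booleans with the same shape as df.
null_mask = df.isna()

# Summing booleans counts the nulls in each column.
nulls_per_column = null_mask.sum()

# any(axis=1) is True for rows that contain at least one null.
rows_with_any_null = df[null_mask.any(axis=1)]
```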

Dropping observations with null values

When used on a DataFrame:
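The sketch below shows how a few `dropna` arguments behave on a small, made-up DataFrame. Note that `dropna` returns a new DataFrame and leaves the original unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 is entirely null, rows 0 and 3 are partially null.
df = pd.DataFrame({
    'a': [1, np.nan, 3, np.nan],
    'b': [np.nan, np.nan, 'x', 'y'],
})

# Default (how='any'): drop every row that has at least one null.
any_dropped = df.dropna()

# how='all': drop only rows in which every value is null.
all_dropped = df.dropna(how='all')

# subset: only look at the listed columns when deciding what to drop.
subset_dropped = df.dropna(subset=['b'])

# axis=1: drop columns (rather than rows) that contain any null.
cols_dropped = df.dropna(axis=1)
```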

Filling null values

The fillna method replaces all null values. Specifically:
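A minimal sketch of some common filling strategies on invented data. Which strategy is appropriate, if any, depends on why the values are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with two nulls in the middle.
s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Fill every null with a constant.
filled_const = s.fillna(0)

# Fill every null with the mean of the non-null values (2.5 here).
filled_mean = s.fillna(s.mean())

# Forward fill: carry the last non-null value forward.
filled_ffill = s.ffill()

# On a DataFrame, a dictionary specifies a different fill value per column.
df = pd.DataFrame({'a': [1.0, np.nan], 'b': [np.nan, 'x']})
filled_by_col = df.fillna({'a': 0, 'b': 'unknown'})
```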

Data types and np.nan
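Because np.nan is itself a float, a single missing value can change a column's data type. The sketch below illustrates this with made-up Series. The nullable 'Int64' dtype is one pandas option for keeping integers alongside missing values.

```python
import numpy as np
import pandas as pd

# A Series of integers with no missing values is stored as int64.
ints = pd.Series([1, 2, 3])

# Adding a single NaN forces the whole Series to float64.
with_nan = pd.Series([1, np.nan, 3])

# The nullable integer dtype 'Int64' (capital I) keeps integers and
# represents the missing entry as pd.NA instead.
nullable = pd.Series([1, None, 3], dtype='Int64')

# Booleans mixed with NaN fall back to the generic object dtype.
bools = pd.Series([True, np.nan, False])
```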

More soon...

Hypothesis testing

Answering questions with confidence 💪

Now our data is clean and we're confident that it's faithful to the data generating process.

How do we ask questions and draw conclusions about the data generating process, using our observed data?

Run the following cell to set things up.

Was the coin fair? 🪙

Null hypothesis

Test statistics

Making decisions

Running a hypothesis test, DSC 10 style

Let's use the number of heads ($N_H$) as our test statistic. We need to:

  1. Compute the observed value of the test statistic, i.e. the observed number of heads.
  2. Simulate values of the test statistic under the null, i.e. under the assumption that the coin was fair.
  3. Use the resulting distribution to calculate the (approximate) probability of seeing 68 or more heads, under the assumption that the coin was fair.
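The three steps above can be sketched as follows. The numbers 114 and 68 come from this example. The use of `rng.binomial`, the seed, and the number of simulations are assumptions made for this sketch, not necessarily what the lecture's own code does.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded so the sketch is reproducible

n_flips = 114            # flips in the observed sample
observed_heads = 68      # step 1: observed value of the test statistic
n_simulations = 100_000  # assumed number of repetitions

# Step 2: simulate the test statistic under the null. Each draw is the
# number of heads in 114 flips of a fair coin.
results = rng.binomial(n=n_flips, p=0.5, size=n_simulations)

# Step 3: the proportion of simulated statistics at least as large as the
# observed one approximates the probability of 68 or more heads under the null.
p_value = np.mean(results >= observed_heads)
```

The comparison uses `>=` rather than `>` because the question asks about 68 or more heads, so simulated outcomes equal to the observed value count as evidence too.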

Each entry in results is the number of heads in 114 simulated coin flips.

Plotting the empirical distribution of the test statistic

Question: Do you think the coin was fair?

Summary, next time