Lecture 8 – Unfaithful Data, Hypothesis Testing

DSC 80, Spring 2023

Agenda

Messy data

More data type ambiguities

Example: The Norway problem 🇳🇴

Unfaithful data

Is the data "faithful" to the DGP?

Example: Police vehicle stops 🚔

The dataset we're working with contains all of the vehicle stops that the San Diego Police Department made in 2016.

Data types

Are the data types correct? If not, are they easily fixable?

Unfaithfulness

Ages range all over the place, from 0 to 220. Was a 220-year-old really pulled over?

What about all of the stops that involved people under the legal driving age?
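One quick way to size up these concerns is to count how many recorded ages fall outside a plausible range. The sketch below uses a small made-up DataFrame rather than the real stops data. The column name 'subject_age' comes from the dataset, while the cutoffs of 16 and 100 are assumptions chosen purely for illustration.

```python
import pandas as pd

# Hypothetical sample; the real stops dataset is much larger.
stops_demo = pd.DataFrame({'subject_age': [0, 15, 16, 24, 37, 85, 99, 220]})

# Assumed plausibility bounds, not official thresholds.
MIN_AGE, MAX_AGE = 16, 100

too_young = stops_demo['subject_age'] < MIN_AGE
too_old = stops_demo['subject_age'] > MAX_AGE

print('Below the assumed minimum age:', too_young.sum())
print('Above the assumed maximum age:', too_old.sum())

# Rows worth a closer look before any analysis that depends on age.
suspicious_ages = stops_demo[too_young | too_old]
```

Flagging rows like this does not tell us why they are wrong, only that they deserve a closer look.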

Unfaithful 'subject_age'

Human-entered data

Let's look at all unique stop causes. Notice that there are three different causes related to bicycles, which should probably all fall under the same cause.
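A minimal sketch of how the bicycle-related causes could be merged, assuming they differ only in how they were typed in. The cause labels below are hypothetical stand-ins, not the dataset's actual spellings.

```python
import pandas as pd

# Hypothetical cause labels; the real dataset's spellings may differ.
causes = pd.Series([
    'Moving Violation', 'Bicycle', 'Bicycle Bicycle',
    'Moving Violation', 'Bicycle Violation',
])

# Map every bicycle-related variant to a single label.
bicycle_variants = {
    'Bicycle Bicycle': 'Bicycle',
    'Bicycle Violation': 'Bicycle',
}
cleaned_causes = causes.replace(bicycle_variants)

print(cleaned_causes.value_counts())
```

An explicit mapping keeps the cleaning decision visible and easy to review, which matters when the merge is a judgment call.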

Let's plot the distribution of ages, within a reasonable range (15 to 85). What do you notice? How could we address this?
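The counts behind such a plot can be sketched as follows, again on made-up ages. `between` is inclusive on both ends by default, so 15 and 85 themselves are kept.

```python
import pandas as pd

# Hypothetical ages, deliberately including values outside the plotted range.
ages = pd.Series(
    [0, 18, 20, 20, 25, 25, 25, 27, 30, 30, 85, 220],
    name='subject_age',
)

# Keep only ages in the range considered reasonable (15 to 85, inclusive).
reasonable = ages[ages.between(15, 85)]

# Count how often each age appears, sorted by age.
# These counts are what a histogram with one bin per age would display.
age_counts = reasonable.value_counts().sort_index()
print(age_counts)
```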

Now let's look at the first few and last few rows of stops.

Do you think '-0:81' is a time that a computer would record?
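One way to surface values like this is to try parsing every time string and see which ones fail. The sketch below assumes times are stored as 'HH:MM' strings. The Series is made up, apart from '-0:81', which is the value quoted above.

```python
import pandas as pd

# Hypothetical time strings, plus the suspicious value quoted above.
times = pd.Series(['13:05', '07:40', '-0:81', '23:59'])

# errors='coerce' turns anything that does not match the format into NaT
# instead of raising an error.
parsed = pd.to_datetime(times, format='%H:%M', errors='coerce')

# The original strings that could not be parsed.
invalid_times = times[parsed.isna()]
print(invalid_times.tolist())
```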

Unfaithful data vs. outliers

Watch out for...

Missing values

Where'd you go?

Common representations of "null"

Missing values in the stops dataset

What are the non-NaN null values in the stops dataset?
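Placeholder values only show up in `isna` once they have been converted to true nulls. The sketch below assumes an age of 0 and the strings '' and 'N/A' are being used as placeholders. Both the placeholder choices and the 'stop_cause' column name are hypothetical, not taken from the actual dataset.

```python
import pandas as pd

# Hypothetical rows; the placeholder encodings here are assumptions.
demo = pd.DataFrame({
    'subject_age': [25, 0, 41, 0],
    'stop_cause': ['Moving Violation', 'N/A', 'Equipment Violation', ''],
})

# Before cleaning, pandas sees no missing values at all.
nulls_before = demo.isna().sum().sum()

# Replace the assumed placeholders with NaN.
demo['subject_age'] = demo['subject_age'].mask(demo['subject_age'] == 0)
demo['stop_cause'] = demo['stop_cause'].mask(demo['stop_cause'].isin(['', 'N/A']))

# After cleaning, isna() counts them per column.
null_counts = demo.isna().sum()
print(null_counts)
```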

Finding null values in pandas

Dropping observations with null values

When used on a DataFrame:
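A minimal sketch of `dropna` on a small made-up DataFrame: by default it drops every row containing at least one null value, `how='all'` drops only rows that are entirely null, and `subset` restricts the check to particular columns.

```python
import numpy as np
import pandas as pd

# Made-up DataFrame for illustration.
df = pd.DataFrame({
    'A': [1.0, np.nan, 3.0, np.nan],
    'B': ['x', 'y', None, None],
})

any_null_dropped = df.dropna()            # drops rows with at least one null
all_null_dropped = df.dropna(how='all')   # drops rows where every value is null
subset_dropped = df.dropna(subset=['A'])  # only checks column 'A' for nulls

print(len(any_null_dropped), len(all_null_dropped), len(subset_dropped))
```

None of these calls modifies `df` itself; each returns a new DataFrame.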

Filling null values

As you've seen, the fillna method replaces all null values. Specifically:
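A minimal sketch of the common ways to fill null values, using a small made-up DataFrame. The fill values (0, the column mean, and the string 'unknown') are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Made-up DataFrame for illustration.
df = pd.DataFrame({
    'A': [1.0, np.nan, 3.0],
    'B': ['x', None, 'z'],
})

# Fill one column with a single constant.
filled_scalar = df['A'].fillna(0)

# Fill each column with its own value by passing a dictionary.
filled_by_col = df.fillna({'A': df['A'].mean(), 'B': 'unknown'})

# Propagate the last valid value downward ("forward fill").
filled_forward = df.ffill()
```

As with `dropna`, these calls return new objects and leave `df` unchanged.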

Filling null values, so far

Hypothesis testing

Example: Coin flipping

Test statistics

To decide, we need to know how rare it is to see 59 heads and 41 tails, or a result that's even more biased in favor of heads, when flipping a fair coin 100 times.

For the alternative hypothesis "the coin was biased towards heads", we could use:

For simplicity, we'll start with $N_H$, the number of heads.

Generating the null distribution
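One way to generate the null distribution is by simulation: repeatedly flip a fair coin 100 times and record the number of heads each time. The sketch below is one possible approach. The number of repetitions and the seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen arbitrarily, for reproducibility

n_flips = 100
n_repetitions = 100_000  # assumed number of simulated experiments

# Each entry is the number of heads in 100 flips of a fair coin.
simulated_heads = rng.binomial(n=n_flips, p=0.5, size=n_repetitions)

# Proportion of simulated experiments with at least as many heads
# as the 59 that were observed.
simulated_p_value = (simulated_heads >= 59).mean()
print(simulated_p_value)
```

Because the result comes from random sampling, it only approximates the true probability, and it will vary slightly with the seed and the number of repetitions.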

Generating the null distribution, using math

The number of heads in 100 flips of a fair coin follows the $\text{Binomial(100, 0.5)}$ distribution, in which

$$P(\text{# heads} = k) = {100 \choose k} (0.5)^k{(1-0.5)^{100-k}} = {100 \choose k} 0.5^{100}$$

The probability that we see at least 59 heads is then:

$$P(\text{# heads} \geq 59) = \sum_{k=59}^{100} {100 \choose k} 0.5^{100}$$
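This tail probability can be computed exactly by summing the binomial probabilities for 59 through 100 heads. The sketch below uses only the standard library. The variable names are illustrative.

```python
from math import comb

n_flips = 100
observed_heads = 59

# Sum P(# heads = k) for k = 59, ..., 100 under a fair coin.
# Every term shares the factor 0.5 ** 100, so it is applied once at the end.
n_favorable = sum(comb(n_flips, k) for k in range(observed_heads, n_flips + 1))
p_at_least_59 = n_favorable / 2 ** n_flips

print(round(p_at_least_59, 3))
```

Python integers have arbitrary precision, so the sum of binomial coefficients is exact and only the final division is a floating-point approximation.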

Let's look at this distribution visually.

Making a decision

We saw that, in 100 flips of a fair coin, $P(\text{# heads} \geq 59)$ is only ~4.4%.

Fun fact

Summary, next time

Summary

Next time