In [1]:

```
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
# plt.rcParams['figure.figsize'] = (10, 5)
plt.style.use('seaborn-white') # seaborn custom plot style
plt.rc('figure', dpi=100, figsize=(10, 5)) # set default size/resolution
plt.rc('font', size=12) # font size
```

- Project 1 is released!
- The Checkpoint is due
**tomorrow at 11:59PM**. - The whole project is due on
**Thursday, April 14th at 11:59PM**. - Use this sheet to find a pair programming partner.

- The Checkpoint is due
- Lab 2 is due on
**Monday, April 11th at 11:59PM**. - There is only one discussion section now, on Wednesdays from
**7-8:30PM**.

- Unfaithful data.
- Missing values.
- Hypothesis testing.

In other words, how well does the data represent reality?

Does the data contain unrealistic or "incorrect" values?

- Dates in the future for events in the past.
- Locations that don't exist.
- Negative counts.
- Misspellings of names.
- Large outliers.

- Does the data violate obvious dependencies?
- Age and birthday don't match.

- Was the data entered by hand?
- Spelling errors.
- Fields shifted.
- Did the form require fields or provide default values?

- Are there obvious signs of data falsification (aka "curbstoning")?
- Repeated names.
- Fake looking email addresses.
- Repeated use of uncommon names or fields.

The dataset we're working with contains all of the vehicle stops that the San Diego Police Department made in 2016.

- Check the data types. Notice any issues?
- Do string fields have consistent values?
- Are there missing values that we don't understand?
- Are all values within a reasonable range?
- How do we deal with the messiness we find?

In [2]:

```
stops = pd.read_csv('data/vehicle_stops_2016_datasd.csv')
stops.head()
```

Out[2]:

stop_id | stop_cause | service_area | subject_race | subject_sex | subject_age | timestamp | stop_date | stop_time | sd_resident | arrested | searched | obtained_consent | contraband_found | property_seized | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

0 | 1308198 | Equipment Violation | 530 | W | M | 28 | 2016-01-01 00:06:00 | 2016-01-01 | 0:06 | Y | N | N | N | N | N |

1 | 1308172 | Moving Violation | 520 | B | M | 25 | 2016-01-01 00:10:00 | 2016-01-01 | 0:10 | N | N | N | NaN | NaN | NaN |

2 | 1308171 | Moving Violation | 110 | H | F | 31 | 2016-01-01 00:14:00 | 2016-01-01 | 0:14 | NaN | NaN | NaN | NaN | NaN | NaN |

3 | 1308170 | Moving Violation | Unknown | W | F | 29 | 2016-01-01 00:16:00 | 2016-01-01 | 0:16 | N | N | N | NaN | NaN | NaN |

4 | 1308197 | Moving Violation | 230 | W | M | 52 | 2016-01-01 00:30:00 | 2016-01-01 | 0:30 | N | N | N | NaN | NaN | NaN |

- Are the data types correct?
- If not, are they easily fixable?

In [3]:

```
stops.head(1)
```

Out[3]:

stop_id | stop_cause | service_area | subject_race | subject_sex | subject_age | timestamp | stop_date | stop_time | sd_resident | arrested | searched | obtained_consent | contraband_found | property_seized | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

0 | 1308198 | Equipment Violation | 530 | W | M | 28 | 2016-01-01 00:06:00 | 2016-01-01 | 0:06 | Y | N | N | N | N | N |

In [4]:

```
stops.info()
```

- Are there suspicious values?
- If a value is suspicious, can we trust the observation?
- For example, consider
`'subject_age'`

– some are too high to be true, some are too low to be true.

In [5]:

```
stops['subject_age'].unique()
```

Out[5]:

array(['28', '25', '31', '29', '52', '24', '20', '50', '23', '54', '53', '35', '38', '18', '48', '68', '45', '63', '49', '42', '27', '19', '55', '32', '47', '33', '41', '59', '60', '58', '26', '36', '40', '39', '21', '64', '30', '43', '17', '51', '34', '56', '44', '22', '69', '46', '16', '57', '37', '65', '72', '67', '66', '70', '62', '73', '74', '0', '77', nan, '89', '79', '61', '78', '99', '75', '85', '82', '71', '15', '80', '81', '93', '84', '76', '2', '4', '86', '91', '83', '88', '98', '87', 'No Age', '9', '100', '14', '95', '96', '92', '119', '1', '90', '163', '5', '114', '94', '10', '212', '220', '6', '145', '97', '120'], dtype=object)

In [6]:

```
ages = pd.to_numeric(stops['subject_age'], errors='coerce')
ages.describe()
```

Out[6]:

count 99648.000000 mean 37.277697 std 14.456934 min 0.000000 25% 25.000000 50% 34.000000 75% 47.000000 max 220.000000 Name: subject_age, dtype: float64

Ages range all over the place, from 0 to 220. Was a 220 year old really pulled over?

In [7]:

```
stops.loc[ages > 100]
```

Out[7]:

stop_id | stop_cause | service_area | subject_race | subject_sex | subject_age | timestamp | stop_date | stop_time | sd_resident | arrested | searched | obtained_consent | contraband_found | property_seized | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

25029 | 1333254 | Equipment Violation | 820 | W | F | 119 | 2016-03-19 22:20:00 | 2016-03-19 | 22:20 | N | N | N | NaN | NaN | NaN |

34179 | 1375483 | Moving Violation | 510 | W | M | 163 | 2016-04-18 16:35:00 | 2016-04-18 | 16:35 | N | N | N | NaN | NaN | NaN |

36968 | 1378369 | Moving Violation | 240 | A | F | 114 | 2016-04-28 11:44:00 | 2016-04-28 | 11:44 | NaN | NaN | NaN | NaN | NaN | NaN |

63570 | 1405478 | Moving Violation | 110 | V | M | 212 | 2016-08-04 18:10:00 | 2016-08-04 | 18:10 | Y | N | N | NaN | NaN | NaN |

70267 | 1411694 | Moving Violation | 310 | H | F | 220 | 2016-08-30 18:28:00 | 2016-08-30 | 18:28 | Y | N | N | NaN | NaN | NaN |

77038 | 1418885 | Moving Violation | 830 | B | M | 145 | 2016-09-22 20:35:00 | 2016-09-22 | 20:35 | Y | N | Y | N | N | NaN |

99449 | 1440889 | Equipment Violation | 120 | W | F | 120 | 2016-12-15 19:28:00 | 2016-12-15 | 19:28 | Y | N | N | NaN | NaN | NaN |

In [8]:

```
ages.loc[(ages >= 0) & (ages < 16)].value_counts()
```

Out[8]:

0.0 218 15.0 27 2.0 6 14.0 5 4.0 4 1.0 2 10.0 2 9.0 1 5.0 1 6.0 1 Name: subject_age, dtype: int64

In [9]:

```
stops.loc[(ages >= 0) & (ages < 16)]
```

Out[9]:

stop_id | stop_cause | service_area | subject_race | subject_sex | subject_age | timestamp | stop_date | stop_time | sd_resident | arrested | searched | obtained_consent | contraband_found | property_seized | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

547 | 1308628 | Moving Violation | 120 | W | M | 0 | 2016-01-03 16:20:00 | 2016-01-03 | 16:20 | NaN | NaN | NaN | NaN | NaN | NaN |

747 | 1308820 | Moving Violation | Unknown | H | F | 0 | 2016-01-04 22:13:00 | 2016-01-04 | 22:13 | N | N | N | NaN | NaN | NaN |

1686 | 1309896 | Moving Violation | 120 | W | M | 15 | 2016-01-08 19:30:00 | 2016-01-08 | 19:30 | N | N | N | NaN | NaN | NaN |

1752 | 1309975 | Moving Violation | Unknown | O | M | 0 | 2016-01-08 23:20:00 | 2016-01-08 | 23:20 | N | N | N | NaN | NaN | NaN |

2054 | 1310377 | Equipment Violation | 430 | H | F | 0 | 2016-01-10 16:42:00 | 2016-01-10 | 16:42 | Y | N | N | NaN | NaN | NaN |

... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

100465 | 1442806 | Moving Violation | 810 | O | M | 15 | 2016-12-20 10:45:00 | 2016-12-20 | 10:45 | Y | N | N | NaN | NaN | NaN |

101495 | 1443500 | Moving Violation | 320 | W | M | 0 | 2016-12-23 22:30:00 | 2016-12-23 | 22:30 | NaN | NaN | NaN | NaN | NaN | NaN |

101784 | 1443745 | Moving Violation | Unknown | O | M | 0 | 2016-12-26 19:30:00 | 2016-12-26 | 19:30 | NaN | NaN | NaN | NaN | NaN | NaN |

101790 | 1443747 | Moving Violation | 320 | W | F | 0 | 2016-12-26 20:00:00 | 2016-12-26 | 20:00 | NaN | NaN | NaN | NaN | NaN | NaN |

102848 | 1444647 | Moving Violation | 320 | W | M | 0 | 2016-12-31 17:45:00 | 2016-12-31 | 17:45 | NaN | NaN | NaN | NaN | NaN | NaN |

267 rows × 15 columns

`'subject_age'`

¶- Ages of
`'No Age'`

and`0`

are likely explicit null values. - What do we do about the exceptionally small and large ages?
- Do we throw the entire row away, even if the rest of row is well-formed?

- What about the 14 and 15 year olds?
- Each has more than one occurrence – these could be real entries!

- Which fields were likely entered by a human?
- Which fields were likely generated by code?
- What was the original source?

Let's look at all unique stop causes. Notice that there are three different causes related to bicycles, which should probably all fall under the same cause.

In [10]:

```
stops['stop_cause'].value_counts()
```

Out[10]:

Moving Violation 75200 Equipment Violation 26234 Radio Call/Citizen Contact 571 Muni, County, H&S Code 319 Personal Knowledge/Informant 289 No Cause Specified on a Card 184 Suspect Info (I.S., Bulletin, Log) 181 Personal Observ/Knowledge 45 MUNI, County, H&S Code 14 Bicycle 2 Suspect Info 2 BICYCLE 1 B & P 1 Bicycle Bicycle 1 Name: stop_cause, dtype: int64

Let's plot the distribution of ages, within a reasonable range (15 to 85). What do you notice?

In [11]:

```
# DSC 10 review: what does density=True do?
ages.loc[(ages > 15) & (ages <= 85)].plot(kind='hist', density=True, bins=70, ec='w');
```

Now let's look at the first few and last few rows of `stops`

.

In [12]:

```
stops[['timestamp', 'stop_date', 'stop_time']].head()
```

Out[12]:

timestamp | stop_date | stop_time | |
---|---|---|---|

0 | 2016-01-01 00:06:00 | 2016-01-01 | 0:06 |

1 | 2016-01-01 00:10:00 | 2016-01-01 | 0:10 |

2 | 2016-01-01 00:14:00 | 2016-01-01 | 0:14 |

3 | 2016-01-01 00:16:00 | 2016-01-01 | 0:16 |

4 | 2016-01-01 00:30:00 | 2016-01-01 | 0:30 |

In [13]:

```
stops[['timestamp', 'stop_date', 'stop_time']].tail(10)
```

Out[13]:

timestamp | stop_date | stop_time | |
---|---|---|---|

103041 | NaN | 2016-12-13 | 0.398611111 |

103042 | NaN | 2016-12-15 | 0.383333333 |

103043 | NaN | 2016-12-15 | 0.397916667 |

103044 | NaN | 2016-12-19 | 0:91 |

103045 | NaN | 2016-12-20 | 0:82 |

103046 | NaN | 2016-12-20 | 0.397916667 |

103047 | NaN | 2016-12-21 | 0:73 |

103048 | NaN | 2016-12-21 | 0:94 |

103049 | NaN | 2016-12-29 | 0:81 |

103050 | NaN | 2016-12-29 | -0:81 |

Do you think `'-0:81'`

is a time that a computer would record?

- Unfaithful data are data that don't accurately represent the data generating process.
- Outliers are "unusual" observations, unlike the rest of the data. They may be real, or they may be unfaithful.
- For instance, it's possible that a 102-year old was pulled over for speeding.

- The two are hard to tell apart; doing so often requires research and domain knowledge.

**Consistently "incorrect" values**.- Example: Recorded ages of -1 or 99.
- These are often "default" values, often used when a value is missing.
- Solution: Change the value to the correct one if it is known!

**Abnormal artifacts from the data collection process**.- Example: Spikes in recorded ages at round numbers (25, 30, 35, 40), or spikes in recorded COVID cases on Mondays.
- Solution: Try "smoothing", e.g. binning the ages.

**Unreasonable outliers**.- Example: Age of 200.
- Solution: Not sure. Could remove the row. Could be indicative of a bug in the data collection process. Could be real!

You'll use the following methods regularly when initially exploring a dataset.

`.describe()`

: see basic numerical information about a Series/DataFrame.`.info()`

: see data types and the number of missing values in a Series/DataFrame.`.value_counts()`

: see the distribution of a categorical variable.`.plot(kind='hist')`

: plot the distribution of a numerical variable.

- Missing values in a dataset can occur from:
- Intentional logic, where a value doesn't make sense.
- A non-response in the measurement process.
- Mistakes in the data recording process.
- ...

Another term for "missing" is "null".

Missing values are most often encoded with

`NULL`

,`None`

,`NaN`

,`''`

, etc.

- All forms of
`0`

(e.g.`0`

,`'0'`

,`'zero'`

) are common substitutes for null. - -1 is common if a column must be non-negative.
- 1900 and 1970 are common if a non-null date is required.
- Remember, Unix time starts counting from January 1, 1970.

- Some common representations for "null" are also real values themselves!
- For instance, the point 0°00'00.0"N+0°00'00.0"E in the South Atlantic Ocean is called "Null Island."

What are the non-`np.NaN`

null values in the stops dataset?

- Service Area:
`'Unknown'`

. - Subject Age:
`0`

,`'No Age'`

. - Others?

In [14]:

```
stops
```

Out[14]:

stop_id | stop_cause | service_area | subject_race | subject_sex | subject_age | timestamp | stop_date | stop_time | sd_resident | arrested | searched | obtained_consent | contraband_found | property_seized | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

0 | 1308198 | Equipment Violation | 530 | W | M | 28 | 2016-01-01 00:06:00 | 2016-01-01 | 0:06 | Y | N | N | N | N | N |

1 | 1308172 | Moving Violation | 520 | B | M | 25 | 2016-01-01 00:10:00 | 2016-01-01 | 0:10 | N | N | N | NaN | NaN | NaN |

2 | 1308171 | Moving Violation | 110 | H | F | 31 | 2016-01-01 00:14:00 | 2016-01-01 | 0:14 | NaN | NaN | NaN | NaN | NaN | NaN |

3 | 1308170 | Moving Violation | Unknown | W | F | 29 | 2016-01-01 00:16:00 | 2016-01-01 | 0:16 | N | N | N | NaN | NaN | NaN |

4 | 1308197 | Moving Violation | 230 | W | M | 52 | 2016-01-01 00:30:00 | 2016-01-01 | 0:30 | N | N | N | NaN | NaN | NaN |

... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

103046 | 1443046 | Moving Violation | 110 | H | M | 40 | NaN | 2016-12-20 | 0.397916667 | Y | N | N | NaN | NaN | NaN |

103047 | 1443132 | Moving Violation | 310 | W | M | No Age | NaN | 2016-12-21 | 0:73 | N | N | N | NaN | NaN | NaN |

103048 | 1443169 | Moving Violation | Unknown | H | M | 26 | NaN | 2016-12-21 | 0:94 | Y | N | N | NaN | NaN | NaN |

103049 | 1447090 | Moving Violation | 430 | F | F | 63 | NaN | 2016-12-29 | 0:81 | N | N | N | NaN | NaN | NaN |

103050 | 1445548 | Moving Violation | 320 | O | F | 34 | NaN | 2016-12-29 | -0:81 | Y | N | N | NaN | NaN | NaN |

103051 rows × 15 columns

`pandas`

¶- Null values are encoded using NumPy's
`NaN`

value, which is of type`float`

. - The
`isna`

method for DataFrame/Series detects missing values.- It returns a Boolean DataFrame/Series.
`isnull`

is equivalent to`isna`

.

In [15]:

```
type(np.NaN)
```

Out[15]:

float

In [16]:

```
# All of the rows where the subject age is missing
stops[stops['subject_age'].isna()]
```

Out[16]:

stop_id | stop_cause | service_area | subject_race | subject_sex | subject_age | timestamp | stop_date | stop_time | sd_resident | arrested | searched | obtained_consent | contraband_found | property_seized | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

606 | 1309398 | Moving Violation | Unknown | H | M | NaN | 2016-01-04 00:01:00 | 2016-01-04 | 0:01 | N | N | N | NaN | NaN | NaN |

607 | 1309404 | Moving Violation | 510 | W | M | NaN | 2016-01-04 00:01:00 | 2016-01-04 | 0:01 | Y | N | N | NaN | NaN | NaN |

653 | 1309400 | Moving Violation | 620 | W | F | NaN | 2016-01-04 13:00:00 | 2016-01-04 | 13:00 | Y | N | N | NaN | NaN | NaN |

658 | 1309399 | Moving Violation | 620 | W | F | NaN | 2016-01-04 13:55:00 | 2016-01-04 | 13:55 | Y | N | N | NaN | NaN | NaN |

816 | 1309424 | Moving Violation | 620 | W | F | NaN | 2016-01-05 08:30:00 | 2016-01-05 | 8:30 | N | N | N | NaN | NaN | NaN |

... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

102989 | 1412614 | Moving Violation | 120 | W | F | NaN | NaN | 2016-08-30 | 24:40:00 | Y | N | N | NaN | NaN | NaN |

102990 | 1414071 | Moving Violation | 110 | W | M | NaN | NaN | 2016-08-31 | 0:91 | N | N | N | NaN | NaN | NaN |

102991 | 1414020 | Moving Violation | 930 | O | F | NaN | NaN | 2016-08-31 | 98:35:00 | Y | N | N | NaN | NaN | NaN |

103002 | 1420099 | Moving Violation | 610 | H | M | NaN | NaN | 2016-09-27 | 0.565972222 | N | N | N | NaN | NaN | NaN |

103005 | 1422234 | Moving Violation | 620 | W | M | NaN | NaN | 2016-10-05 | 0.417361111 | N | N | N | NaN | NaN | NaN |

2767 rows × 15 columns

In [17]:

```
# Proportion of values missing in the subject_age column
stops['subject_age'].isna().mean()
```

Out[17]:

0.026850782622196777

In [18]:

```
# Proportion of missing values in all columns
stops.isna().mean()
```

Out[18]:

stop_id 0.000000 stop_cause 0.000068 service_area 0.000000 subject_race 0.001271 subject_sex 0.001805 subject_age 0.026851 timestamp 0.001669 stop_date 0.000000 stop_time 0.000000 sd_resident 0.187888 arrested 0.180988 searched 0.191371 obtained_consent 0.953508 contraband_found 0.951781 property_seized 0.952218 dtype: float64

- The
`dropna`

method:- when used on a Series, returns a new Series with all null entries removed.
- when used on a DataFrame, returns a new DataFrame where all rows with at least one null value are removed.

- Don't drop rows unless absolutely necessary!
- Usually, there is still useful information in the other columns.

In [19]:

```
stops.head()
```

Out[19]:

stop_id | stop_cause | service_area | subject_race | subject_sex | subject_age | timestamp | stop_date | stop_time | sd_resident | arrested | searched | obtained_consent | contraband_found | property_seized | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

0 | 1308198 | Equipment Violation | 530 | W | M | 28 | 2016-01-01 00:06:00 | 2016-01-01 | 0:06 | Y | N | N | N | N | N |

1 | 1308172 | Moving Violation | 520 | B | M | 25 | 2016-01-01 00:10:00 | 2016-01-01 | 0:10 | N | N | N | NaN | NaN | NaN |

2 | 1308171 | Moving Violation | 110 | H | F | 31 | 2016-01-01 00:14:00 | 2016-01-01 | 0:14 | NaN | NaN | NaN | NaN | NaN | NaN |

3 | 1308170 | Moving Violation | Unknown | W | F | 29 | 2016-01-01 00:16:00 | 2016-01-01 | 0:16 | N | N | N | NaN | NaN | NaN |

4 | 1308197 | Moving Violation | 230 | W | M | 52 | 2016-01-01 00:30:00 | 2016-01-01 | 0:30 | N | N | N | NaN | NaN | NaN |

In [20]:

```
stops.dropna().head()
```

Out[20]:

stop_id | stop_cause | service_area | subject_race | subject_sex | subject_age | timestamp | stop_date | stop_time | sd_resident | arrested | searched | obtained_consent | contraband_found | property_seized | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

0 | 1308198 | Equipment Violation | 530 | W | M | 28 | 2016-01-01 00:06:00 | 2016-01-01 | 0:06 | Y | N | N | N | N | N |

8 | 1308979 | Moving Violation | 310 | H | F | 25 | 2016-01-01 01:03:00 | 2016-01-01 | 1:03 | Y | N | Y | N | N | N |

20 | 1308187 | Equipment Violation | 510 | H | M | 31 | 2016-01-01 08:15:00 | 2016-01-01 | 8:15 | Y | N | Y | N | Y | Y |

21 | 1308186 | Moving Violation | 710 | H | F | 48 | 2016-01-01 08:21:00 | 2016-01-01 | 8:21 | N | N | N | N | N | N |

39 | 1308193 | Moving Violation | 710 | H | F | 32 | 2016-01-01 11:09:00 | 2016-01-01 | 11:09 | N | N | N | N | N | N |

In [21]:

```
stops.shape
```

Out[21]:

(103051, 15)

In [22]:

```
stops.dropna().shape
```

Out[22]:

(4619, 15)

When used on a DataFrame:

`.dropna()`

drops**rows**containing**at least one**null value.`.dropna(how='all')`

drops**rows**containing**only**null values.`.dropna(axis=1)`

drops**columns**containing at least one null value.- Other keyword arguments:
`thresh`

,`subset`

.

In [23]:

```
nans = pd.DataFrame([[0, 1, np.NaN], [np.NaN, np.NaN, np.NaN], [1, 2, 3]], columns='A B C'.split())
nans
```

Out[23]:

A | B | C | |
---|---|---|---|

0 | 0.0 | 1.0 | NaN |

1 | NaN | NaN | NaN |

2 | 1.0 | 2.0 | 3.0 |

In [24]:

```
nans.dropna(how='any')
```

Out[24]:

A | B | C | |
---|---|---|---|

2 | 1.0 | 2.0 | 3.0 |

In [25]:

```
nans.dropna(how='all')
```

Out[25]:

A | B | C | |
---|---|---|---|

0 | 0.0 | 1.0 | NaN |

2 | 1.0 | 2.0 | 3.0 |

In [26]:

```
nans.dropna(axis=1)
```

Out[26]:

0 |
---|

1 |

2 |

In [27]:

```
nans.dropna(subset=['A', 'B'])
```

Out[27]:

A | B | C | |
---|---|---|---|

0 | 0.0 | 1.0 | NaN |

2 | 1.0 | 2.0 | 3.0 |

The `fillna`

method replaces all null values. Specifically:

`.fillna(val)`

fills null entries with the value`val`

.`.fillna(dict)`

fills null entries using a dictionary`dict`

of column/row values.`.fillna(method='bfill')`

and`.fillna(method='ffill')`

fill null entries using neighboring non-null entries.

In [28]:

```
nans
```

Out[28]:

A | B | C | |
---|---|---|---|

0 | 0.0 | 1.0 | NaN |

1 | NaN | NaN | NaN |

2 | 1.0 | 2.0 | 3.0 |

In [29]:

```
# Filling all NaNs with the same value
nans.fillna('billy')
```

Out[29]:

A | B | C | |
---|---|---|---|

0 | 0.0 | 1.0 | billy |

1 | billy | billy | billy |

2 | 1.0 | 2.0 | 3.0 |

In [30]:

```
# Filling NaNs differently for each column
nans.fillna({'A': 'f0', 'B': 'f1', 'C': 'f2'})
```

Out[30]:

A | B | C | |
---|---|---|---|

0 | 0.0 | 1.0 | f2 |

1 | f0 | f1 | f2 |

2 | 1.0 | 2.0 | 3.0 |

In [31]:

```
# Dictionary of column means
# Note that most numerical methods ignore null values
means = {c: nans[c].mean() for c in nans.columns}
means
```

Out[31]:

{'A': 0.5, 'B': 1.5, 'C': 3.0}

In [32]:

```
# Filling NaNs with column means
nans.fillna(means)
```

Out[32]:

A | B | C | |
---|---|---|---|

0 | 0.0 | 1.0 | 3.0 |

1 | 0.5 | 1.5 | 3.0 |

2 | 1.0 | 2.0 | 3.0 |

In [33]:

```
# Another way of doing the same thing
nans.apply(lambda x: x.fillna(x.mean()), axis=0)
```

Out[33]:

A | B | C | |
---|---|---|---|

0 | 0.0 | 1.0 | 3.0 |

1 | 0.5 | 1.5 | 3.0 |

2 | 1.0 | 2.0 | 3.0 |

In [34]:

```
nans
```

Out[34]:

A | B | C | |
---|---|---|---|

0 | 0.0 | 1.0 | NaN |

1 | NaN | NaN | NaN |

2 | 1.0 | 2.0 | 3.0 |

In [35]:

```
# bfill stands for "backfill"
nans.fillna(method='bfill')
```

Out[35]:

A | B | C | |
---|---|---|---|

0 | 0.0 | 1.0 | 3.0 |

1 | 1.0 | 2.0 | 3.0 |

2 | 1.0 | 2.0 | 3.0 |

In [36]:

```
# ffill stands for "forward fill"
nans.fillna(method='ffill')
```

Out[36]:

A | B | C | |
---|---|---|---|

0 | 0.0 | 1.0 | NaN |

1 | 0.0 | 1.0 | NaN |

2 | 1.0 | 2.0 | 3.0 |

`np.NaN`

¶- The result of
*any*comparison (`==`

,`!=`

,`<`

,`>`

) with`np.NaN`

is`False`

. - Instead, use the function
`pd.isna`

, which returns whether the argument is`np.NaN`

or`None`

.- Can also use
`pd.isnull`

.

- Can also use
- Remember,
`NaN`

is of type`float`

– watch out for type coercion!

In [37]:

```
nans
```

Out[37]:

A | B | C | |
---|---|---|---|

0 | 0.0 | 1.0 | NaN |

1 | NaN | NaN | NaN |

2 | 1.0 | 2.0 | 3.0 |

In [38]:

```
np.NaN == np.NaN
```

Out[38]:

False

In [39]:

```
pd.isna(np.NaN)
```

Out[39]:

True

In [40]:

```
nans.isna()
```

Out[40]:

A | B | C | |
---|---|---|---|

0 | False | False | True |

1 | True | True | True |

2 | False | False | False |

In [41]:

```
nans.isnull()
```

Out[41]:

A | B | C | |
---|---|---|---|

0 | False | False | True |

1 | True | True | True |

2 | False | False | False |

In [42]:

```
# Since np.NaN is a float, the Series is of type float despite the two ints
pd.Series([0, 1, np.NaN])
```

Out[42]:

0 0.0 1 1.0 2 NaN dtype: float64

- That's all we'll discuss regarding missing values for now.
- However, once we recap hypothesis and permutation testing, we will introduce the idea of
**imputation**, in which we will learn how to fill missing values using other information in the DataFrame. - Stay tuned!

Now our data is clean and we're confident that it's faithful to the data generating process.

How do we ask questions and draw conclusions about the data generating process, using our observed data?

Run the following cell to set things up.

In [43]:

```
np.random.seed(42)
flips = pd.DataFrame(np.random.choice(['H', 'T'], p=[0.55, 0.45], size=(114,1)), columns=['result'])
```

- Given a dataset of coin flips, we want to try and answer the question, "was the coin fair?"
- Do we "trust" the dataset?
- Maybe whoever kept track of the coin flips made some typos.

- What is "fair"?
- Ideally, we see the exact same number of heads and tails. But how often will that happen exactly?
- What is a reasonable deviation?

In [44]:

```
flips.head()
```

Out[44]:

result | |
---|---|

0 | H |

1 | T |

2 | T |

3 | T |

4 | H |

In [45]:

```
flips.value_counts()
```

Out[45]:

result H 68 T 46 dtype: int64

In [46]:

```
# The to_frame method converts a Series to a DataFrame
flips['result'].value_counts().to_frame()
```

Out[46]:

result | |
---|---|

H | 68 |

T | 46 |

In [47]:

```
# Normalized
flips['result'].value_counts(normalize=True).to_frame()
```

Out[47]:

result | |
---|---|

H | 0.596491 |

T | 0.403509 |

- We start with an initial belief as to how the data was generated, which is called a
**null hypothesis**.- In our example, it is that the coin was fair.
- The null hypothesis must be a
**probability model**, i.e. something that we can simulate under.

- Somehow, we need to decide whether our observation (e.g. 68 heads and 46 tails) is consistent with that belief.
- To make this decision, we will:
- Assume the belief is true.
- Consider all possible outcomes under that assumption, along with their probabilities.
- e.g. if the coin truly was fair, what's the probability of seeing 40% heads? 61% heads? 49% heads?

- See how likely our observation was, under this assumption.

A

**test statistic**is a number that we compute in each repetition of an experiment, to help us make a decision.Suppose a coin was flipped $N$ times, and $N_H$ flips were heads. Then, each of the following is a test statistic we could choose:

- $N_H$ (number of heads).
- $\frac{N_H}{N}$ (proportion of heads).
- $N_H - \frac{N}{2}$ (difference from expected number of heads).
- $|N_H - \frac{N}{2}|$ (absolute difference from expected number of heads).

The former three would be helpful for the alternative hypothesis "the coin was biased in favor of heads" (or tails).

- The latter would be helpful for the alternative hypothesis "the coin was biased."

- After choosing a test statistic, we need to compute the
**distribution of the test statistic, under the assumption that the null hypothesis is true**("under the null").- In DSC 10 and 80, we do this through simulation, which means our calculations are approximate.
- In other courses, you may do this by-hand (e.g. for the coin example you could use the binomial distribution).

- Once we have this distribution, we can compute
**the probability of seeing an observation as or more extreme than our observation**, under this assumption.- This is called a
**p-value**.

- This is called a
- If that probability is very small, it means that the null hypothesis is unlikely to explain our observation, and we should reject it.

Let's use the number of heads ($N_H$) as our test statistic. We need to:

- Compute the
**observed value**of the test statistic, i.e. the observed number of heads. - Simulate values of the test statistic under the null, i.e. under the assumption that the coin was fair.
- Use the resulting distribution to calculate the (approximate) probability of seeing 68 or more heads, under the assumption the coin was fair.

In [48]:

```
# This DataFrame contains our "observed data"
flips.head()
```

Out[48]:

result | |
---|---|

0 | H |

1 | T |

2 | T |

3 | T |

4 | H |

In [49]:

```
# Number of coin flips
flips.shape
```

Out[49]:

(114, 1)

In [50]:

```
# Observed statistic
obs = (flips['result'] == 'H').sum()
obs
```

Out[50]:

68

In [51]:

```
# Number of simulations
N = 10000
# 10000 times, we want to flip a coin 114 times
results = []
for _ in range(N):
simulation = np.random.choice(['H', 'T'], p=[0.5, 0.5], size=114)
sim_heads = (simulation == 'H').sum() # Test statistic
results.append(sim_heads)
```

Each entry in `results`

is the number of heads in 114 simulated coin flips.

In [52]:

```
results[:10]
```

Out[52]:

[55, 50, 55, 48, 70, 53, 65, 55, 57, 48]

In [53]:

```
pd.Series(results).plot(kind='hist',
density=True,
bins=np.arange(35, 76, 1),
ec='w',
title='Number of Heads in 114 Flips of a Fair Coin');
plt.axvline(x=obs, color='red', linewidth=4);
```

**Question:** Do you think the coin was fair?

In [54]:

```
(np.array(results) >= obs).mean()
```

Out[54]:

0.0244

- Under the assumption the coin is fair, the probability of seeing 68 or more heads is ~2.5%.
- This is called a
**p-value**.

- This is called a
- So either:
- The coin is fair and we saw a really rare event, or
- The coin is not fair.

- We need a
**cutoff**to determine whether to reject the null hypothesis, given this probability.

- Data cleaning is the process of transforming data so that it is an accurate representation of the data generating process.
- Unfaithful data is data that is not representative of the data generating process. When working with messy data, we must look for:
- Missing values (i.e. "null" values).
- Incorrect values.

- Useful methods to be aware of:
`fillna`

,`isna`

/`isnull`

,`dropna`

. - Hypothesis testing allows us to make confident conclusions regarding the data generating process, given some observed data.
**Next time:**how to perform a "faster" hypothesis test. More test statistics and examples.