Lecture 4 – Messy Data

DSC 80, Spring 2022

Announcements

Agenda

Recap: adding columns

Adding and modifying columns, using a copy

As an aside, you should try your best to write chained pandas code, as follows:

You can chain together several steps at a time:

You can also use assign when the desired column name has spaces, by using keyword arguments.

Adding and modifying columns, in-place

Introduction to messy data

There is no such thing as "raw data"!

Data generating process

Example: COVID case counts 🦠

Suppose our goal is to determine the number of COVID cases in the US yesterday.

Why do you think so few cases were reported on Christmas Day – is it because COVID was less prevalent on Christmas Day as compared to the days before and after, or is it likely for some other reason? 🎅

Data provenance

The bigger picture question we're asking here is, can we trust our data?

Data cleaning 🧹

Keys to data cleaning

Data cleaning often addresses:

Let's focus on the latter two.

Kinds of data

Kinds of data

Discussion Question

Determine the kind of each of the following variables.

Example: DSC 80 students

In the next cell, we'll load in an example dataset containing information about past DSC 80 students.

What needs to be changed in the DataFrame to compute statistics?

How much has each student paid in total tuition in 2021 and 2022?

Check the data types of students!

Cleaning '2021 tuition' and '2022 tuition'

We can loop through the columns of students to apply the above procedure. (Looping through columns is fine, just avoid looping through rows.)

Alternatively, we can do this without a loop by using str.contains to find only the columns that contain tuition information.

Cleaning 'Paid'

Cleaning 'Month', 'Day', and 'Year'

Note:

Cleaning 'DSC 80 Final Grade'

Cleaning 'Student Name'

Aside: string methods

str methods are useful – use them!

More data type ambiguities

Example: the Norway problem 🇳🇴

Unfaithful data

Is the data "faithful" to the DGP?

Is the data "faithful" to the DGP?

Example: Police vehicle stops 🚔

The dataset we're working with contains all of the vehicle stops that the San Diego Police Department made in 2016.

General questions

  1. Check the data types. Notice any issues?
  2. Do string fields have consistent values?
  3. Are there missing values that we don't understand?
  4. Are all values within a reasonable range?
  5. How do we deal with the messiness we find?

Data types

Unfaithfulness

Ages range all over the place, from 0 to 220. Was a 220 year old really pulled over?

Unfaithful 'subject_age'

In the coming weeks, we'll cover more solutions to these problems.

Summary, next time

Summary