In [1]:
from dsc80_utils import *
This notebook contains code (e.g. the answers to exercises) that was written live during Lecture 5. If you haven't already watched and worked through this lecture, you might find it more beneficial to look at the "blank" version of this lecture and answer the exercises yourself.

Lecture 5 – Exploratory Data Analysis and Data Cleaning¶

DSC 80, Winter 2024¶

Announcements 📣¶

  • Project 1 is due on Saturday, January 27th.
  • Lab 3 will be released tonight and will be due on Monday, January 29th.
  • If you submitted Lab 2, make sure to attend discussion tomorrow and submit the reflection form on Gradescope by this Thursday for extra credit!
    • Plans for discussion tomorrow, in addition to lab review: the difference between int and np.int64, a review of the various DataFrameGroupBy methods (agg, transform, filter, apply).

Agenda 📆¶

  • Other data representations.
  • Dataset overview.
  • Introduction to plotly.
  • Exploratory data analysis and feature types.
  • Data cleaning.
    • Data quality checks.
    • Missing values.
    • Transformations and timestamps.
    • Modifying structure.
  • Investigating student-submitted questions!

Question 🤔 (Answer at q.dsc80.com)

Remember, you can always ask questions at q.dsc80.com! If the link doesn't work for you, click the 🤔 Lecture Questions link in the top right corner of the course website.

Other data representations¶

Representations of tabular data¶

  • In DSC 80, we work with DataFrames in pandas.
    • When we say pandas DataFrame, we're talking about the pandas API for its DataFrame objects.
      • API stands for "application programming interface." We'll learn about these more soon.
    • When we say "DataFrame", we're referring to a general way to represent data (rows and columns, with labels for both rows and columns).
  • There are many other ways to work with data tables!
    • Examples: R data frames, SQL databases, spreadsheets, or even matrices from linear algebra.
    • When you learn SQL in DSC 100, you'll find many similarities (e.g. slicing columns, filtering rows, grouping, joining, etc.).
    • Relational algebra captures common data operations between many data table systems.
  • Why use DataFrames over something else?

DataFrames vs. spreadsheets¶

  • DataFrames give us a data lineage: the code records how the data have been changed. Not so in spreadsheets!
  • Using a general-purpose programming language lets us handle much larger datasets, and we can use distributed computing systems for truly massive ones.

DataFrames vs. matrices¶

$$\mathbf{X} = \begin{bmatrix} 1 & 0 \\ 0 & 4 \\ 0 & 0 \end{bmatrix}$$
  • Matrices are mathematical objects. They only hold numbers, but have many useful properties (which you've learned about in your linear algebra class, Math 18).
  • Often, we process data from a DataFrame into matrix format for machine learning models. You saw this a bit in DSC 40A, and we'll see this more in DSC 80 in a few weeks.
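
Here's a minimal sketch of that conversion (the tiny DataFrame below is hypothetical, not part of today's dataset):

import pandas as pd

# A small numeric DataFrame; to_numpy() drops the row and column
# labels and leaves just the numbers, matching the matrix above.
df_num = pd.DataFrame({'a': [1, 0, 0], 'b': [0, 4, 0]})
X = df_num.to_numpy()
X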

DataFrames vs. relations¶

  • Relations are the data representation for relational database systems (e.g. MySQL, PostgreSQL, etc.).
  • You'll learn all about these in DSC 100.
  • Database systems are much better than DataFrames at storing many data tables and handling concurrency (many people reading and writing data at the same time).
  • Common workflow: load a subset of data in from a database system into pandas, then make a plot.
  • Or: load and clean data in pandas, then store it in a database system for others to use.
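
As a hedged sketch of that round trip, using Python's built-in sqlite3 (the database file and table name here are made up for illustration):

import sqlite3
import pandas as pd

conn = sqlite3.connect('example.db')

# pandas -> database: store a cleaned table for others to use.
cleaned = pd.DataFrame({'business_id': [1, 2], 'score': [96, 88]})
cleaned.to_sql('inspections_clean', conn, if_exists='replace', index=False)

# database -> pandas: load only the subset we need, then analyze or plot.
pd.read_sql('SELECT * FROM inspections_clean WHERE score >= 90', conn)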

Dataset overview¶

San Diego food safety¶

From this article (archive link):

In the last three years, one third of San Diego County restaurants have had at least one major food safety violation.

99% Of San Diego Restaurants Earn ‘A’ Grades, Bringing Usefulness of System Into Question¶

From this article (archive link):

Food held at unsafe temperatures. Employees not washing their hands. Dirty countertops. Vermin in the kitchen. An expired restaurant permit.

Restaurant inspectors for San Diego County found these violations during a routine health inspection of a diner in La Mesa in November 2016. Despite the violations, the restaurant was awarded a score of 90 out of 100, the lowest possible score to achieve an ‘A’ grade.

The data¶

  • We downloaded the data about the 1000 restaurants closest to UCSD from here.
  • We had to download the data as JSON files, then process them into DataFrames (a sketch of the idea appears after this list). You'll learn how to do this soon!
    • Until now, you've (largely) been presented with CSV files that pd.read_csv could load without any issues.
    • But there are many different formats and possible issues when loading data in from files.
    • See Chapter 8 of Learning DS for more.
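
As a preview, here's a minimal sketch of that JSON-to-DataFrame step, assuming a hypothetical file 'restaurants.json' that holds a list of records (the real files needed more processing than this):

import json
import pandas as pd

with open('restaurants.json') as f:
    records = json.load(f)  # A list of dicts, one per restaurant.

pd.DataFrame(records)        # One row per record.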
In [2]:
rest_path = Path('data') / 'restaurants.csv'
insp_path = Path('data') / 'inspections.csv'
viol_path = Path('data') / 'violations.csv'
In [3]:
rest = pd.read_csv(rest_path)
insp = pd.read_csv(insp_path)
viol = pd.read_csv(viol_path)

Exercise

The first article said that one third of restaurants had at least one major safety violation.
Which DataFrames and columns seem most useful to verify this?
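
One possible approach, sketched below (not a verified answer; the article's one-third figure covers all of San Diego County, so it won't match exactly for our 1000 nearby restaurants): viol['major_violation'] flags major violations, and insp links each 'inspection_id' to a 'business_id'.

# Fraction of our restaurants with at least one major violation.
major_ids = viol.loc[viol['major_violation'] == 'Y', 'inspection_id']
flagged = insp[insp['inspection_id'].isin(major_ids)]['business_id'].nunique()
flagged / rest.shape[0]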
In [4]:
rest
Out[4]:
business_id name business_type address ... lat long opened_date distance
0 211898487641 MOBIL MART LA JOLLA VILLAGE Pre-Packaged Retail Market 3233 LA JOLLA VILLAGE DR, LA JOLLA, CA 92037 ... 32.87 -117.23 2002-05-05 0.62
1 211930769329 CAFE 477 Low Risk Food Facility 8950 VILLA LA JOLLA DR, SUITE# B123, LA JOLLA,... ... 32.87 -117.24 2023-07-24 0.64
2 211909057778 VALLEY FARM MARKET Retail Market with Deli 6902 LA JOLLA BLVD, LA JOLLA, CA 92037 ... 32.87 -117.24 2019-01-22 0.64
... ... ... ... ... ... ... ... ... ...
997 211899338714 PACIFIC BEACH ELEMENTARY School Processing Food Facility 1234 TOURMALINE ST, SAN DIEGO, CA 92109-1856 ... 32.81 -117.25 2002-05-05 4.97
998 211942150255 POKEWAN DEL MAR Restaurant Food Facility 12925 EL CAMINO REAL, SUITE# AA4, SAN DIEGO, C... ... 32.95 -117.23 2016-11-03 4.97
999 211925713322 SAFFRONO LOUNGE RESTAURANT Restaurant Food Facility 2672 DEL MAR HEIGHTS RD, DEL MAR, CA 92014 ... 32.95 -117.25 2022-11-03 4.97

1000 rows × 12 columns

In [5]:
rest.columns
Out[5]:
Index(['business_id', 'name', 'business_type', 'address', 'city', 'zip',
       'phone', 'status', 'lat', 'long', 'opened_date', 'distance'],
      dtype='object')
In [6]:
insp.head()
Out[6]:
custom_id business_id inspection_id description ... completed_date status link status_link
0 DEH2002-FFPN-310012 211898487641 6886133 NaN ... 2023-02-16 Complete http://www.sandiegocounty.gov/deh/fhd/ffis/ins... http://www.sandiegocounty.gov/deh/fhd/ffis/ins...
1 DEH2002-FFPN-310012 211898487641 6631228 NaN ... 2022-01-03 Complete http://www.sandiegocounty.gov/deh/fhd/ffis/ins... http://www.sandiegocounty.gov/deh/fhd/ffis/ins...
2 DEH2002-FFPN-310012 211898487641 6357338 NaN ... 2020-12-03 Complete http://www.sandiegocounty.gov/deh/fhd/ffis/ins... http://www.sandiegocounty.gov/deh/fhd/ffis/ins...
3 DEH2023-FFPP-016887 211930769329 7329834 NaN ... 2023-09-20 Complete http://www.sandiegocounty.gov/deh/fhd/ffis/ins... http://www.sandiegocounty.gov/deh/fhd/ffis/ins...
4 DEH2019-FFPP-010654 211909057778 7233091 NaN ... 2023-05-26 Complete http://www.sandiegocounty.gov/deh/fhd/ffis/ins... http://www.sandiegocounty.gov/deh/fhd/ffis/ins...

5 rows × 11 columns

In [7]:
insp.columns
Out[7]:
Index(['custom_id', 'business_id', 'inspection_id', 'description', 'type',
       'score', 'grade', 'completed_date', 'status', 'link', 'status_link'],
      dtype='object')
In [8]:
viol.head()
Out[8]:
inspection_id violation major_violation status violation_text correction_type_link violation_accela link
0 6886133 Hot and Cold Water Y Out of Compliance - Major Hot and Cold Water http://www.sandiegocounty.gov/deh/fhd/ffis/vio... 21. Hot & cold water available http://www.sandiegocounty.gov/deh/fhd/ffis/vio...
1 6631228 Hot and Cold Water N Out of Compliance - Minor Hot and Cold Water http://www.sandiegocounty.gov/deh/fhd/ffis/vio... 21. Hot & cold water available http://www.sandiegocounty.gov/deh/fhd/ffis/vio...
2 6357338 Holding Temperatures N Out of Compliance - Minor Holding Temperatures http://www.sandiegocounty.gov/deh/fhd/ffis/vio... 7. Proper hot & cold holding temperatures http://www.sandiegocounty.gov/deh/fhd/ffis/vio...
3 6939628 Holding Temperatures Y Out of Compliance - Major Holding Temperatures http://www.sandiegocounty.gov/deh/fhd/ffis/vio... 7. Proper hot & cold holding temperatures http://www.sandiegocounty.gov/deh/fhd/ffis/vio...
4 6939628 Approved Procedures N Out of Compliance - Minor Approved Procedures http://www.sandiegocounty.gov/deh/fhd/ffis/vio... 18. Compliance with http://www.sandiegocounty.gov/deh/fhd/ffis/vio...
In [9]:
viol.columns
Out[9]:
Index(['inspection_id', 'violation', 'major_violation', 'status',
       'violation_text', 'correction_type_link', 'violation_accela', 'link'],
      dtype='object')

Introduction to plotly¶

plotly¶

  • We've used plotly in lecture briefly, and you even have to use it in Project 1 Question 13, but we haven't yet discussed it formally.
  • It's a visualization library that enables interactive visualizations.

Using plotly¶

There are a few ways we can use plotly:

  • Using the plotly.express syntax.
    • plotly is very flexible, but it can be verbose; plotly.express allows us to make plots quickly.
    • See the documentation here – it's very rich (there are good examples for almost everything).
  • By setting the pandas plotting backend to 'plotly' (by default, it's 'matplotlib') and using the DataFrame plot method (sketched below).
    • The DataFrame plot method is how you created plots in DSC 10!

For now, we'll use the plotly.express syntax, which is imported as px in the dsc80_utils.py file that we import at the top of each lecture notebook.
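
That said, here's a quick sketch of the second approach, in case you want to try it:

# Tell pandas to draw plots with plotly instead of matplotlib.
pd.options.plotting.backend = 'plotly'
insp['score'].plot(kind='hist')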

Initial plots¶

First, let's look at the distribution of inspection 'score's:

In [10]:
px.histogram(insp['score'])
In [11]:
insp.columns
Out[11]:
Index(['custom_id', 'business_id', 'inspection_id', 'description', 'type',
       'score', 'grade', 'completed_date', 'status', 'link', 'status_link'],
      dtype='object')

How about the distribution of average inspection 'score' per 'grade'?

In [12]:
scores = (
    insp[['grade', 'score']]
    .groupby('grade')
    .mean()
    .reset_index()
)
scores
Out[12]:
grade score
0 A 97.27
1 B 81.91
In [13]:
px.bar(scores, x='grade', y='score')
In [14]:
scores.plot(kind='bar', x='grade', y='score')

Exploratory data analysis and feature types¶

The data science lifecycle, revisited¶


We're at the stage of understanding the data.

Exploratory data analysis (EDA)¶

  • Historically, data analysis was dominated by formal statistics, including tools like confidence intervals, hypothesis tests, and statistical modeling.
  • In 1977, John Tukey defined the term exploratory data analysis, which describes a philosophy for approaching data analysis:

Exploratory data analysis is actively incisive, rather than passively descriptive, with real emphasis on the discovery of the unexpected.

  • Practically, EDA involves, among other things, computing summary statistics and drawing plots to understand the nature of the data at hand.

The greatest gains from data come from surprises… The unexpected is best brought to our attention by pictures.

Different feature types¶

(Diagram: feature types. Variables are either quantitative, i.e. numeric, which can be discrete or continuous, or categorical, which can be ordinal (ordered categories) or nominal (unordered categories).)

Exercise

Determine the feature type of each of the following variables.
  • insp['score']
  • insp['grade']
  • viol['violation_accela']
  • viol['major_violation']
  • rest['business_id']
  • rest['opened_date']
In [15]:
# Your code goes here.
In [16]:
insp['score']
Out[16]:
0       96
1       98
2       98
        ..
5176     0
5177     0
5178    90
Name: score, Length: 5179, dtype: int64
In [17]:
insp['grade'].value_counts()
Out[17]:
A    2978
B      11
Name: grade, dtype: int64
In [18]:
viol['violation_accela']
Out[18]:
0                          21. Hot & cold water available
1                          21. Hot & cold water available
2               7. Proper hot & cold holding temperatures
                              ...                        
5967    35. Equipment / Utensils -approved, installed,...
5968    43. Toilet facilities -properly constructed, s...
5969    45. Floor, walls and ceilings - built, maintai...
Name: violation_accela, Length: 5970, dtype: object
In [19]:
viol['major_violation']
Out[19]:
0       Y
1       N
2       N
       ..
5967    N
5968    N
5969    N
Name: major_violation, Length: 5970, dtype: object
In [20]:
rest['business_id']
Out[20]:
0      211898487641
1      211930769329
2      211909057778
           ...     
997    211899338714
998    211942150255
999    211925713322
Name: business_id, Length: 1000, dtype: int64
In [21]:
rest['opened_date']
Out[21]:
0      2002-05-05
1      2023-07-24
2      2019-01-22
          ...    
997    2002-05-05
998    2016-11-03
999    2022-11-03
Name: opened_date, Length: 1000, dtype: object

Feature types vs. data types¶

  • The data type pandas uses is not the same as the feature type we just discussed!
    • There's a difference between the (conceptual) feature type and the (computational) data type.
  • Take care when the two don't match up well!
In [22]:
# pandas stores these as ints, but they're actually nominal.
rest['business_id']
Out[22]:
0      211898487641
1      211930769329
2      211909057778
           ...     
997    211899338714
998    211942150255
999    211925713322
Name: business_id, Length: 1000, dtype: int64
In [23]:
# pandas stores these as strings, but they're actually dates (timestamps).
rest['opened_date']
Out[23]:
0      2002-05-05
1      2023-07-24
2      2019-01-22
          ...    
997    2002-05-05
998    2016-11-03
999    2022-11-03
Name: opened_date, Length: 1000, dtype: object

Data cleaning¶

Four pillars of data cleaning¶

When loading in a dataset, to clean the data – that is, to prepare it for further analysis – we will:

  1. Perform data quality checks.
  2. Identify and handle missing values.
  3. Perform transformations, including converting time series data to timestamps.
  4. Modify structure as necessary.

Data cleaning: Data quality checks¶

Data quality checks¶

We often start an analysis by checking the quality of the data.

  • Scope: Do the data match your understanding of the population?
  • Measurements and values: Are the values reasonable?
  • Relationships: Are related features in agreement?
  • Analysis: Which features might be useful in a future analysis?

Scope¶

Do the data match your understanding of the population?

We were told that we're only looking at the 1000 restaurants closest to UCSD, so the restaurants in rest should agree with that.

In [24]:
rest.sample(5)
Out[24]:
business_id name business_type address ... lat long opened_date distance
671 211988325240 MORELIANAS LA JOLLA Restaurant Food Facility 456 PEARL ST, LA JOLLA, CA 92037 ... 32.84 -117.28 2023-05-09 3.84
571 211903764423 CUVIER CLUB BY WEDGEWOOD WEDDINGS Restaurant Food Facility 7776 EADS AVE, LA JOLLA, CA 92037-4302 ... 32.84 -117.28 2023-05-22 3.51
722 211944067622 COPA VIDA CAFE Restaurant Food Facility 3721 VALLEY CENTRE DR, SAN DIEGO, CA 92130-3329 ... 32.94 -117.23 2020-01-31 3.90
464 211935879286 KEVIES KITCHEN Class A Cottage Food Operation 5371 COLE ST, SAN DIEGO, CA 92117-1120 ... 32.84 -117.20 2015-02-25 3.18
514 211973977434 PICNIC PEOPLE AND COWGIRLQ CATERING Caterer 6355 MARINDUSTRY DR, SUITE# B, SAN DIEGO, CA 9... ... 32.88 -117.18 2019-10-02 3.34

5 rows × 12 columns

Measurements and values¶

Are the values reasonable?

Do the values in the 'grade' column match what we'd expect grades to look like?

In [25]:
insp['grade'].value_counts()
Out[25]:
A    2978
B      11
Name: grade, dtype: int64

What kinds of information does the insp DataFrame hold?

In [26]:
insp.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5179 entries, 0 to 5178
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   custom_id       5179 non-null   object 
 1   business_id     5179 non-null   int64  
 2   inspection_id   5179 non-null   int64  
 3   description     0 non-null      float64
 4   type            5179 non-null   object 
 5   score           5179 non-null   int64  
 6   grade           2989 non-null   object 
 7   completed_date  5179 non-null   object 
 8   status          5179 non-null   object 
 9   link            5179 non-null   object 
 10  status_link     5179 non-null   object 
dtypes: float64(1), int64(3), object(7)
memory usage: 445.2+ KB

What's going on in the 'address' column of rest?

In [27]:
# Are there multiple restaurants with the same address?
rest['address'].value_counts()
Out[27]:
5300 GRAND DEL MAR CT, SAN DIEGO, CA 92130       9
8657 VILLA LA JOLLA DR, LA JOLLA, CA 92037       8
4545 LA JOLLA VILLAGE DR, SAN DIEGO, CA 92122    8
                                                ..
3963 GOVERNOR DR, SAN DIEGO, CA 92122            1
4041 GOVERNOR DR, SAN DIEGO, CA 92122-2520       1
2672 DEL MAR HEIGHTS RD, DEL MAR, CA 92014       1
Name: address, Length: 863, dtype: int64
In [28]:
# Keeps all rows with duplicate addresses.
(
    rest
    .groupby('address')
    .filter(lambda df: df.shape[0] >= 2)
    .sort_values('address')
)
Out[28]:
business_id name business_type address ... lat long opened_date distance
406 211899308875 NASEEMS BAKERY & KABOB Restaurant Food Facility 10066 PACIFIC HEIGHTS BLVD, SAN DIEGO, CA 92121 ... 32.90 -117.19 2012-04-17 2.77
402 211898699154 HANAYA SUSHI CAFE Restaurant Food Facility 10066 PACIFIC HEIGHTS BLVD, SAN DIEGO, CA 92121 ... 32.90 -117.19 2011-03-22 2.77
401 211899558107 ARMANDOS MEXICAN FOOD Restaurant Food Facility 10066 PACIFIC HEIGHTS BLVD, SAN DIEGO, CA 92121 ... 32.90 -117.19 2005-06-28 2.77
... ... ... ... ... ... ... ... ... ...
575 211972411855 TARA HEATHER CAKE DESIGN Caterer 9932 MESA RIM RD, SUITE# A, SAN DIEGO, CA 9212... ... 32.90 -117.18 2014-04-24 3.51
344 211990537315 COMPASS GROUP FEDEX EXPRESS OLSON Pre-Packaged Retail Market 9999 OLSON DR, SAN DIEGO, CA 92121-2837 ... 32.89 -117.20 2022-10-19 2.27
343 211976587262 CANTEEN - FED EX OLSON Pre-Packaged Retail Market 9999 OLSON DR, SAN DIEGO, CA 92121-2837 ... 32.89 -117.20 2020-07-31 2.27

213 rows × 12 columns

In [29]:
# Does the same thing as above!
rest[rest.duplicated(subset=['address'], keep=False)].sort_values('address')
Out[29]:
business_id name business_type address ... lat long opened_date distance
406 211899308875 NASEEMS BAKERY & KABOB Restaurant Food Facility 10066 PACIFIC HEIGHTS BLVD, SAN DIEGO, CA 92121 ... 32.90 -117.19 2012-04-17 2.77
402 211898699154 HANAYA SUSHI CAFE Restaurant Food Facility 10066 PACIFIC HEIGHTS BLVD, SAN DIEGO, CA 92121 ... 32.90 -117.19 2011-03-22 2.77
401 211899558107 ARMANDOS MEXICAN FOOD Restaurant Food Facility 10066 PACIFIC HEIGHTS BLVD, SAN DIEGO, CA 92121 ... 32.90 -117.19 2005-06-28 2.77
... ... ... ... ... ... ... ... ... ...
575 211972411855 TARA HEATHER CAKE DESIGN Caterer 9932 MESA RIM RD, SUITE# A, SAN DIEGO, CA 9212... ... 32.90 -117.18 2014-04-24 3.51
344 211990537315 COMPASS GROUP FEDEX EXPRESS OLSON Pre-Packaged Retail Market 9999 OLSON DR, SAN DIEGO, CA 92121-2837 ... 32.89 -117.20 2022-10-19 2.27
343 211976587262 CANTEEN - FED EX OLSON Pre-Packaged Retail Market 9999 OLSON DR, SAN DIEGO, CA 92121-2837 ... 32.89 -117.20 2020-07-31 2.27

213 rows × 12 columns

Relationships¶

Are related features in agreement?

Do the 'address'es and 'zip' codes in rest match?

In [30]:
rest[['address', 'zip']]
Out[30]:
address zip
0 3233 LA JOLLA VILLAGE DR, LA JOLLA, CA 92037 92037
1 8950 VILLA LA JOLLA DR, SUITE# B123, LA JOLLA,... 92037-1704
2 6902 LA JOLLA BLVD, LA JOLLA, CA 92037 92037
... ... ...
997 1234 TOURMALINE ST, SAN DIEGO, CA 92109-1856 92109-1856
998 12925 EL CAMINO REAL, SUITE# AA4, SAN DIEGO, C... 92130
999 2672 DEL MAR HEIGHTS RD, DEL MAR, CA 92014 92014

1000 rows × 2 columns

What about the 'score's and 'grade's in insp?

In [31]:
insp[['score', 'grade']]
Out[31]:
score grade
0 96 NaN
1 98 NaN
2 98 NaN
... ... ...
5176 0 NaN
5177 0 NaN
5178 90 A

5179 rows × 2 columns

Analysis¶

Which features might be useful in a future analysis?

  • We're most interested in:
    • These columns in the rest DataFrame: 'business_id', 'name', 'address', 'zip', and 'opened_date'.
    • These columns in the insp DataFrame: 'business_id', 'inspection_id', 'score', 'grade', 'completed_date', and 'status'.
    • These columns in the viol DataFrame: 'inspection_id', 'violation', 'major_violation', 'violation_text', and 'violation_accela'.
  • Also, let's rename a few columns to make them easier to work with.

💡 Pro-Tip: Using pipe¶

When we manipulate DataFrames, it's best to define individual functions for each step, then use the pipe method to chain them all together.

The pipe DataFrame method takes in a function, which itself takes in a DataFrame and returns a DataFrame.

  • In practice, we would add functions one by one to the top of a notebook, then pipe them all.
  • For today, we'll keep re-running pipe to show the data cleaning process.
In [32]:
def subset_rest(rest):
    return rest[['business_id', 'name', 'address', 'zip', 'opened_date']]

rest = (
    pd.read_csv(rest_path)
    .pipe(subset_rest)
)
rest
Out[32]:
business_id name address zip opened_date
0 211898487641 MOBIL MART LA JOLLA VILLAGE 3233 LA JOLLA VILLAGE DR, LA JOLLA, CA 92037 92037 2002-05-05
1 211930769329 CAFE 477 8950 VILLA LA JOLLA DR, SUITE# B123, LA JOLLA,... 92037-1704 2023-07-24
2 211909057778 VALLEY FARM MARKET 6902 LA JOLLA BLVD, LA JOLLA, CA 92037 92037 2019-01-22
... ... ... ... ... ...
997 211899338714 PACIFIC BEACH ELEMENTARY 1234 TOURMALINE ST, SAN DIEGO, CA 92109-1856 92109-1856 2002-05-05
998 211942150255 POKEWAN DEL MAR 12925 EL CAMINO REAL, SUITE# AA4, SAN DIEGO, C... 92130 2016-11-03
999 211925713322 SAFFRONO LOUNGE RESTAURANT 2672 DEL MAR HEIGHTS RD, DEL MAR, CA 92014 92014 2022-11-03

1000 rows × 5 columns

In [33]:
# Same as the above – but the above makes it easier to chain more .pipe calls afterwards.
subset_rest(pd.read_csv(rest_path))
Out[33]:
business_id name address zip opened_date
0 211898487641 MOBIL MART LA JOLLA VILLAGE 3233 LA JOLLA VILLAGE DR, LA JOLLA, CA 92037 92037 2002-05-05
1 211930769329 CAFE 477 8950 VILLA LA JOLLA DR, SUITE# B123, LA JOLLA,... 92037-1704 2023-07-24
2 211909057778 VALLEY FARM MARKET 6902 LA JOLLA BLVD, LA JOLLA, CA 92037 92037 2019-01-22
... ... ... ... ... ...
997 211899338714 PACIFIC BEACH ELEMENTARY 1234 TOURMALINE ST, SAN DIEGO, CA 92109-1856 92109-1856 2002-05-05
998 211942150255 POKEWAN DEL MAR 12925 EL CAMINO REAL, SUITE# AA4, SAN DIEGO, C... 92130 2016-11-03
999 211925713322 SAFFRONO LOUNGE RESTAURANT 2672 DEL MAR HEIGHTS RD, DEL MAR, CA 92014 92014 2022-11-03

1000 rows × 5 columns

Let's use pipe to keep (and rename) the subset of the columns we care about in the other two DataFrames as well.

In [34]:
def subset_insp(insp):
    return (
        insp[['business_id', 'inspection_id', 'score', 'grade', 'completed_date', 'status']]
        .rename(columns={'completed_date': 'date'})
    )

insp = (
    pd.read_csv(insp_path)
    .pipe(subset_insp)
)
In [35]:
def subset_viol(viol):
    return (
        viol[['inspection_id', 'violation', 'major_violation', 'violation_accela']]
        .rename(columns={'violation': 'kind',
                         'major_violation': 'is_major',
                         'violation_accela': 'violation'})
    )

viol = (
    pd.read_csv(viol_path)
    .pipe(subset_viol)
)

Combining the restaurant data¶

Let's join all three DataFrames together so that we have all the data in a single DataFrame.

In [36]:
rest
Out[36]:
business_id name address zip opened_date
0 211898487641 MOBIL MART LA JOLLA VILLAGE 3233 LA JOLLA VILLAGE DR, LA JOLLA, CA 92037 92037 2002-05-05
1 211930769329 CAFE 477 8950 VILLA LA JOLLA DR, SUITE# B123, LA JOLLA,... 92037-1704 2023-07-24
2 211909057778 VALLEY FARM MARKET 6902 LA JOLLA BLVD, LA JOLLA, CA 92037 92037 2019-01-22
... ... ... ... ... ...
997 211899338714 PACIFIC BEACH ELEMENTARY 1234 TOURMALINE ST, SAN DIEGO, CA 92109-1856 92109-1856 2002-05-05
998 211942150255 POKEWAN DEL MAR 12925 EL CAMINO REAL, SUITE# AA4, SAN DIEGO, C... 92130 2016-11-03
999 211925713322 SAFFRONO LOUNGE RESTAURANT 2672 DEL MAR HEIGHTS RD, DEL MAR, CA 92014 92014 2022-11-03

1000 rows × 5 columns

In [37]:
insp
Out[37]:
business_id inspection_id score grade date status
0 211898487641 6886133 96 NaN 2023-02-16 Complete
1 211898487641 6631228 98 NaN 2022-01-03 Complete
2 211898487641 6357338 98 NaN 2020-12-03 Complete
... ... ... ... ... ... ...
5176 211925713322 7183979 0 NaN 2023-03-06 Complete
5177 211925713322 7120778 0 NaN 2022-12-09 Complete
5178 211925713322 7104077 90 A 2022-11-30 Complete

5179 rows × 6 columns

In [38]:
def merge_all_restaurant_data():
    return (
        rest
        .merge(insp, on='business_id', how='left')
        .merge(viol, on='inspection_id', how='left')
    )

df = merge_all_restaurant_data()
df
Out[38]:
business_id name address zip ... status kind is_major violation
0 211898487641 MOBIL MART LA JOLLA VILLAGE 3233 LA JOLLA VILLAGE DR, LA JOLLA, CA 92037 92037 ... Complete Hot and Cold Water Y 21. Hot & cold water available
1 211898487641 MOBIL MART LA JOLLA VILLAGE 3233 LA JOLLA VILLAGE DR, LA JOLLA, CA 92037 92037 ... Complete Hot and Cold Water N 21. Hot & cold water available
2 211898487641 MOBIL MART LA JOLLA VILLAGE 3233 LA JOLLA VILLAGE DR, LA JOLLA, CA 92037 92037 ... Complete Holding Temperatures N 7. Proper hot & cold holding temperatures
... ... ... ... ... ... ... ... ... ...
8728 211925713322 SAFFRONO LOUNGE RESTAURANT 2672 DEL MAR HEIGHTS RD, DEL MAR, CA 92014 92014 ... Complete Equipment and Utensil Storage, Use N 35. Equipment / Utensils -approved, installed,...
8729 211925713322 SAFFRONO LOUNGE RESTAURANT 2672 DEL MAR HEIGHTS RD, DEL MAR, CA 92014 92014 ... Complete Toilet Facilities N 43. Toilet facilities -properly constructed, s...
8730 211925713322 SAFFRONO LOUNGE RESTAURANT 2672 DEL MAR HEIGHTS RD, DEL MAR, CA 92014 92014 ... Complete Floors, Walls, and Ceilings N 45. Floor, walls and ceilings - built, maintai...

8731 rows × 13 columns

Data cleaning: Missing values¶

Missing values¶

Next, it's important to check for and handle missing values, as they can have a big effect on your analysis.

In [39]:
insp[['score', 'grade']]
Out[39]:
score grade
0 96 NaN
1 98 NaN
2 98 NaN
... ... ...
5176 0 NaN
5177 0 NaN
5178 90 A

5179 rows × 2 columns

In [40]:
# The proportion of values in each column that are missing.
insp.isna().mean()
Out[40]:
business_id      0.00
inspection_id    0.00
score            0.00
grade            0.42
date             0.00
status           0.00
dtype: float64
In [41]:
# Why are there null values here?
# insp['inspection_id'] and viol['inspection_id'] don't have any null values...
df['inspection_id'].isna().mean()
Out[41]:
0.0033214981101821095
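
One likely explanation, which the sketch below checks: merge_all_restaurant_data merges with how='left', so restaurants that have no inspections at all end up with NaN in every inspection column, including 'inspection_id'.

# Restaurants in `rest` with no matching rows in `insp`.
no_insp = rest[~rest['business_id'].isin(insp['business_id'])]
no_insp.shape[0]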

There are many ways of handling missing values, which we'll cover in an entire lecture next week. But a good first step is to check how many there are!

Data cleaning: Transformations and timestamps¶

Transformations and timestamps¶

From last class:

A transformation results from performing some operation on every element in a sequence, e.g. a Series.

It's often useful to look at ways of transforming your data to make it easier to work with.

  • Type conversions (e.g. changing the string "$2.99" to the number 2.99; see the sketch after this list).
  • Unit conversion (e.g. feet to meters).
  • Extraction (getting 'vermin' out of 'Vermin Violation Recorded on 10/10/2023').
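
For instance, a minimal sketch of the first kind of transformation (the prices Series is made up, not from our dataset):

# Strip the '$' and convert the strings to floats.
prices = pd.Series(['$2.99', '$10.50'])
prices.str.replace('$', '', regex=False).astype(float)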

Creating timestamps¶

Most commonly, we'll parse dates into pd.Timestamp objects.

In [42]:
# Look at the dtype!
insp['date']
Out[42]:
0       2023-02-16
1       2022-01-03
2       2020-12-03
           ...    
5176    2023-03-06
5177    2022-12-09
5178    2022-11-30
Name: date, Length: 5179, dtype: object
In [43]:
# This magical string tells Python what format the date is in.
# For more info: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior
date_format = '%Y-%m-%d'
pd.to_datetime(insp['date'], format=date_format)
Out[43]:
0      2023-02-16
1      2022-01-03
2      2020-12-03
          ...    
5176   2023-03-06
5177   2022-12-09
5178   2022-11-30
Name: date, Length: 5179, dtype: datetime64[ns]
In [44]:
# Another advantage of defining functions is that we can reuse this function
# for the 'opened_date' column in `rest` if we wanted to.
def parse_dates(insp, col):
    date_format = '%Y-%m-%d'
    dates = pd.to_datetime(insp[col], format=date_format)
    return insp.assign(**{col: dates})

insp = (
    pd.read_csv(insp_path)
    .pipe(subset_insp)
    .pipe(parse_dates, 'date')
)

# We should also remake df, since it depends on insp.
# Note that the new insp is used to create df!
df = merge_all_restaurant_data()
In [45]:
# Look at the dtype now!
df['date']
Out[45]:
0      2023-02-16
1      2022-01-03
2      2020-12-03
          ...    
8728   2022-11-30
8729   2022-11-30
8730   2022-11-30
Name: date, Length: 8731, dtype: datetime64[ns]

Working with timestamps¶

  • We often want to adjust the granularity of timestamps to see overall trends or seasonality.
  • Use the resample method in pandas (documentation).
    • Think of it like a version of groupby, but for timestamps.
    • For instance, insp.resample('2W', on='date') separates every two weeks of data into a different group.
In [46]:
insp.resample('2W', on='date').mean()
Out[46]:
business_id inspection_id score
date
2020-01-05 2.12e+11 6.35e+06 42.67
2020-01-19 2.12e+11 6.30e+06 59.33
2020-02-02 2.12e+11 6.32e+06 56.34
... ... ... ...
2023-09-24 2.12e+11 7.15e+06 66.60
2023-10-08 2.12e+11 7.19e+06 59.58
2023-10-22 2.12e+11 7.20e+06 66.81

100 rows × 3 columns

In [47]:
# Where are those numbers coming from?
insp[(insp['date'] >= pd.Timestamp('2020-01-05')) &
     (insp['date'] < pd.Timestamp('2020-01-19'))]['score'].mean()
Out[47]:
59.325581395348834
In [48]:
insp.resample('2W', on='date')['score'].mean().plot(title='Average Inspection Score Over Time')

The .dt accessor¶

Like with Series of strings, pandas has a .dt accessor for properties of timestamps (documentation).

In [49]:
insp['date']
Out[49]:
0      2023-02-16
1      2022-01-03
2      2020-12-03
          ...    
5176   2023-03-06
5177   2022-12-09
5178   2022-11-30
Name: date, Length: 5179, dtype: datetime64[ns]
In [50]:
insp['date'].dt.day
Out[50]:
0       16
1        3
2        3
        ..
5176     6
5177     9
5178    30
Name: date, Length: 5179, dtype: int64
In [51]:
insp['date'].dt.dayofweek
Out[51]:
0       3
1       0
2       3
       ..
5176    0
5177    4
5178    2
Name: date, Length: 5179, dtype: int64
In [52]:
fig = px.bar(
    insp['date'].dt.dayofweek.value_counts()
)
fig.update_xaxes(tickvals=np.arange(7), ticktext=['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun'])

Data cleaning: Modifying structure¶

Reshaping DataFrames¶

We often reshape the DataFrame's structure to make it more convenient for analysis. For example, we can:

  • Simplify structure by removing columns or taking a set of rows for a particular period of time or geographic area.
    • We already did this!
  • Adjust granularity by aggregating rows together.
    • To do this, use groupby (or resample, if working with timestamps).
  • Reshape structure, most commonly by using the DataFrame melt method to un-pivot a DataFrame.

Using melt¶

  • The melt method is common enough that we'll give it a special mention.
  • We'll often encounter pivot tables (esp. from government data), which we call wide data.
  • The methods we've introduced work better with long-form data, or tidy data.
  • To go from wide to long, melt.

Example usage of melt¶

In [53]:
wide_example = pd.DataFrame({
    'Year': [2001, 2002],
    'Jan': [10, 130],
    'Feb': [20, 200],
    'Mar': [30, 340]
}).set_index('Year')
wide_example
Out[53]:
Jan Feb Mar
Year
2001 10 20 30
2002 130 200 340
In [54]:
wide_example.melt(ignore_index=False)
Out[54]:
variable value
Year
2001 Jan 10
2002 Jan 130
2001 Feb 20
2002 Feb 200
2001 Mar 30
2002 Mar 340

Exploration¶

Question 🤔 (Answer at q.dsc80.com)

What questions do you want me to try and answer with the data? I'll start with a single pre-prepared question, and then answer student questions until we run out of time.

Example question: Can we rank restaurants by their number of violations? How about separately for each zip code?¶

And why would we want to do that? 🤔

In [55]:
def rank_by_violations(df):
    # Rank restaurants by their number of rows (violations) in df,
    # keeping each restaurant's first zip code along the way.
    return (
        df
        .groupby('name')
        ['zip']
        .agg(['first', 'size'])
        .sort_values('size', ascending=False)
        .reset_index()
        .pipe(lambda frame: frame.assign(rank=frame.index + 1)
                                 .rename(columns={'first': 'zip'}))
    )
In [56]:
df.pipe(rank_by_violations)
Out[56]:
name zip size rank
0 WHOLE FOODS MARKET 92037-1949 62 1
1 RIGOBERTOS TACO SHOP 92037 61 2
2 OKAN DINER 92117 52 3
... ... ... ... ...
944 JUST LIKE MOM MAKES 92130-2283 1 945
945 KEVIES KITCHEN 92117-1120 1 946
946 PETRA KITCHEN 92117-4949 1 947

947 rows × 4 columns

In [57]:
(
    df
    .groupby('zip')
    .apply(rank_by_violations)
    .reset_index(drop=True)
    .pivot(index='rank', columns='zip', values='name')
)
Out[57]:
zip 92014 92014-3101 92014-3110 92014-3149 ... 92130-6657 92130-6974 92130-8605 92590
rank
1 LE BAMBOU RESTAURANT 7-ELEVEN #13628C-2111 BUSHFIRE KITCHEN JERSEY MIKES SUBS ... COPA VIDA CAFE OCEAN AIR ELEMENTARY CRISCITO PIZZA DEVILICIOUS FOOD TRUCK
2 SAFFRONO LOUNGE RESTAURANT NaN STARBUCKS CRUMBL COOKIE ... NaN NaN NaN NaN
3 JACK IN THE BOX #0081 NaN NaN NaN ... NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ...
155 NaN NaN NaN NaN ... NaN NaN NaN NaN
156 NaN NaN NaN NaN ... NaN NaN NaN NaN
157 NaN NaN NaN NaN ... NaN NaN NaN NaN

157 rows × 322 columns

In [58]:
df.query('zip == "92014-3149"').value_counts('name')
Out[58]:
name
JERSEY MIKES SUBS    13
CRUMBL COOKIE         4
dtype: int64

Example question: Which restaurants have vermin violations?¶

In [59]:
(
    df[df['violation'].str.lower().str.contains('vermin', na=False)]
    .value_counts('name')
    .head(6)
)
Out[59]:
name
ISOLA LA JOLLA                 4
L AND F CAFE                   3
MONGOLIAN HOT POT              3
GIRARD GOURMET                 3
HOLY GAO CHINESE RESTAURANT    3
JOSES COURT ROOM               3
dtype: int64

Example question: What are some of the most uncommon violations?¶

In [60]:
df.groupby('kind').filter(lambda df: df.shape[0] == 1)
Out[60]:
business_id name address zip ... status kind is_major violation
441 211996758893 SHAKA JAVA 3655 NOBEL DR, SAN DIEGO, CA 92122 92122 ... Complete Commissary/MSU Operation N Commissary/MSU Operation
2311 211973776043 BARBARELLA RESTAURANT 2171 AVENIDA DE LA PLAYA, LA JOLLA, CA 92037 92037 ... Complete Gulf Oysters Y 17. Compliance with Gulf Oyster Regulations
5096 211898213216 BISHOP SCHOOL 7607 LA JOLLA BLVD, LA JOLLA, CA 92037 92037 ... Complete Highly Susceptible Populations Y 20. Licensed health care facilities / public &...
6386 211918747238 L & E MUSCLE MEALS 4577 CLAIREMONT DR, SAN DIEGO, CA 92117 92117 ... Complete No discharge from eyes, nose, or mouth Y 3. No discharge from eyes, nose or mouth
7848 211986408394 LAZY EYE COFFEE 4011 AVATI DR, SAN DIEGO, CA 92117-4403 92117-4403 ... Complete Food Labeling N 29. Food properly labeled and honestly presented

5 rows × 13 columns

What's the worst Subway?¶

In [61]:
df[df['name'].str.lower().str.contains('subway')].value_counts('address')
Out[61]:
address
3860 VALLEY CENTRE DR, SUITE# 403, SAN DIEGO, CA 92130       18
6715 MIRA MESA BLVD, SUITE# 102, SAN DIEGO, CA 92121-4379    13
4653 CARMEL MTN RD, SUITE# 310, SAN DIEGO, CA 92130           9
2206 TORREY PINES RD, SUITE# C, LA JOLLA, CA 92037-3472       7
4973 CLAIREMONT DR, SUITE# A, SAN DIEGO, CA 92117             3
5517 CLAIREMONT MESA BLVD, SAN DIEGO, CA 92117                1
dtype: int64

Summary, next time¶

Summary¶

  • Data cleaning is a necessary starting step in data analysis. There are four pillars of data cleaning:
    • Quality checks.
    • Missing values.
    • Transformations and timestamps.
    • Modifying structure.
  • Approach EDA with an open mind, and draw lots of visualizations.

Next time¶

Hypothesis and permutation testing. Some of this will be DSC 10 review, but we'll also push further! Expect a pre-lecture reading tomorrow!