from dsc80_utils import *
Announcements 📣¶
- Lab 2 is due tomorrow, Wed, April 17.
- Project 1 is due this Fri, April 19.
Agenda 📆¶
- Other data representations.
- Dataset overview.
- Introduction to plotly.
- Exploratory data analysis and feature types.
- Data cleaning.
- Data quality checks.
- Missing values.
- Transformations and timestamps.
- Modifying structure.
- Investigating student-submitted questions!
Question 🤔 (Answer at q.dsc80.com)
Remember, you can always ask questions at q.dsc80.com! If the link doesn't work for you, click the 🤔 Lecture Questions link in the top right corner of the course website.
Dataset overview¶
San Diego food safety¶
From this article (archive link):
In the last three years, one third of San Diego County restaurants have had at least one major food safety violation.
99% Of San Diego Restaurants Earn ‘A’ Grades, Bringing Usefulness of System Into Question¶
From this article (archive link):
Food held at unsafe temperatures. Employees not washing their hands. Dirty countertops. Vermin in the kitchen. An expired restaurant permit.
Restaurant inspectors for San Diego County found these violations during a routine health inspection of a diner in La Mesa in November 2016. Despite the violations, the restaurant was awarded a score of 90 out of 100, the lowest possible score to achieve an ‘A’ grade.
The data¶
- We downloaded the data about the 1000 restaurants closest to UCSD from here.
- We had to download the data as JSON files, then process it into DataFrames. You'll learn how to do this soon!
- Until now, you've (largely) been presented with CSV files that pd.read_csv could load without any issues.
  - But there are many different formats and possible issues when loading data in from files.
  - See Chapter 8 of Learning DS for more.
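As a rough preview of the JSON-to-DataFrame step mentioned above, here is a minimal sketch. The JSON payload, its nested `location` field, and the name `toy_rest` are made up for illustration; the structure of the real downloaded files isn't shown in this lecture, so treat this as one possible approach rather than the processing we actually did.

```python
import json
import pandas as pd

# Hypothetical JSON text; the real files may be structured differently.
raw = '''
[
  {"business_id": 1, "name": "CAFE A", "location": {"lat": 32.87, "long": -117.23}},
  {"business_id": 2, "name": "CAFE B", "location": {"lat": 32.88, "long": -117.24}}
]
'''

# Parse the JSON text into a list of Python dictionaries.
records = json.loads(raw)

# pd.json_normalize flattens nested dictionaries into columns
# with dotted names, such as 'location.lat'.
toy_rest = pd.json_normalize(records)
toy_rest
```

Nested fields are one of the "possible issues" that a plain `pd.read_csv` call never has to deal with, which is why a flattening step like this one is often needed.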
rest_path = Path('data') / 'restaurants.csv'
insp_path = Path('data') / 'inspections.csv'
viol_path = Path('data') / 'violations.csv'
rest = pd.read_csv(rest_path)
insp = pd.read_csv(insp_path)
viol = pd.read_csv(viol_path)
Question 🤔 (Answer at q.dsc80.com)
The first article said that one third of restaurants had at least one major safety violation.
Which DataFrames and columns seem most useful to verify this?
rest.head(2)
| | business_id | name | business_type | address | ... | lat | long | opened_date | distance |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 211898487641 | MOBIL MART LA JOLLA VILLAGE | Pre-Packaged Retail Market | 3233 LA JOLLA VILLAGE DR, LA JOLLA, CA 92037 | ... | 32.87 | -117.23 | 2002-05-05 | 0.62 |
| 1 | 211930769329 | CAFE 477 | Low Risk Food Facility | 8950 VILLA LA JOLLA DR, SUITE# B123, LA JOLLA,... | ... | 32.87 | -117.24 | 2023-07-24 | 0.64 |
2 rows × 12 columns
rest.columns
Index(['business_id', 'name', 'business_type', 'address', 'city', 'zip', 'phone', 'status', 'lat', 'long', 'opened_date', 'distance'], dtype='object')
insp.head(2)
| | custom_id | business_id | inspection_id | description | ... | completed_date | status | link | status_link |
|---|---|---|---|---|---|---|---|---|---|
| 0 | DEH2002-FFPN-310012 | 211898487641 | 6886133 | NaN | ... | 2023-02-16 | Complete | http://www.sandiegocounty.gov/deh/fhd/ffis/ins... | http://www.sandiegocounty.gov/deh/fhd/ffis/ins... |
| 1 | DEH2002-FFPN-310012 | 211898487641 | 6631228 | NaN | ... | 2022-01-03 | Complete | http://www.sandiegocounty.gov/deh/fhd/ffis/ins... | http://www.sandiegocounty.gov/deh/fhd/ffis/ins... |
2 rows × 11 columns
insp.columns
Index(['custom_id', 'business_id', 'inspection_id', 'description', 'type', 'score', 'grade', 'completed_date', 'status', 'link', 'status_link'], dtype='object')
viol.head(2)
| | inspection_id | violation | major_violation | status | violation_text | correction_type_link | violation_accela | link |
|---|---|---|---|---|---|---|---|---|
| 0 | 6886133 | Hot and Cold Water | Y | Out of Compliance - Major | Hot and Cold Water | http://www.sandiegocounty.gov/deh/fhd/ffis/vio... | 21. Hot & cold water available | http://www.sandiegocounty.gov/deh/fhd/ffis/vio... |
| 1 | 6631228 | Hot and Cold Water | N | Out of Compliance - Minor | Hot and Cold Water | http://www.sandiegocounty.gov/deh/fhd/ffis/vio... | 21. Hot & cold water available | http://www.sandiegocounty.gov/deh/fhd/ffis/vio... |
viol.columns
Index(['inspection_id', 'violation', 'major_violation', 'status', 'violation_text', 'correction_type_link', 'violation_accela', 'link'], dtype='object')
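Now that we've seen the columns of all three DataFrames, here is one way the "one third of restaurants" claim could be checked, sketched on tiny made-up tables. The `toy_` DataFrames below reuse the real column names (`business_id`, `inspection_id`, `major_violation`), but their values are invented, and the sketch ignores the article's "last three years" condition, which would presumably need a filter on `completed_date`.

```python
import pandas as pd

# Toy stand-ins for rest, insp, and viol (invented values, real column names).
toy_rest = pd.DataFrame({'business_id': [1, 2, 3]})
toy_insp = pd.DataFrame({
    'business_id': [1, 1, 2, 3],
    'inspection_id': [10, 11, 20, 30],
})
toy_viol = pd.DataFrame({
    'inspection_id': [10, 11, 20],
    'major_violation': ['Y', 'N', 'N'],
})

# Attach each violation to the restaurant whose inspection produced it.
merged = toy_insp.merge(toy_viol, on='inspection_id', how='inner')

# IDs of restaurants with at least one violation marked 'Y'.
has_major = merged.loc[merged['major_violation'] == 'Y', 'business_id'].unique()

# Proportion of all restaurants that appear in that set.
frac_major = toy_rest['business_id'].isin(has_major).mean()
frac_major
```

The denominator comes from `toy_rest`, not from `merged`, so restaurants with no recorded violations still count toward the total.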
Introduction to plotly¶
plotly¶
- We've used plotly in lecture briefly, and you even have to use it in Project 1 Question 13, but we haven't yet discussed it formally.
- It's a visualization library that enables interactive visualizations.
Using plotly¶
There are a few ways we can use plotly:
- Using the plotly.express syntax.
  - plotly is very flexible, but it can be verbose; plotly.express allows us to make plots quickly.
  - See the documentation here; it's very rich (there are good examples for almost everything).
- By setting the pandas plotting backend to 'plotly' (by default, it's 'matplotlib') and using the DataFrame plot method.
  - The DataFrame plot method is how you created plots in DSC 10!
For now, we'll use plotly.express syntax; we've imported it in the dsc80_utils.py file that we import at the top of each lecture notebook.
Initial plots¶
First, let's look at the distribution of inspection 'score's:
fig = px.histogram(insp['score'])
fig