In [1]:
# Run this cell if you're following along – it just helps make the lectures appear prettier.
import pandas as pd
import numpy as np

np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option("display.max_rows", 7)
pd.set_option("display.max_columns", 8)
pd.set_option("display.precision", 2)

Lecture 4 – DataFrames: Accessing, Sorting, and Querying¶

DSC 10, Spring 2023¶

Announcements¶

  • Lab 0 is due tomorrow at 11:59PM.
    • Submit early and avoid Submission Errors.
    • You can turn it in up to two days late using slip days – see the Syllabus for more information. You never need to ask to use a slip day.
    • Please fill out the Welcome Survey as well!
  • Lab 1 is the next assignment, and it's due on Saturday at 11:59PM.
  • Come to office hours (see the schedule here) and post on Ed for help!
  • Watch this video, which walks through the activity from the end of Friday's lecture.
  • You must be present when attendance is taken in discussion to get credit, even if you have a conflicting class.

Agenda¶

Today, we'll use a real dataset and lots of motivating questions to illustrate key DataFrame manipulation techniques.

Note:¶

  • Remember to check the Resources tab of the course website for programming resources.
  • Some key links moving forward:
    • DSC 10 Reference Sheet.
    • babypandas notes.
    • babypandas documentation.

DataFrames¶

pandas¶

  • pandas is a Python package that allows us to work with tabular data – that is, data in the form of a table that we might otherwise work with as a spreadsheet (in Excel or Google Sheets).
  • pandas is the tool for doing data science in Python.

But pandas is not so cute...¶

Enter babypandas!¶

  • We at UCSD have created a smaller, nicer version of pandas called babypandas.
  • It keeps the important stuff and has much better error messages.
  • It's easier to learn, but is still valid pandas code. You are learning pandas!

DataFrames in babypandas 🐼¶

  • Tables in babypandas (and pandas) are called "DataFrames."
  • To use DataFrames, we'll need to import babypandas. (We'll need numpy as well.)
In [2]:
import babypandas as bpd
import numpy as np

About the Data: Get It Done 👷¶

  • We'll usually work with data stored in the CSV format. CSV stands for "comma-separated values."
  • The file data/get-it-done-apr-08.csv contains service requests made on April 8th, 2023 through the Get It Done program.
  • Get It Done allows the general public to report non-emergency problems to the City of San Diego through a mobile app, website, or phone call.

Reading data from a file 📖¶

We can read in a CSV using bpd.read_csv(...). Give it the path to a file relative to your notebook; if the file is in the same folder as your notebook, this is just the name of the file.

In [3]:
apr_08 = bpd.read_csv('data/get-it-done-apr-08.csv')
apr_08
Out[3]:
service_request_id date_requested neighborhood service status street_address public_description
0 4183116 2023-04-08T00:32:00 Downtown Traffic Signal Issue open 2ND AVE & G ST Signal, 2nd & G
1 4183117 2023-04-08T00:44:00 Mid-City:Eastern Area Missed Collection open 4791 Seminole Dr, San Diego, CA 92115, USA My Blue Recycle Bin was not collected. I'm th...
2 4183118 2023-04-08T00:49:00 Navajo Parking open 4728 Allied Rd, San Diego, CA 92120, USA White van parked across my driveway. I will n...
3 4183119 2023-04-08T01:04:00 Encanto Neighborhoods Missed Collection open 6066 Tempas Ct Green container
4 4183120 2023-04-08T01:10:00 Navajo Pothole open 7961?7979 Topaz Lake Ave Potholes || LOCATION: Topaz Lake between Pear...
... ... ... ... ... ... ... ...
1084 4184220 2023-04-08T19:53:00 Barrio Logan Other closed 3718 Dalbergia St Prostitution
1085 4184221 2023-04-08T19:53:00 Barrio Logan Other closed 3743 Dalbergia St Prostitution
1086 4184223 2023-04-08T19:54:00 Barrio Logan Other closed 2120 Woden St Prostitution
1087 4184225 2023-04-08T19:54:00 Barrio Logan Other closed 2005 Vesta St Prostitution
1088 4184294 2023-04-08T23:36:00 Southeastern San Diego Other closed 3762 Cottonwood St Loud music

1089 rows × 7 columns

Structure of a DataFrame¶

  • DataFrames have columns and rows.
    • Think of each column as an array. Columns contain data of the same type.
  • Each column has a label, e.g. 'neighborhood' and 'status'.
    • A column's label is its name.
    • Column labels are stored as strings.
  • Each row has a label too.
    • Together, the row labels are called the index. The index is not a column!
In [4]:
# This DataFrame has 1089 rows and 7 columns.
apr_08
Out[4]:
service_request_id date_requested neighborhood service status street_address public_description
0 4183116 2023-04-08T00:32:00 Downtown Traffic Signal Issue open 2ND AVE & G ST Signal, 2nd & G
1 4183117 2023-04-08T00:44:00 Mid-City:Eastern Area Missed Collection open 4791 Seminole Dr, San Diego, CA 92115, USA My Blue Recycle Bin was not collected. I'm th...
2 4183118 2023-04-08T00:49:00 Navajo Parking open 4728 Allied Rd, San Diego, CA 92120, USA White van parked across my driveway. I will n...
3 4183119 2023-04-08T01:04:00 Encanto Neighborhoods Missed Collection open 6066 Tempas Ct Green container
4 4183120 2023-04-08T01:10:00 Navajo Pothole open 7961?7979 Topaz Lake Ave Potholes || LOCATION: Topaz Lake between Pear...
... ... ... ... ... ... ... ...
1084 4184220 2023-04-08T19:53:00 Barrio Logan Other closed 3718 Dalbergia St Prostitution
1085 4184221 2023-04-08T19:53:00 Barrio Logan Other closed 3743 Dalbergia St Prostitution
1086 4184223 2023-04-08T19:54:00 Barrio Logan Other closed 2120 Woden St Prostitution
1087 4184225 2023-04-08T19:54:00 Barrio Logan Other closed 2005 Vesta St Prostitution
1088 4184294 2023-04-08T23:36:00 Southeastern San Diego Other closed 3762 Cottonwood St Loud music

1089 rows × 7 columns

Setting a new index¶

  • We can set a better index using .set_index(column_name).
  • Row labels should be unique identifiers.
    • Row labels are row names; ideally, each row has a different, descriptive name.
In [5]:
apr_08.set_index('service_request_id')
Out[5]:
date_requested neighborhood service status street_address public_description
service_request_id
4183116 2023-04-08T00:32:00 Downtown Traffic Signal Issue open 2ND AVE & G ST Signal, 2nd & G
4183117 2023-04-08T00:44:00 Mid-City:Eastern Area Missed Collection open 4791 Seminole Dr, San Diego, CA 92115, USA My Blue Recycle Bin was not collected. I'm th...
4183118 2023-04-08T00:49:00 Navajo Parking open 4728 Allied Rd, San Diego, CA 92120, USA White van parked across my driveway. I will n...
4183119 2023-04-08T01:04:00 Encanto Neighborhoods Missed Collection open 6066 Tempas Ct Green container
4183120 2023-04-08T01:10:00 Navajo Pothole open 7961?7979 Topaz Lake Ave Potholes || LOCATION: Topaz Lake between Pear...
... ... ... ... ... ... ...
4184220 2023-04-08T19:53:00 Barrio Logan Other closed 3718 Dalbergia St Prostitution
4184221 2023-04-08T19:53:00 Barrio Logan Other closed 3743 Dalbergia St Prostitution
4184223 2023-04-08T19:54:00 Barrio Logan Other closed 2120 Woden St Prostitution
4184225 2023-04-08T19:54:00 Barrio Logan Other closed 2005 Vesta St Prostitution
4184294 2023-04-08T23:36:00 Southeastern San Diego Other closed 3762 Cottonwood St Loud music

1089 rows × 6 columns

🚨 Like most DataFrame methods, .set_index returns a new DataFrame; it does not modify the original DataFrame.

In [6]:
apr_08
Out[6]:
service_request_id date_requested neighborhood service status street_address public_description
0 4183116 2023-04-08T00:32:00 Downtown Traffic Signal Issue open 2ND AVE & G ST Signal, 2nd & G
1 4183117 2023-04-08T00:44:00 Mid-City:Eastern Area Missed Collection open 4791 Seminole Dr, San Diego, CA 92115, USA My Blue Recycle Bin was not collected. I'm th...
2 4183118 2023-04-08T00:49:00 Navajo Parking open 4728 Allied Rd, San Diego, CA 92120, USA White van parked across my driveway. I will n...
3 4183119 2023-04-08T01:04:00 Encanto Neighborhoods Missed Collection open 6066 Tempas Ct Green container
4 4183120 2023-04-08T01:10:00 Navajo Pothole open 7961?7979 Topaz Lake Ave Potholes || LOCATION: Topaz Lake between Pear...
... ... ... ... ... ... ... ...
1084 4184220 2023-04-08T19:53:00 Barrio Logan Other closed 3718 Dalbergia St Prostitution
1085 4184221 2023-04-08T19:53:00 Barrio Logan Other closed 3743 Dalbergia St Prostitution
1086 4184223 2023-04-08T19:54:00 Barrio Logan Other closed 2120 Woden St Prostitution
1087 4184225 2023-04-08T19:54:00 Barrio Logan Other closed 2005 Vesta St Prostitution
1088 4184294 2023-04-08T23:36:00 Southeastern San Diego Other closed 3762 Cottonwood St Loud music

1089 rows × 7 columns

In [7]:
apr_08 = apr_08.set_index('service_request_id')
apr_08
Out[7]:
date_requested neighborhood service status street_address public_description
service_request_id
4183116 2023-04-08T00:32:00 Downtown Traffic Signal Issue open 2ND AVE & G ST Signal, 2nd & G
4183117 2023-04-08T00:44:00 Mid-City:Eastern Area Missed Collection open 4791 Seminole Dr, San Diego, CA 92115, USA My Blue Recycle Bin was not collected. I'm th...
4183118 2023-04-08T00:49:00 Navajo Parking open 4728 Allied Rd, San Diego, CA 92120, USA White van parked across my driveway. I will n...
4183119 2023-04-08T01:04:00 Encanto Neighborhoods Missed Collection open 6066 Tempas Ct Green container
4183120 2023-04-08T01:10:00 Navajo Pothole open 7961?7979 Topaz Lake Ave Potholes || LOCATION: Topaz Lake between Pear...
... ... ... ... ... ... ...
4184220 2023-04-08T19:53:00 Barrio Logan Other closed 3718 Dalbergia St Prostitution
4184221 2023-04-08T19:53:00 Barrio Logan Other closed 3743 Dalbergia St Prostitution
4184223 2023-04-08T19:54:00 Barrio Logan Other closed 2120 Woden St Prostitution
4184225 2023-04-08T19:54:00 Barrio Logan Other closed 2005 Vesta St Prostitution
4184294 2023-04-08T23:36:00 Southeastern San Diego Other closed 3762 Cottonwood St Loud music

1089 rows × 6 columns

Shape of a DataFrame¶

  • .shape returns the number of rows and columns in a given DataFrame.
  • Access each with []:
    • .shape[0] for rows.
    • .shape[1] for columns.
In [8]:
# There were 7 columns before, but one of them became the index, and the index is not a column!
apr_08.shape
Out[8]:
(1089, 6)
In [9]:
# Number of rows.
apr_08.shape[0]
Out[9]:
1089
In [10]:
# Number of columns.
apr_08.shape[1]
Out[10]:
6

Annual summary of Get It Done requests¶

  • The file data/get-it-done-requests.csv contains a summary of all Get It Done requests submitted so far in 2023.
  • This dataset describes the types of problems being reported in each neighborhood and the number of service requests that are resolved ('closed') versus unresolved ('open').
In [11]:
requests = bpd.read_csv('data/get-it-done-requests.csv')
requests
Out[11]:
neighborhood service closed open
0 Balboa Park Dead Animal 11 0
1 Balboa Park Development Services - Code Enforcement 2 0
2 Balboa Park Encampment 215 20
3 Balboa Park Environmental Services Code Compliance 8 1
4 Balboa Park Graffiti - Code Enforcement 2 1
... ... ... ... ...
1416 Via De La Valle Encampment 0 1
1417 Via De La Valle Pavement Maintenance 0 3
1418 Via De La Valle Pothole 11 7
1419 Via De La Valle Sidewalk Repair Issue 0 1
1420 Via De La Valle Street Sweeping 1 0

1421 rows × 4 columns

Example 1: Total requests¶

Key concepts: Accessing columns, performing operations with them, and adding new columns.

Finding total requests¶

Question: How many total service requests of each type in each neighborhood have been made this year?

In [12]:
requests
Out[12]:
neighborhood service closed open
0 Balboa Park Dead Animal 11 0
1 Balboa Park Development Services - Code Enforcement 2 0
2 Balboa Park Encampment 215 20
3 Balboa Park Environmental Services Code Compliance 8 1
4 Balboa Park Graffiti - Code Enforcement 2 1
... ... ... ... ...
1416 Via De La Valle Encampment 0 1
1417 Via De La Valle Pavement Maintenance 0 3
1418 Via De La Valle Pothole 11 7
1419 Via De La Valle Sidewalk Repair Issue 0 1
1420 Via De La Valle Street Sweeping 1 0

1421 rows × 4 columns

  • We have, separately, the number of closed service requests and open service requests of each type in each neighborhood.
  • Steps:
    • Get the column of closed requests.
    • Get the column of open requests.
    • Add these columns element-wise.
    • Add a new column to the DataFrame with these totals.

Step 1 – Getting the coluimn of closed requests¶

  • We can get a column from a DataFrame using .get(column_name).
  • 🚨 Column names are case sensitive!
  • Column names are strings, so we need to use quotes.
  • The result looks like a 1-column DataFrame, but is actually a Series.
In [13]:
requests
Out[13]:
neighborhood service closed open
0 Balboa Park Dead Animal 11 0
1 Balboa Park Development Services - Code Enforcement 2 0
2 Balboa Park Encampment 215 20
3 Balboa Park Environmental Services Code Compliance 8 1
4 Balboa Park Graffiti - Code Enforcement 2 1
... ... ... ... ...
1416 Via De La Valle Encampment 0 1
1417 Via De La Valle Pavement Maintenance 0 3
1418 Via De La Valle Pothole 11 7
1419 Via De La Valle Sidewalk Repair Issue 0 1
1420 Via De La Valle Street Sweeping 1 0

1421 rows × 4 columns

In [14]:
requests.get('closed')
Out[14]:
0        11
1         2
2       215
3         8
4         2
       ... 
1416      0
1417      0
1418     11
1419      0
1420      1
Name: closed, Length: 1421, dtype: int64

Digression: Series¶

  • A Series is like an array, but with an index.
  • In particular, Series support arithmetic, just like arrays.
In [15]:
requests.get('closed')
Out[15]:
0        11
1         2
2       215
3         8
4         2
       ... 
1416      0
1417      0
1418     11
1419      0
1420      1
Name: closed, Length: 1421, dtype: int64
In [16]:
type(requests.get('closed'))
Out[16]:
babypandas.bpd.Series

Steps 2 and 3 – Getting the column of open requests and calculating the total¶

In [17]:
requests.get('open')
Out[17]:
0        0
1        0
2       20
3        1
4        1
        ..
1416     1
1417     3
1418     7
1419     1
1420     0
Name: open, Length: 1421, dtype: int64
  • Just like with arrays, we can perform arithmetic operations with two Series, as long as they have the same length and same index.
  • Operations happen element-wise, and the result is also a Series.
In [18]:
requests.get('closed') + requests.get('open')
Out[18]:
0        11
1         2
2       235
3         9
4         3
       ... 
1416      1
1417      3
1418     18
1419      1
1420      1
Length: 1421, dtype: int64

Step 4 – Adding the totals to the DataFrame as a new column¶

  • Use .assign(name_of_column=data_in_series) to assign a Series (or array, or list) to a DataFrame.
  • 🚨 Don't put quotes around name_of_column.
  • This creates a new DataFrame, which we must save to a variable if we want to keep using it.
In [19]:
requests.assign(
    total=requests.get('closed') + requests.get('open')
)
Out[19]:
neighborhood service closed open total
0 Balboa Park Dead Animal 11 0 11
1 Balboa Park Development Services - Code Enforcement 2 0 2
2 Balboa Park Encampment 215 20 235
3 Balboa Park Environmental Services Code Compliance 8 1 9
4 Balboa Park Graffiti - Code Enforcement 2 1 3
... ... ... ... ... ...
1416 Via De La Valle Encampment 0 1 1
1417 Via De La Valle Pavement Maintenance 0 3 3
1418 Via De La Valle Pothole 11 7 18
1419 Via De La Valle Sidewalk Repair Issue 0 1 1
1420 Via De La Valle Street Sweeping 1 0 1

1421 rows × 5 columns

In [20]:
requests
Out[20]:
neighborhood service closed open
0 Balboa Park Dead Animal 11 0
1 Balboa Park Development Services - Code Enforcement 2 0
2 Balboa Park Encampment 215 20
3 Balboa Park Environmental Services Code Compliance 8 1
4 Balboa Park Graffiti - Code Enforcement 2 1
... ... ... ... ...
1416 Via De La Valle Encampment 0 1
1417 Via De La Valle Pavement Maintenance 0 3
1418 Via De La Valle Pothole 11 7
1419 Via De La Valle Sidewalk Repair Issue 0 1
1420 Via De La Valle Street Sweeping 1 0

1421 rows × 4 columns

In [21]:
requests = requests.assign(
    total=requests.get('closed') + requests.get('open')
)
requests
Out[21]:
neighborhood service closed open total
0 Balboa Park Dead Animal 11 0 11
1 Balboa Park Development Services - Code Enforcement 2 0 2
2 Balboa Park Encampment 215 20 235
3 Balboa Park Environmental Services Code Compliance 8 1 9
4 Balboa Park Graffiti - Code Enforcement 2 1 3
... ... ... ... ... ...
1416 Via De La Valle Encampment 0 1 1
1417 Via De La Valle Pavement Maintenance 0 3 3
1418 Via De La Valle Pothole 11 7 18
1419 Via De La Valle Sidewalk Repair Issue 0 1 1
1420 Via De La Valle Street Sweeping 1 0 1

1421 rows × 5 columns

Example 2: Analyzing requests¶

Key concept: Computing statistics of columns using Series methods.

Questions¶

  • What is the largest number of service requests for any one service in any one neighborhood?
  • What is a typical number of service requests for any one service in any one neighborhood?

Series, like arrays, have helpful methods, including .min(), .max(), and .mean().

In [22]:
requests.get('total').max()
Out[22]:
4970

What is it that people are reporting so frequently, and where? We'll see how to find out shortly!

Other statistics:

In [23]:
requests.get('total').mean()
Out[23]:
82.88529204785362
In [24]:
requests.get('total').median()
Out[24]:
18.0
In [25]:
requests.get('open').mean()
Out[25]:
16.58972554539057
In [26]:
requests.get('open').median()
Out[26]:
3.0
In [27]:
# Lots of information at once!
requests.get('total').describe()
Out[27]:
count    1421.00
mean       82.89
std       258.37
min         1.00
25%         5.00
50%        18.00
75%        54.00
max      4970.00
Name: total, dtype: float64

Example 3: What and where is the most frequently requested service?¶

Key concepts: Sorting. Accessing using integer positions.

Step 1 – Sorting the DataFrame¶

  • Use the .sort_values(by=column_name) method to sort.
    • The by= can be omitted, but helps with readability.
  • Like most DataFrame methods, this returns a new DataFrame.
In [28]:
requests.sort_values(by='total')
Out[28]:
neighborhood service closed open total
1420 Via De La Valle Street Sweeping 1 0 1
1105 San Ysidro Pavement Maintenance 0 1 1
1092 San Pasqual Traffic Signal Timing 1 0 1
1091 San Pasqual Traffic Signal Issue 1 0 1
1090 San Pasqual Traffic Sign Maintenance 1 0 1
... ... ... ... ... ...
1389 Uptown Graffiti - Public 1422 384 1806
147 Clairemont Mesa Pothole 1847 164 2011
1213 Southeastern San Diego Illegal Dumping 2965 15 2980
404 Mid-City:City Heights Illegal Dumping 3997 51 4048
207 Downtown Encampment 4321 649 4970

1421 rows × 5 columns

This sorts, but in ascending order (small to large). The opposite would be nice!

Step 1 – Sorting the DataFrame in descending order¶

  • Use .sort_values(by=column_name, ascending=False) to sort in descending order.
  • ascending is an optional argument. If omitted, it will be set to True by default.
    • This is an example of a keyword argument, or a named argument.
    • If we want to specify the sorting order, we must use the keyword ascending=.
In [29]:
ordered_requests = requests.sort_values(by='total', ascending=False)
ordered_requests
Out[29]:
neighborhood service closed open total
207 Downtown Encampment 4321 649 4970
404 Mid-City:City Heights Illegal Dumping 3997 51 4048
1213 Southeastern San Diego Illegal Dumping 2965 15 2980
147 Clairemont Mesa Pothole 1847 164 2011
1389 Uptown Graffiti - Public 1422 384 1806
... ... ... ... ... ...
391 Los Penasquitos Canyon Environmental Services Code Compliance 1 0 1
346 La Jolla Right-of-Way Code Enforcement 0 1 1
274 Fairbanks Ranch Country Club Stormwater 1 0 1
243 East Elliott Tree Maintenance 1 0 1
1420 Via De La Valle Street Sweeping 1 0 1

1421 rows × 5 columns

In [30]:
# We must specify the role of False by using ascending=, 
# otherwise Python does not know how to interpret this.
requests.sort_values(by='total', False)
  File "/var/folders/ch/hyjw6whx3g9gshnp58738jc80000gp/T/ipykernel_2092/3731718929.py", line 3
    requests.sort_values(by='total', False)
                                          ^
SyntaxError: positional argument follows keyword argument

Step 2 – Extracting the neighborhood and service¶

  • We saw that the most reported issue is 'Encampment' in 'Downtown', but how do we extract that information using code?
  • First, grab an entire column as a Series.
  • Navigate to a particular entry of the Series using .iloc[integer_position].
    • iloc stands for "integer location."
In [31]:
ordered_requests
Out[31]:
neighborhood service closed open total
207 Downtown Encampment 4321 649 4970
404 Mid-City:City Heights Illegal Dumping 3997 51 4048
1213 Southeastern San Diego Illegal Dumping 2965 15 2980
147 Clairemont Mesa Pothole 1847 164 2011
1389 Uptown Graffiti - Public 1422 384 1806
... ... ... ... ... ...
391 Los Penasquitos Canyon Environmental Services Code Compliance 1 0 1
346 La Jolla Right-of-Way Code Enforcement 0 1 1
274 Fairbanks Ranch Country Club Stormwater 1 0 1
243 East Elliott Tree Maintenance 1 0 1
1420 Via De La Valle Street Sweeping 1 0 1

1421 rows × 5 columns

In [32]:
ordered_requests.get('neighborhood')
Out[32]:
207                         Downtown
404            Mid-City:City Heights
1213          Southeastern San Diego
147                  Clairemont Mesa
1389                          Uptown
                    ...             
391           Los Penasquitos Canyon
346                         La Jolla
274     Fairbanks Ranch Country Club
243                     East Elliott
1420                 Via De La Valle
Name: neighborhood, Length: 1421, dtype: object
In [33]:
ordered_requests.get('neighborhood').iloc[0]
Out[33]:
'Downtown'
In [34]:
ordered_requests.get('service').iloc[0]
Out[34]:
'Encampment'

Example 4: Status of a request¶

Key concept: Accessing using row labels.

Status of a request¶

  • On April 8th, you submitted service request 4183848. Has the issue been resolved?

  • This cannot be answered from the annual summary data, but must be answered from the detailed data about April 8th.

In [35]:
apr_08
Out[35]:
date_requested neighborhood service status street_address public_description
service_request_id
4183116 2023-04-08T00:32:00 Downtown Traffic Signal Issue open 2ND AVE & G ST Signal, 2nd & G
4183117 2023-04-08T00:44:00 Mid-City:Eastern Area Missed Collection open 4791 Seminole Dr, San Diego, CA 92115, USA My Blue Recycle Bin was not collected. I'm th...
4183118 2023-04-08T00:49:00 Navajo Parking open 4728 Allied Rd, San Diego, CA 92120, USA White van parked across my driveway. I will n...
4183119 2023-04-08T01:04:00 Encanto Neighborhoods Missed Collection open 6066 Tempas Ct Green container
4183120 2023-04-08T01:10:00 Navajo Pothole open 7961?7979 Topaz Lake Ave Potholes || LOCATION: Topaz Lake between Pear...
... ... ... ... ... ... ...
4184220 2023-04-08T19:53:00 Barrio Logan Other closed 3718 Dalbergia St Prostitution
4184221 2023-04-08T19:53:00 Barrio Logan Other closed 3743 Dalbergia St Prostitution
4184223 2023-04-08T19:54:00 Barrio Logan Other closed 2120 Woden St Prostitution
4184225 2023-04-08T19:54:00 Barrio Logan Other closed 2005 Vesta St Prostitution
4184294 2023-04-08T23:36:00 Southeastern San Diego Other closed 3762 Cottonwood St Loud music

1089 rows × 6 columns

Your service request is buried in the middle of the DataFrame. Only the first few rows and last few rows are shown, so you can't tell just by looking at the DataFrame.

Accessing using the row label¶

To pull out one particular entry of a DataFrame corresponding to a row and column with certain labels:

  1. Use .get(column_name) to extract the entire column as a Series.
  2. Use .loc[] to access the element of a Series with a particular row label.

In this class, we'll always first access a column, then a row (but row, then column is also possible).

In [36]:
apr_08.get('status')
Out[36]:
service_request_id
4183116      open
4183117      open
4183118      open
4183119      open
4183120      open
            ...  
4184220    closed
4184221    closed
4184223    closed
4184225    closed
4184294    closed
Name: status, Length: 1089, dtype: object
In [37]:
apr_08.get('status').loc[4183848]
Out[37]:
'open'

Activity 🚔¶

Oh no, your service request 4183848 has still not been resolved! What was the problem again?

Write one line of code that evaluates to the full description of the problem, as you described it in your service request.

In [38]:
...
Out[38]:
Ellipsis

Summary: Accessing elements in a Series¶

  • There are two ways to get an element of a Series:
    • .loc[] uses the row label.
    • .iloc[] uses the integer position.
  • Usually .loc[] is more convenient, but each is best for different scenarios.

Note¶

  • Sometimes the integer position and row label are the same.
  • This happens by default with bpd.read_csv.
In [39]:
bpd.read_csv('data/get-it-done-apr-08.csv')
Out[39]:
service_request_id date_requested neighborhood service status street_address public_description
0 4183116 2023-04-08T00:32:00 Downtown Traffic Signal Issue open 2ND AVE & G ST Signal, 2nd & G
1 4183117 2023-04-08T00:44:00 Mid-City:Eastern Area Missed Collection open 4791 Seminole Dr, San Diego, CA 92115, USA My Blue Recycle Bin was not collected. I'm th...
2 4183118 2023-04-08T00:49:00 Navajo Parking open 4728 Allied Rd, San Diego, CA 92120, USA White van parked across my driveway. I will n...
3 4183119 2023-04-08T01:04:00 Encanto Neighborhoods Missed Collection open 6066 Tempas Ct Green container
4 4183120 2023-04-08T01:10:00 Navajo Pothole open 7961?7979 Topaz Lake Ave Potholes || LOCATION: Topaz Lake between Pear...
... ... ... ... ... ... ... ...
1084 4184220 2023-04-08T19:53:00 Barrio Logan Other closed 3718 Dalbergia St Prostitution
1085 4184221 2023-04-08T19:53:00 Barrio Logan Other closed 3743 Dalbergia St Prostitution
1086 4184223 2023-04-08T19:54:00 Barrio Logan Other closed 2120 Woden St Prostitution
1087 4184225 2023-04-08T19:54:00 Barrio Logan Other closed 2005 Vesta St Prostitution
1088 4184294 2023-04-08T23:36:00 Southeastern San Diego Other closed 3762 Cottonwood St Loud music

1089 rows × 7 columns

In [40]:
bpd.read_csv('data/get-it-done-apr-08.csv').get('public_description').loc[561]
Out[40]:
'Abandoned van'
In [41]:
bpd.read_csv('data/get-it-done-apr-08.csv').get('public_description').iloc[561]
Out[41]:
'Abandoned van'

Reflection¶

Questions we can answer right now...¶

  • What is the largest number of open requests of one type in one neighborhood?
    • requests.get('open').max().
  • How many requests were made on April 8th?
    • apr_08.shape[0].
  • What is the description of the latest request made on April 8th?
    • apr_08.sort_values(by='date_requested', ascending=False).get('public_description').iloc[0].

Moving forward, let's just focus on the requests DataFrame. As a reminder, here's what it looks like:

In [42]:
requests
Out[42]:
neighborhood service closed open total
0 Balboa Park Dead Animal 11 0 11
1 Balboa Park Development Services - Code Enforcement 2 0 2
2 Balboa Park Encampment 215 20 235
3 Balboa Park Environmental Services Code Compliance 8 1 9
4 Balboa Park Graffiti - Code Enforcement 2 1 3
... ... ... ... ... ...
1416 Via De La Valle Encampment 0 1 1
1417 Via De La Valle Pavement Maintenance 0 3 3
1418 Via De La Valle Pothole 11 7 18
1419 Via De La Valle Sidewalk Repair Issue 0 1 1
1420 Via De La Valle Street Sweeping 1 0 1

1421 rows × 5 columns

Questions we can't yet answer...¶

  • Which neighborhood has the most 'Pothole' requests?
  • What is the most commonly requested service in the 'University' neighborhood (near UCSD)?
  • In the 'Downtown' neighborhood, how many open service requests are there?

The common thread between these questions is that they all involve only a subset of the rows in our DataFrame.

Example 5: Which neighborhood has the most 'Pothole' requests? 🕳¶

Key concept: Querying.

Selecting rows¶

  • We could determine the neighborhood with the most 'Pothole' requests if we had a DataFrame consisting of only these type of requests.
    • We would sort by the 'total' column in descending order, then extract the neighborhood name in the first row.
  • How do we get that DataFrame?

The solution¶

In [43]:
# This DataFrame only contains rows where the 'service' is 'Pothole'!
only_potholes = requests[requests.get('service') == 'Pothole']
only_potholes
Out[43]:
neighborhood service closed open total
12 Balboa Park Pothole 70 8 78
40 Barrio Logan Pothole 111 9 120
67 Black Mountain Ranch Pothole 19 1 20
91 Carmel Mountain Ranch Pothole 245 85 330
118 Carmel Valley Pothole 111 98 209
... ... ... ... ... ...
1314 Torrey Hills Pothole 11 1 12
1339 Torrey Pines Pothole 280 161 441
1366 University Pothole 712 249 961
1396 Uptown Pothole 513 87 600
1418 Via De La Valle Pothole 11 7 18

56 rows × 5 columns

Now that we have a DataFrame with only 'Pothole' requests, we can sort by 'total' in descending order and extract the 'neighborhood' in the first row, just like we planned:

In [44]:
only_potholes.sort_values('total', ascending=False).get('neighborhood').iloc[0]
Out[44]:
'Clairemont Mesa'

🤯 What just happened?

Aside: Booleans¶

  • When we compare two values, the result is either True or False.
    • Notice, these words are not in quotes.
  • bool is a data type in Python, just like int, float, and str.
    • It stands for "Boolean", named after George Boole, an early mathematician.
  • There are only two possible Boolean values: True or False.
    • Yes or no.
    • On or off.
    • 1 or 0.
In [45]:
5 == 6
Out[45]:
False
In [46]:
type(5 == 6)
Out[46]:
bool
In [47]:
9 + 10 < 21
Out[47]:
True
In [48]:
'zebra' == 'zeb' + 'ra'
Out[48]:
True

Comparison operators¶

There are several types of comparisons we can make.

symbol meaning
== equal to
!= not equal to
< less than
<= less than or equal to
> greater than
>= greater than or equal to

When comparing an entire Series to a value, the result is a Series of bools.

In [49]:
requests
Out[49]:
neighborhood service closed open total
0 Balboa Park Dead Animal 11 0 11
1 Balboa Park Development Services - Code Enforcement 2 0 2
2 Balboa Park Encampment 215 20 235
3 Balboa Park Environmental Services Code Compliance 8 1 9
4 Balboa Park Graffiti - Code Enforcement 2 1 3
... ... ... ... ... ...
1416 Via De La Valle Encampment 0 1 1
1417 Via De La Valle Pavement Maintenance 0 3 3
1418 Via De La Valle Pothole 11 7 18
1419 Via De La Valle Sidewalk Repair Issue 0 1 1
1420 Via De La Valle Street Sweeping 1 0 1

1421 rows × 5 columns

In [50]:
requests.get('service') == 'Pothole'
Out[50]:
0       False
1       False
2       False
3       False
4       False
        ...  
1416    False
1417    False
1418     True
1419    False
1420    False
Name: service, Length: 1421, dtype: bool

What is a query? 🤔¶

  • A "query" is code that extracts rows from a DataFrame for which certain condition(s) are true.
  • We often use queries to filter DataFrames so that they only contain the rows that satisfy the conditions stated in our questions.

How do we query a DataFrame?¶

To select only certain rows of requests:

  1. Make a sequence (list/array/Series) of Trues (keep) and Falses (toss), usually by making a comparison.
  2. Then pass it into requests[sequence_goes_here].
In [51]:
requests[requests.get('service') == 'Pothole']
Out[51]:
neighborhood service closed open total
12 Balboa Park Pothole 70 8 78
40 Barrio Logan Pothole 111 9 120
67 Black Mountain Ranch Pothole 19 1 20
91 Carmel Mountain Ranch Pothole 245 85 330
118 Carmel Valley Pothole 111 98 209
... ... ... ... ... ...
1314 Torrey Hills Pothole 11 1 12
1339 Torrey Pines Pothole 280 161 441
1366 University Pothole 712 249 961
1396 Uptown Pothole 513 87 600
1418 Via De La Valle Pothole 11 7 18

56 rows × 5 columns

Another query¶

This time, we'll find the neighborhoods and services with over 100 open requests.

In [52]:
requests
Out[52]:
neighborhood service closed open total
0 Balboa Park Dead Animal 11 0 11
1 Balboa Park Development Services - Code Enforcement 2 0 2
2 Balboa Park Encampment 215 20 235
3 Balboa Park Environmental Services Code Compliance 8 1 9
4 Balboa Park Graffiti - Code Enforcement 2 1 3
... ... ... ... ... ...
1416 Via De La Valle Encampment 0 1 1
1417 Via De La Valle Pavement Maintenance 0 3 3
1418 Via De La Valle Pothole 11 7 18
1419 Via De La Valle Sidewalk Repair Issue 0 1 1
1420 Via De La Valle Street Sweeping 1 0 1

1421 rows × 5 columns

In [53]:
requests.get('open') > 100
Out[53]:
0       False
1       False
2       False
3       False
4       False
        ...  
1416    False
1417    False
1418    False
1419    False
1420    False
Name: open, Length: 1421, dtype: bool
In [54]:
requests[requests.get('open') > 100]
Out[54]:
neighborhood service closed open total
147 Clairemont Mesa Pothole 1847 164 2011
207 Downtown Encampment 4321 649 4970
210 Downtown Graffiti - Public 1039 345 1384
218 Downtown ROW Maintenance 28 120 148
220 Downtown Shared Mobility Device 1466 160 1626
... ... ... ... ... ...
1366 University Pothole 712 249 961
1386 Uptown Encampment 763 121 884
1389 Uptown Graffiti - Public 1422 384 1806
1391 Uptown Missed Collection 656 103 759
1393 Uptown Parking 608 258 866

38 rows × 5 columns

Summary¶

Summary¶

  • We learned many DataFrame methods and techniques. Don't feel the need to memorize them all right away.
  • Instead, refer to this lecture, the DSC 10 reference sheet, the babypandas notes, and the babypandas documentation when working on assignments.
  • Over time, these techniques will become more and more familiar. Lab 1 will walk you through many of them.
  • Practice! Frame your own questions using this dataset and try to answer them.

Next time¶

  • We'll start by reviewing queries, and talk about how to write queries with multiple conditions.
  • We'll answer more complicated questions, which will lead us to a new core DataFrame method, .groupby, for organizing rows of a DataFrame with the same value in a particular column.
  • For example, we might want to organize the data by neighborhood, collecting all the different service requests for each neighborhood.