Lecture 4 – DataFrames: Accessing, Sorting, and Querying¶

DSC 10, Winter 2023¶

Announcements¶

  • Lab 0 was due last night, but you can still turn it in late using slip days. You never need to ask to use a slip day.
  • Lab 1 is the next assignment, and it's due Saturday at 11:59PM.
  • Homework 1 is also out, and it's due Tuesday at 11:59PM.
    • Make sure to finish Lab 1 before attempting Homework 1. Homeworks build off of the skills learned in labs.
  • Come to office hours for help!

Agenda¶

Today, we'll use a real dataset and lots of motivating questions to illustrate key DataFrame manipulation techniques.

Note:¶

  • Remember to check the Resources tab of the course website for programming resources.
  • Some key links moving forward:
    • DSC 10 Reference Sheet.
    • babypandas notes.
    • babypandas documentation.

DataFrames¶

pandas¶

  • DataFrames (tables) are provided by a package called pandas.
  • pandas is the tool for doing data science in Python.

But pandas is not so cute...¶

Enter babypandas!¶

  • We at UCSD have created a smaller, nicer version of pandas called babypandas.
  • It keeps the important stuff and has much better error messages.
  • It's easier to learn, but is still valid pandas code.

DataFrames in babypandas 🐼¶

  • Tables in babypandas (and pandas) are called "DataFrames."
  • To use DataFrames, we'll need to import babypandas. (We'll need numpy as well.)
In [1]:
import babypandas as bpd
import numpy as np

About the Data: Get It Done 👷¶

  • We'll usually work with data stored in the CSV format. CSV stands for "comma-separated values."
  • The file data/get-it-done-jan-16.csv contains service requests made on January 16, 2023 through the Get It Done program.
  • Get It Done allows the general public to report non-emergency problems to the City of San Diego through a mobile app, website, or phone call.

Reading data from a file 📖¶

We can read in a CSV using bpd.read_csv(...). Give it the path to a file relative to your notebook (if the file is in the same folder as your notebook, this is just the name of the file).

In [2]:
jan_16 = bpd.read_csv('data/get-it-done-jan-16.csv')
jan_16
Out[2]:
service_request_id date_requested neighborhood service status street_address public_description
0 4065025 2023-01-16T00:03:00 Pacific Beach Parking open 4630 Bayard St Cars blocking alleyway, not on private propert...
1 4065026 2023-01-16T00:33:00 Otay Mesa-Nestor Encampment open 615 Saturn Blvd, San Diego, CA 92154, USA There are encampments located behind I - hop, ...
2 4065027 2023-01-16T00:37:00 Otay Mesa-Nestor Encampment open 1862 Palm Ave, San Diego, CA 92154, USA Encampments with large group of encampment ind...
3 4065028 2023-01-16T00:46:00 Old Town San Diego Street Sweeping open 4400 TAYLOR ST, 92110 ROCKSLIDE- COVERING BOTH LANES - BOBCAT NEEDED...
4 4065029 2023-01-16T00:52:00 Mid-City:City Heights Other open 3860 Cherokee Ave House near me has a car that multiple times a...
... ... ... ... ... ... ... ...
1355 4066532 2023-01-16T22:09:00 Navajo Other closed 6735 Wandermere Dr Cars doing donuts in st Dunsteen's church park...
1356 4066544 2023-01-16T22:19:00 Mid-City:City Heights Other closed 3949 33rd Street Destruction of property. Man camped on 33rd St...
1357 4066547 2023-01-16T22:21:00 Torrey Pines Pothole closed I-5, Del Mar, CA 92014, USA There is a big pot hole that made me pop my ti...
1358 4066564 2023-01-16T22:43:00 Southeastern San Diego Pothole closed CA-94, San Diego, CA 92102, USA Pothole is getting bigger with the rain and yo...
1359 4066565 2023-01-16T22:43:00 Greater Golden Hill Pothole closed CA-94, San Diego, CA 92102, USA Pothole on the off-ramp to 30th. Popped my tir...

1360 rows × 7 columns

Structure of a DataFrame¶

  • DataFrames have columns and rows.
    • Think of each column as an array. Columns contain data of the same type.
  • Each column has a label, e.g. 'neighborhood' and 'status'.
    • A column's label is its name.
    • Column labels are stored as strings.
  • Each row has a label too.
    • Together, the row labels are called the index. The index is not a column!
In [3]:
# This DataFrame has 1360 rows and 7 columns
jan_16
Out[3]:
service_request_id date_requested neighborhood service status street_address public_description
0 4065025 2023-01-16T00:03:00 Pacific Beach Parking open 4630 Bayard St Cars blocking alleyway, not on private propert...
1 4065026 2023-01-16T00:33:00 Otay Mesa-Nestor Encampment open 615 Saturn Blvd, San Diego, CA 92154, USA There are encampments located behind I - hop, ...
2 4065027 2023-01-16T00:37:00 Otay Mesa-Nestor Encampment open 1862 Palm Ave, San Diego, CA 92154, USA Encampments with large group of encampment ind...
3 4065028 2023-01-16T00:46:00 Old Town San Diego Street Sweeping open 4400 TAYLOR ST, 92110 ROCKSLIDE- COVERING BOTH LANES - BOBCAT NEEDED...
4 4065029 2023-01-16T00:52:00 Mid-City:City Heights Other open 3860 Cherokee Ave House near me has a car that multiple times a...
... ... ... ... ... ... ... ...
1355 4066532 2023-01-16T22:09:00 Navajo Other closed 6735 Wandermere Dr Cars doing donuts in st Dunsteen's church park...
1356 4066544 2023-01-16T22:19:00 Mid-City:City Heights Other closed 3949 33rd Street Destruction of property. Man camped on 33rd St...
1357 4066547 2023-01-16T22:21:00 Torrey Pines Pothole closed I-5, Del Mar, CA 92014, USA There is a big pot hole that made me pop my ti...
1358 4066564 2023-01-16T22:43:00 Southeastern San Diego Pothole closed CA-94, San Diego, CA 92102, USA Pothole is getting bigger with the rain and yo...
1359 4066565 2023-01-16T22:43:00 Greater Golden Hill Pothole closed CA-94, San Diego, CA 92102, USA Pothole on the off-ramp to 30th. Popped my tir...

1360 rows × 7 columns

Setting a new index¶

  • We can set a better index using .set_index(column_name).
  • Row labels should be unique identifiers.
    • Row labels are row names; ideally, each row has a different, descriptive name.
  • ⚠️ Like most DataFrame methods, .set_index returns a new DataFrame; it does not modify the original DataFrame.
In [4]:
jan_16.set_index('service_request_id')
Out[4]:
date_requested neighborhood service status street_address public_description
service_request_id
4065025 2023-01-16T00:03:00 Pacific Beach Parking open 4630 Bayard St Cars blocking alleyway, not on private propert...
4065026 2023-01-16T00:33:00 Otay Mesa-Nestor Encampment open 615 Saturn Blvd, San Diego, CA 92154, USA There are encampments located behind I - hop, ...
4065027 2023-01-16T00:37:00 Otay Mesa-Nestor Encampment open 1862 Palm Ave, San Diego, CA 92154, USA Encampments with large group of encampment ind...
4065028 2023-01-16T00:46:00 Old Town San Diego Street Sweeping open 4400 TAYLOR ST, 92110 ROCKSLIDE- COVERING BOTH LANES - BOBCAT NEEDED...
4065029 2023-01-16T00:52:00 Mid-City:City Heights Other open 3860 Cherokee Ave House near me has a car that multiple times a...
... ... ... ... ... ... ...
4066532 2023-01-16T22:09:00 Navajo Other closed 6735 Wandermere Dr Cars doing donuts in st Dunsteen's church park...
4066544 2023-01-16T22:19:00 Mid-City:City Heights Other closed 3949 33rd Street Destruction of property. Man camped on 33rd St...
4066547 2023-01-16T22:21:00 Torrey Pines Pothole closed I-5, Del Mar, CA 92014, USA There is a big pot hole that made me pop my ti...
4066564 2023-01-16T22:43:00 Southeastern San Diego Pothole closed CA-94, San Diego, CA 92102, USA Pothole is getting bigger with the rain and yo...
4066565 2023-01-16T22:43:00 Greater Golden Hill Pothole closed CA-94, San Diego, CA 92102, USA Pothole on the off-ramp to 30th. Popped my tir...

1360 rows × 6 columns

In [5]:
jan_16
Out[5]:
service_request_id date_requested neighborhood service status street_address public_description
0 4065025 2023-01-16T00:03:00 Pacific Beach Parking open 4630 Bayard St Cars blocking alleyway, not on private propert...
1 4065026 2023-01-16T00:33:00 Otay Mesa-Nestor Encampment open 615 Saturn Blvd, San Diego, CA 92154, USA There are encampments located behind I - hop, ...
2 4065027 2023-01-16T00:37:00 Otay Mesa-Nestor Encampment open 1862 Palm Ave, San Diego, CA 92154, USA Encampments with large group of encampment ind...
3 4065028 2023-01-16T00:46:00 Old Town San Diego Street Sweeping open 4400 TAYLOR ST, 92110 ROCKSLIDE- COVERING BOTH LANES - BOBCAT NEEDED...
4 4065029 2023-01-16T00:52:00 Mid-City:City Heights Other open 3860 Cherokee Ave House near me has a car that multiple times a...
... ... ... ... ... ... ... ...
1355 4066532 2023-01-16T22:09:00 Navajo Other closed 6735 Wandermere Dr Cars doing donuts in st Dunsteen's church park...
1356 4066544 2023-01-16T22:19:00 Mid-City:City Heights Other closed 3949 33rd Street Destruction of property. Man camped on 33rd St...
1357 4066547 2023-01-16T22:21:00 Torrey Pines Pothole closed I-5, Del Mar, CA 92014, USA There is a big pot hole that made me pop my ti...
1358 4066564 2023-01-16T22:43:00 Southeastern San Diego Pothole closed CA-94, San Diego, CA 92102, USA Pothole is getting bigger with the rain and yo...
1359 4066565 2023-01-16T22:43:00 Greater Golden Hill Pothole closed CA-94, San Diego, CA 92102, USA Pothole on the off-ramp to 30th. Popped my tir...

1360 rows × 7 columns

In [6]:
jan_16 = jan_16.set_index('service_request_id')
jan_16
Out[6]:
date_requested neighborhood service status street_address public_description
service_request_id
4065025 2023-01-16T00:03:00 Pacific Beach Parking open 4630 Bayard St Cars blocking alleyway, not on private propert...
4065026 2023-01-16T00:33:00 Otay Mesa-Nestor Encampment open 615 Saturn Blvd, San Diego, CA 92154, USA There are encampments located behind I - hop, ...
4065027 2023-01-16T00:37:00 Otay Mesa-Nestor Encampment open 1862 Palm Ave, San Diego, CA 92154, USA Encampments with large group of encampment ind...
4065028 2023-01-16T00:46:00 Old Town San Diego Street Sweeping open 4400 TAYLOR ST, 92110 ROCKSLIDE- COVERING BOTH LANES - BOBCAT NEEDED...
4065029 2023-01-16T00:52:00 Mid-City:City Heights Other open 3860 Cherokee Ave House near me has a car that multiple times a...
... ... ... ... ... ... ...
4066532 2023-01-16T22:09:00 Navajo Other closed 6735 Wandermere Dr Cars doing donuts in st Dunsteen's church park...
4066544 2023-01-16T22:19:00 Mid-City:City Heights Other closed 3949 33rd Street Destruction of property. Man camped on 33rd St...
4066547 2023-01-16T22:21:00 Torrey Pines Pothole closed I-5, Del Mar, CA 92014, USA There is a big pot hole that made me pop my ti...
4066564 2023-01-16T22:43:00 Southeastern San Diego Pothole closed CA-94, San Diego, CA 92102, USA Pothole is getting bigger with the rain and yo...
4066565 2023-01-16T22:43:00 Greater Golden Hill Pothole closed CA-94, San Diego, CA 92102, USA Pothole on the off-ramp to 30th. Popped my tir...

1360 rows × 6 columns

Shape of a DataFrame¶

  • .shape returns the number of rows and columns in a given DataFrame.
  • Access each with []:
    • .shape[0] for rows.
    • .shape[1] for columns.
In [7]:
# There were 7 columns before, but one of them became the index, and the index is not a column!
jan_16.shape
Out[7]:
(1360, 6)
In [8]:
# Number of rows
jan_16.shape[0]
Out[8]:
1360
In [9]:
# Number of columns
jan_16.shape[1]
Out[9]:
6

Annual summary of Get It Done requests¶

  • The file data/get-it-done-requests.csv contains a summary of all Get It Done requests submitted so far in 2023.
  • This whole year's worth of data shows the types of problems being reported in each neighborhood and how many service requests are resolved ('closed') versus unresolved ('open').
In [10]:
requests = bpd.read_csv('data/get-it-done-requests.csv')
requests
Out[10]:
neighborhood service closed open
0 Balboa Park Dead Animal 4 0
1 Balboa Park Encampment 22 16
2 Balboa Park Environmental Services Code Compliance 0 1
3 Balboa Park Graffiti - Code Enforcement 0 1
4 Balboa Park Graffiti - Public 62 37
... ... ... ... ...
1057 Uptown Tree Maintenance 19 29
1058 Uptown Vegetation Encroachment 1 2
1059 Uptown Waste on Private Property 3 1
1060 Uptown Weed Cleanup 1 0
1061 Via De La Valle Pothole 0 2

1062 rows × 4 columns

Example 1: Total requests¶

Key concepts: Accessing columns, performing operations with them, and adding new columns.

Finding total requests¶

  • Question: How many service requests of each type in each neighborhood have been made this year?
  • We have, separately, the number of closed service requests and open service requests of each type in each neighborhood.
  • Workflow:
    • Get the column of closed requests.
    • Get the column of open requests.
    • Add these columns element-wise.
    • Add a new column to the DataFrame with these totals.

Step 1 – Getting a column¶

  • We can get a column from a DataFrame using .get(column_name).
  • ⚠️ Column names are case sensitive!
  • Column names are strings, so we need to use quotes.
  • The result looks like a 1-column DataFrame, but is actually a Series.
In [11]:
requests
Out[11]:
neighborhood service closed open
0 Balboa Park Dead Animal 4 0
1 Balboa Park Encampment 22 16
2 Balboa Park Environmental Services Code Compliance 0 1
3 Balboa Park Graffiti - Code Enforcement 0 1
4 Balboa Park Graffiti - Public 62 37
... ... ... ... ...
1057 Uptown Tree Maintenance 19 29
1058 Uptown Vegetation Encroachment 1 2
1059 Uptown Waste on Private Property 3 1
1060 Uptown Weed Cleanup 1 0
1061 Via De La Valle Pothole 0 2

1062 rows × 4 columns

In [12]:
requests.get('closed')
Out[12]:
0        4
1       22
2        0
3        0
4       62
        ..
1057    19
1058     1
1059     3
1060     1
1061     0
Name: closed, Length: 1062, dtype: int64

Digression: Series¶

  • A Series is like an array, but with an index.
  • In particular, Series support arithmetic.
In [13]:
requests.get('closed')
Out[13]:
0        4
1       22
2        0
3        0
4       62
        ..
1057    19
1058     1
1059     3
1060     1
1061     0
Name: closed, Length: 1062, dtype: int64

Step 2 – Getting another column¶

In [14]:
requests.get('open')
Out[14]:
0        0
1       16
2        1
3        1
4       37
        ..
1057    29
1058     2
1059     1
1060     0
1061     2
Name: open, Length: 1062, dtype: int64

Step 3 – Calculating the total¶

  • Just like with arrays, we can perform arithmetic operations with two Series, as long as they have the same length and same index.
  • Operations happen element-wise.
  • The result is also a Series.
In [15]:
requests.get('closed') + requests.get('open')
Out[15]:
0        4
1       38
2        1
3        1
4       99
        ..
1057    48
1058     3
1059     4
1060     1
1061     2
Length: 1062, dtype: int64

Step 4 – Adding the totals to the DataFrame as a new column¶

  • Use .assign(name_of_column=data_in_series) to assign a Series (or array, or list) to a DataFrame.
  • ⚠️ Don't put quotes around name_of_column.
  • Creates a new DataFrame! Must save to variable.
In [16]:
requests.assign(
    total=requests.get('closed') + requests.get('open')
)
Out[16]:
neighborhood service closed open total
0 Balboa Park Dead Animal 4 0 4
1 Balboa Park Encampment 22 16 38
2 Balboa Park Environmental Services Code Compliance 0 1 1
3 Balboa Park Graffiti - Code Enforcement 0 1 1
4 Balboa Park Graffiti - Public 62 37 99
... ... ... ... ... ...
1057 Uptown Tree Maintenance 19 29 48
1058 Uptown Vegetation Encroachment 1 2 3
1059 Uptown Waste on Private Property 3 1 4
1060 Uptown Weed Cleanup 1 0 1
1061 Via De La Valle Pothole 0 2 2

1062 rows × 5 columns

In [17]:
requests
Out[17]:
neighborhood service closed open
0 Balboa Park Dead Animal 4 0
1 Balboa Park Encampment 22 16
2 Balboa Park Environmental Services Code Compliance 0 1
3 Balboa Park Graffiti - Code Enforcement 0 1
4 Balboa Park Graffiti - Public 62 37
... ... ... ... ...
1057 Uptown Tree Maintenance 19 29
1058 Uptown Vegetation Encroachment 1 2
1059 Uptown Waste on Private Property 3 1
1060 Uptown Weed Cleanup 1 0
1061 Via De La Valle Pothole 0 2

1062 rows × 4 columns

In [18]:
requests = requests.assign(
    total=requests.get('closed') + requests.get('open')
)
requests
Out[18]:
neighborhood service closed open total
0 Balboa Park Dead Animal 4 0 4
1 Balboa Park Encampment 22 16 38
2 Balboa Park Environmental Services Code Compliance 0 1 1
3 Balboa Park Graffiti - Code Enforcement 0 1 1
4 Balboa Park Graffiti - Public 62 37 99
... ... ... ... ... ...
1057 Uptown Tree Maintenance 19 29 48
1058 Uptown Vegetation Encroachment 1 2 3
1059 Uptown Waste on Private Property 3 1 4
1060 Uptown Weed Cleanup 1 0 1
1061 Via De La Valle Pothole 0 2 2

1062 rows × 5 columns

Example 2: Analyzing requests¶

Key concept: Computing statistics of columns using Series methods.

Questions¶

  • What is the largest number of service requests for any one service in any one neighborhood?
  • What is a typical number of service requests for any one service in any one neighborhood?

Series, like arrays, have helpful methods, including .min(), .max(), and .mean().

In [19]:
requests.get('total').max()
Out[19]:
842

What is it that people are reporting so frequently, and where? We'll see how to find out shortly!

Other statistics:

In [20]:
requests.get('total').mean()
Out[20]:
17.718455743879474
In [21]:
requests.get('total').median()
Out[21]:
5.0
In [22]:
requests.get('open').mean()
Out[22]:
10.002824858757062
In [23]:
requests.get('open').median()
Out[23]:
2.0

Example 3: What and where is the most frequently requested service?¶

Key concepts: Sorting. Accessing using integer positions.

Step 1 – Sorting the DataFrame¶

  • Use the .sort_values(by=column_name) method to sort.
    • The by= can be omitted, but helps readability.
  • Like most DataFrame methods, this returns a new DataFrame.
In [24]:
requests.sort_values(by='total')
Out[24]:
neighborhood service closed open total
483 Miramar Ranch North Weed Cleanup 1 0 1
627 Old Town San Diego Missed Collection 0 1 1
876 Serra Mesa Street Light Maintenance 0 1 1
167 Downtown Right-of-Way Code Enforcement 0 1 1
866 Serra Mesa Environmental Services Code Compliance 1 0 1
... ... ... ... ... ...
916 Southeastern San Diego Illegal Dumping 338 50 388
108 Clairemont Mesa Pothole 46 381 427
173 Downtown Street Light Maintenance 10 420 430
318 Mid-City:City Heights Illegal Dumping 451 31 482
155 Downtown Encampment 315 527 842

1062 rows × 5 columns

This sorts, but in ascending order (small to large). We want the opposite!

Step 1 – Sorting the DataFrame in descending order¶

  • Use .sort_values(by=column_name, ascending=False) to sort in descending order.
  • ascending is an optional argument. If omitted, it will be set to True by default.
    • This is an example of a keyword argument, or a named argument.
    • If we want to specify the sorting order, we must use the keyword ascending=.
In [25]:
ordered_requests = requests.sort_values(by='total', ascending=False)
ordered_requests
Out[25]:
neighborhood service closed open total
155 Downtown Encampment 315 527 842
318 Mid-City:City Heights Illegal Dumping 451 31 482
173 Downtown Street Light Maintenance 10 420 430
108 Clairemont Mesa Pothole 46 381 427
916 Southeastern San Diego Illegal Dumping 338 50 388
... ... ... ... ... ...
876 Serra Mesa Street Light Maintenance 0 1 1
878 Serra Mesa Traffic Engineering 0 1 1
470 Mira Mesa Waste on Private Property 0 1 1
880 Serra Mesa Traffic Signal Issue 1 0 1
531 Mission Valley Sidewalk Repair Issue 0 1 1

1062 rows × 5 columns

In [26]:
# We must specify the role of False by using ascending=, 
# otherwise Python does not know how to interpret this.
ordered_requests = requests.sort_values(by='total', False)
ordered_requests
  File "/var/folders/ch/hyjw6whx3g9gshnp58738jc80000gp/T/ipykernel_12154/1443585216.py", line 3
    ordered_requests = requests.sort_values(by='total', False)
                                                             ^
SyntaxError: positional argument follows keyword argument

Step 2 – Extracting the neighborhood and service¶

  • We saw that the most reported issue is 'Encampment' in 'Downtown', but how do we extract that information using code?
  • First, grab an entire column as a Series.
  • Navigate to a particular entry of the Series using .iloc[integer_position].
In [27]:
ordered_requests
Out[27]:
neighborhood service closed open total
155 Downtown Encampment 315 527 842
318 Mid-City:City Heights Illegal Dumping 451 31 482
173 Downtown Street Light Maintenance 10 420 430
108 Clairemont Mesa Pothole 46 381 427
916 Southeastern San Diego Illegal Dumping 338 50 388
... ... ... ... ... ...
876 Serra Mesa Street Light Maintenance 0 1 1
878 Serra Mesa Traffic Engineering 0 1 1
470 Mira Mesa Waste on Private Property 0 1 1
880 Serra Mesa Traffic Signal Issue 1 0 1
531 Mission Valley Sidewalk Repair Issue 0 1 1

1062 rows × 5 columns

In [28]:
ordered_requests.get('neighborhood')
Out[28]:
155                  Downtown
318     Mid-City:City Heights
173                  Downtown
108           Clairemont Mesa
916    Southeastern San Diego
                ...          
876                Serra Mesa
878                Serra Mesa
470                 Mira Mesa
880                Serra Mesa
531            Mission Valley
Name: neighborhood, Length: 1062, dtype: object
In [29]:
ordered_requests.get('neighborhood').iloc[0]
Out[29]:
'Downtown'
In [30]:
ordered_requests.get('service').iloc[0]
Out[30]:
'Encampment'

Example 4: Status of a request¶

Key concept: Accessing using row labels.

Status of a request¶

  • On January 16, you submitted service request 4065843. Has the issue been resolved?

  • This cannot be answered from the annual summary data, but must be answered from the detailed data about January 16.

In [31]:
jan_16
Out[31]:
date_requested neighborhood service status street_address public_description
service_request_id
4065025 2023-01-16T00:03:00 Pacific Beach Parking open 4630 Bayard St Cars blocking alleyway, not on private propert...
4065026 2023-01-16T00:33:00 Otay Mesa-Nestor Encampment open 615 Saturn Blvd, San Diego, CA 92154, USA There are encampments located behind I - hop, ...
4065027 2023-01-16T00:37:00 Otay Mesa-Nestor Encampment open 1862 Palm Ave, San Diego, CA 92154, USA Encampments with large group of encampment ind...
4065028 2023-01-16T00:46:00 Old Town San Diego Street Sweeping open 4400 TAYLOR ST, 92110 ROCKSLIDE- COVERING BOTH LANES - BOBCAT NEEDED...
4065029 2023-01-16T00:52:00 Mid-City:City Heights Other open 3860 Cherokee Ave House near me has a car that multiple times a...
... ... ... ... ... ... ...
4066532 2023-01-16T22:09:00 Navajo Other closed 6735 Wandermere Dr Cars doing donuts in st Dunsteen's church park...
4066544 2023-01-16T22:19:00 Mid-City:City Heights Other closed 3949 33rd Street Destruction of property. Man camped on 33rd St...
4066547 2023-01-16T22:21:00 Torrey Pines Pothole closed I-5, Del Mar, CA 92014, USA There is a big pot hole that made me pop my ti...
4066564 2023-01-16T22:43:00 Southeastern San Diego Pothole closed CA-94, San Diego, CA 92102, USA Pothole is getting bigger with the rain and yo...
4066565 2023-01-16T22:43:00 Greater Golden Hill Pothole closed CA-94, San Diego, CA 92102, USA Pothole on the off-ramp to 30th. Popped my tir...

1360 rows × 6 columns

Your service request is buried in the middle of the DataFrame. Only the first few rows and last few rows are shown, so you can't tell just by looking at the DataFrame.

Accessing using the row label¶

To pull out one particular entry of a DataFrame corresponding to a row and column with certain labels:

  1. Use .get(column_name) to extract the entire column as a Series.
  2. Use .loc[] to access the element of a Series with a particular row label.

In this class, we'll always first access a column, then a row (but row, then column is also possible).

In [32]:
jan_16.get('status')
Out[32]:
service_request_id
4065025      open
4065026      open
4065027      open
4065028      open
4065029      open
            ...  
4066532    closed
4066544    closed
4066547    closed
4066564    closed
4066565    closed
Name: status, Length: 1360, dtype: object
In [33]:
jan_16.get('status').loc[4065843]
Out[33]:
'open'

Activity 🌧️¶

Oh no, your service request 4065843 has still not been resolved! What was the problem again?

Write one line of code that evaluates to the full description of the problem, as you described it in your service request.

In [34]:
...
Out[34]:
Ellipsis

Summary of accessing a Series¶

  • There are two ways to get an element of a Series:
    • .loc[] uses the row label.
    • .iloc[] uses the integer position.
  • Usually .loc[] is more convenient, but each is best for different scenarios.

Note¶

  • Sometimes the integer position and row label are the same.
  • This happens by default with bpd.read_csv.
In [35]:
bpd.read_csv('data/get-it-done-jan-16.csv')
Out[35]:
service_request_id date_requested neighborhood service status street_address public_description
0 4065025 2023-01-16T00:03:00 Pacific Beach Parking open 4630 Bayard St Cars blocking alleyway, not on private propert...
1 4065026 2023-01-16T00:33:00 Otay Mesa-Nestor Encampment open 615 Saturn Blvd, San Diego, CA 92154, USA There are encampments located behind I - hop, ...
2 4065027 2023-01-16T00:37:00 Otay Mesa-Nestor Encampment open 1862 Palm Ave, San Diego, CA 92154, USA Encampments with large group of encampment ind...
3 4065028 2023-01-16T00:46:00 Old Town San Diego Street Sweeping open 4400 TAYLOR ST, 92110 ROCKSLIDE- COVERING BOTH LANES - BOBCAT NEEDED...
4 4065029 2023-01-16T00:52:00 Mid-City:City Heights Other open 3860 Cherokee Ave House near me has a car that multiple times a...
... ... ... ... ... ... ... ...
1355 4066532 2023-01-16T22:09:00 Navajo Other closed 6735 Wandermere Dr Cars doing donuts in st Dunsteen's church park...
1356 4066544 2023-01-16T22:19:00 Mid-City:City Heights Other closed 3949 33rd Street Destruction of property. Man camped on 33rd St...
1357 4066547 2023-01-16T22:21:00 Torrey Pines Pothole closed I-5, Del Mar, CA 92014, USA There is a big pot hole that made me pop my ti...
1358 4066564 2023-01-16T22:43:00 Southeastern San Diego Pothole closed CA-94, San Diego, CA 92102, USA Pothole is getting bigger with the rain and yo...
1359 4066565 2023-01-16T22:43:00 Greater Golden Hill Pothole closed CA-94, San Diego, CA 92102, USA Pothole on the off-ramp to 30th. Popped my tir...

1360 rows × 7 columns

In [36]:
bpd.read_csv('data/get-it-done-jan-16.csv').get('public_description').loc[561]
Out[36]:
'Sidewalk covered in mud'
In [37]:
bpd.read_csv('data/get-it-done-jan-16.csv').get('public_description').iloc[561]
Out[37]:
'Sidewalk covered in mud'

Reflection¶

Questions we can answer right now...¶

  • What is the largest number of open requests of one type in one neighborhood?
    • requests.get('open').max().
  • How many requests were made on January 16?
    • jan_16.shape[0].
  • What is the description of the latest request made on January 16?
    • jan_16.sort_values(by='date_requested', ascending=False).get('public_description').iloc[0].

Moving forward, let's just focus on the requests DataFrame. As a reminder, here's what it looks like:

In [38]:
requests
Out[38]:
neighborhood service closed open total
0 Balboa Park Dead Animal 4 0 4
1 Balboa Park Encampment 22 16 38
2 Balboa Park Environmental Services Code Compliance 0 1 1
3 Balboa Park Graffiti - Code Enforcement 0 1 1
4 Balboa Park Graffiti - Public 62 37 99
... ... ... ... ... ...
1057 Uptown Tree Maintenance 19 29 48
1058 Uptown Vegetation Encroachment 1 2 3
1059 Uptown Waste on Private Property 3 1 4
1060 Uptown Weed Cleanup 1 0 1
1061 Via De La Valle Pothole 0 2 2

1062 rows × 5 columns

Questions we can't yet answer...¶

  • Which neighborhood has the most 'Tree Maintenance' requests?
  • What is the most commonly requested service in the 'University' neighborhood (near UCSD)?
  • In the 'Downtown' neighborhood, how many open service requests are there?

The common thread between these questions is that they all involve only a subset of the rows in our DataFrame.

Example 5: Which neighborhood has the most 'Tree Maintenance' requests? 🌳¶

Key concept: Querying

Selecting rows¶

  • We could determine the neighborhood with the most 'Tree Maintenance' requests if we had a DataFrame consisting of only these type of requests.
    • We would sort by the 'total' column in descending order, then extract the neighborhood name in the first row.
  • How do we get that DataFrame?

The solution¶

In [39]:
requests[requests.get('service') == 'Tree Maintenance']
Out[39]:
neighborhood service closed open total
21 Balboa Park Tree Maintenance 2 7 9
41 Barrio Logan Tree Maintenance 1 2 3
57 Black Mountain Ranch Tree Maintenance 2 0 2
95 Carmel Valley Tree Maintenance 6 2 8
122 Clairemont Mesa Tree Maintenance 14 15 29
... ... ... ... ... ...
974 Torrey Highlands Tree Maintenance 0 1 1
985 Torrey Hills Tree Maintenance 1 1 2
1006 Torrey Pines Tree Maintenance 1 2 3
1029 University Tree Maintenance 5 9 14
1057 Uptown Tree Maintenance 19 29 48

43 rows × 5 columns

Use == to check equality. Not =, as that's for assignment of a value to a variable.

In [40]:
'Tree Maintenance' == 'Weed Cleanup'
Out[40]:
False
In [41]:
'Tree Maintenance' == 'Tree Maintenance'
Out[41]:
True

We can broadcast the equality check to each element of a Series. The comparison happens element-wise.

In [42]:
requests.get('service') == 'Tree Maintenance'
Out[42]:
0       False
1       False
2       False
3       False
4       False
        ...  
1057     True
1058    False
1059    False
1060    False
1061    False
Name: service, Length: 1062, dtype: bool

Aside: Booleans¶

  • When we compare two values, the result is either True or False.
    • Notice, these words are not in quotes.
  • bool is a data type in Python, just like int, float, and str.
    • It stands for "Boolean", named after George Boole, an early mathematician.
  • There are only two possible Boolean values: True or False.
    • Yes or no.
    • On or off.
    • 1 or 0.
In [43]:
5 == 6
Out[43]:
False
In [44]:
True
Out[44]:
True

Comparison operators¶

There are several types of comparisons we can make.

symbol meaning
== equal to
!= not equal to
< less than
<= less than or equal to
> greater than
>= greater than or equal to

When comparing an entire Series to a value, the result is a Series of bools.

In [45]:
requests.get('service') == 'Tree Maintenance'
Out[45]:
0       False
1       False
2       False
3       False
4       False
        ...  
1057     True
1058    False
1059    False
1060    False
1061    False
Name: service, Length: 1062, dtype: bool

What is a query? 🤔¶

  • A "query" is code that extracts rows from a DataFrame for which certain condition(s) are true.
  • We often use queries to filter DataFrames so that they only contain the rows that satisfy the conditions stated in our questions.

How do we query a DataFrame?¶

To select only certain rows of requests:

  1. Make a sequence (list/array/Series) of Trues (keep) and Falses (toss), usually by making a comparison.
  2. Then pass it into requests[sequence_goes_here].
In [46]:
requests[requests.get('service') == 'Tree Maintenance']
Out[46]:
neighborhood service closed open total
21 Balboa Park Tree Maintenance 2 7 9
41 Barrio Logan Tree Maintenance 1 2 3
57 Black Mountain Ranch Tree Maintenance 2 0 2
95 Carmel Valley Tree Maintenance 6 2 8
122 Clairemont Mesa Tree Maintenance 14 15 29
... ... ... ... ... ...
974 Torrey Highlands Tree Maintenance 0 1 1
985 Torrey Hills Tree Maintenance 1 1 2
1006 Torrey Pines Tree Maintenance 1 2 3
1029 University Tree Maintenance 5 9 14
1057 Uptown Tree Maintenance 19 29 48

43 rows × 5 columns

Another example of element-wise comparison¶

This time, we'll find the neighborhoods and services with at least one open request.

In [47]:
requests
Out[47]:
neighborhood service closed open total
0 Balboa Park Dead Animal 4 0 4
1 Balboa Park Encampment 22 16 38
2 Balboa Park Environmental Services Code Compliance 0 1 1
3 Balboa Park Graffiti - Code Enforcement 0 1 1
4 Balboa Park Graffiti - Public 62 37 99
... ... ... ... ... ...
1057 Uptown Tree Maintenance 19 29 48
1058 Uptown Vegetation Encroachment 1 2 3
1059 Uptown Waste on Private Property 3 1 4
1060 Uptown Weed Cleanup 1 0 1
1061 Via De La Valle Pothole 0 2 2

1062 rows × 5 columns

In [48]:
requests.get('open') > 1
Out[48]:
0       False
1        True
2       False
3       False
4        True
        ...  
1057     True
1058     True
1059    False
1060    False
1061     True
Name: open, Length: 1062, dtype: bool
In [49]:
requests[requests.get('open') > 1]
Out[49]:
neighborhood service closed open total
1 Balboa Park Encampment 22 16 38
4 Balboa Park Graffiti - Public 62 37 99
5 Balboa Park Illegal Dumping 3 4 7
6 Balboa Park Other 3 4 7
7 Balboa Park Parking 2 2 4
... ... ... ... ... ...
1055 Uptown Traffic Signal Timing 4 2 6
1056 Uptown Trash/Recycling Collection 1 2 3
1057 Uptown Tree Maintenance 19 29 48
1058 Uptown Vegetation Encroachment 1 2 3
1061 Via De La Valle Pothole 0 2 2

644 rows × 5 columns

Original Question: Which neighborhood has the most 'Tree Maintenance' requests?¶

Strategy:

  1. Extract a DataFrame of just the 'Tree Maintenance' requests.
  2. Sort by 'total' in descending order.
  3. Return the first element in the 'neighborhood' column.
In [50]:
tree_maintenance_only = requests[requests.get('service') == 'Tree Maintenance']
tree_maintenance_sorted = tree_maintenance_only.sort_values(by='total', ascending=False)
tree_maintenance_sorted
Out[50]:
neighborhood service closed open total
935 Southeastern San Diego Tree Maintenance 12 44 56
596 North Park Tree Maintenance 24 30 54
1057 Uptown Tree Maintenance 19 29 48
179 Downtown Tree Maintenance 15 28 43
712 Pacific Beach Tree Maintenance 22 15 37
... ... ... ... ... ...
497 Mission Bay Park Tree Maintenance 0 1 1
963 Tijuana River Valley Tree Maintenance 0 1 1
974 Torrey Highlands Tree Maintenance 0 1 1
638 Old Town San Diego Tree Maintenance 1 0 1
518 Mission Beach Tree Maintenance 1 0 1

43 rows × 5 columns

In [51]:
tree_maintenance_sorted.get('neighborhood').iloc[0]
Out[51]:
'Southeastern San Diego'

What if the condition isn't satisfied?¶

In [52]:
requests[requests.get('service') == 'Car Maintenance']
Out[52]:
neighborhood service closed open total

Concept Check ✅ – Answer at cc.dsc10.com¶

Which expression below evaluates to the total number of service requests in the 'Downtown' neighborhood?

A. requests[requests.get('neighborhood') == 'Downtown'].get('total').sum()

B. requests.get('total').sum()[requests.get('neighborhood') == 'Downtown']

C. requests['Downtown'].get('total').sum()

D. More than one of the above.

In [53]:
...
Out[53]:
Ellipsis

Activity 🚘¶

Question: What is the most commonly requested service in the 'University' neighborhood (near UCSD)?

Write one line of code that evaluates to the answer.

In [54]:
...
Out[54]:
Ellipsis

Summary¶

Summary¶

  • We learned many DataFrame methods and techniques.
  • Don't feel the need to memorize them all right away.
  • Instead, refer to this lecture, the DSC 10 reference sheet, the babypandas notes, and the babypandas documentation when working on assignments.
  • Over time, these techniques will become more and more familiar.
  • Practice! Frame your own questions using this dataset and try to answer them.

Next time¶

  • We'll answer more complicated questions, which will lead us to a new core DataFrame method, .groupby, for organizing rows of a DataFrame with the same value in a particular column.
  • For example, we might want to organize the data by neighborhood, collecting all the different service requests for each neighborhood.