Lecture 5 – DataFrames: Accessing, Sorting, and Querying

DSC 10, Fall 2022

Announcements

Agenda

Today, we'll use a real dataset and lots of motivating questions to illustrate key DataFrame manipulation techniques.

Note:

DataFrames

pandas

But pandas is not so cute...

Enter babypandas!

DataFrames in babypandas 🐼

About the Data: Get It Done 👷

Reading data from a file 📖

We can read in a CSV using bpd.read_csv(...). Give it the path to a file relative to your notebook (if the file is in the same folder as your notebook, this is just the name of the file).

Structure of a DataFrame

Setting a new index

Shape of a DataFrame

Annual summary of Get It Done requests

Example 1: Total requests

Key concepts: Accessing columns, performing operations with them, and adding new columns.

Finding total requests

Step 1 – Getting a column

Digression: Series

Step 2 – Getting another column

Step 3 – Calculating the total

Step 4 – Adding the totals to the DataFrame as a new column

Example 2: Analyzing requests

Key concept: Computing statistics of columns using Series methods.

Questions

Series, like arrays, have helpful methods, including .min(), .max(), and .mean().

What is it that people are reporting so frequently, and where? We'll see how to find out shortly!

Other statistics:

Example 3: What and where is the most frequently requested service?

Key concepts: Sorting. Accessing using integer positions.

Step 1 – Sorting the DataFrame

This sorts, but in ascending order (small to large). We want the opposite!

Step 1 – Sorting the DataFrame in descending order

Step 2 – Extracting the neighborhood and service

Example 4: Status of a request

Key concept: Accessing using row labels.

Status of a request

Your service request is buried in the middle of the DataFrame. Only the first few rows and last few rows are shown, so you can't tell just by looking at the DataFrame.

Accessing using the row label

To pull out one particular entry of a DataFrame corresponding to a row and column with certain labels:

  1. Use .get(column_name) to extract the entire column as a Series.
  2. Use .loc[] to access the element of a Series with a particular row label.

In this class, we'll always get a column, then a row (but row, then column is also possible).

Activity 🚚

Oh no, your service request 3940652 has still not been resolved! What was the problem again?

Write one line of code that evaluates to the full description of the problem, as you described it in your service request.

Summary of accessing a Series

Note

Reflection

Questions we can answer right now...

Moving forward, let's just focus on the requests DataFrame. As a reminder, here's what it looks like:

Questions we can't yet answer...

The common thread between these questions is that they all involve only a subset of the rows in our DataFrame.

Example 6: Which neighborhood has the most 'Weed Cleanup' requests?

Key concept: Selecting rows (via Boolean indexing).

Selecting rows

The solution

Use == to check equality. Not =, as that's for assignment of a value to a variable.

We can broadcast the equality check to each element of a Series. The comparison happens element-wise.

Boolean indexing

To select only some rows of requests:

  1. Make a sequence (list/array/Series) of Trues (keep) and Falses (toss).

    • The values True and False are of the Boolean data type.
  2. Then pass it into requests[sequence_goes_here].

Rather than making the sequence by hand, we usually generate it by making a comparison.

Another example of element-wise comparison

Comparisons can check inequality as well as equality.

Original Question: Which neighborhood has the most 'Weed Cleanup' requests?

Strategy:

  1. Extract a DataFrame of just the 'Weed Cleanup' requests.
  2. Sort by 'total' in descending order.
  3. Return the first element in the 'neighborhood' column.

What if the condition isn't satisfied?

Concept Check ✅ – Answer at cc.dsc10.com

Which expression below evaluates to the total number of service requests in the 'Downtown' neighborhood?

A. requests[requests.get('neighborhood') == 'Downtown'].get('total').sum()

B. requests.get('total').sum()[requests.get('neighborhood') == 'Downtown']

C. requests['Downtown'].get('total').sum()

D. More than one of the above.

Activity 🚘

Question: What is the most commonly requested service in the 'University' neighborhood (near UCSD)?

Write one line of code that evaluates to the answer.

Summary

Summary

Next time

We'll answer more complicated questions, which will lead us to a new core DataFrame method, .groupby.