Lecture 2 – DataFrame Fundamentals

DSC 80, Spring 2023

Announcements 📣

Agenda

Remember, we are not going to cover every single detail! The pandas documentation will be your friend.

Introduction to pandas 🐼

Baby pandas

pandas

pandas data structures

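There are three key data structures at the core of pandas: the DataFrame (a two-dimensional table), the Series (a one-dimensional array-like object, typically a single row or column), and the Index (a sequence of row or column labels).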

We've already run this at the top of the notebook, so we won't repeat it here. But pandas is almost always imported in conjunction with numpy:


import pandas as pd
import numpy as np
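
To make the core data structures concrete, here is a minimal sketch that builds a tiny table and pulls a Series and two Index objects out of it. The table, its column names, and its values are illustrative placeholders, not the lecture's dataset.

```python
import numpy as np
import pandas as pd

# Toy table; the column names and values are illustrative only.
df = pd.DataFrame({
    "Name": ["UC Berkeley", "UC San Diego"],
    "Founded": [1868, 1960],
})

founded = df["Founded"]   # a Series: one column of values plus its labels
row_labels = df.index     # an Index holding the row labels
col_labels = df.columns   # an Index holding the column labels

print(type(df).__name__, type(founded).__name__)
print(list(row_labels), list(col_labels))

# to_numpy() hands back the underlying values as a numpy array,
# which is one reason the two libraries are imported together.
values = founded.to_numpy()
print(type(values).__name__)
```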

Example: Universities in California 📚

To refresh our memory on the basics of pandas, let's work with a dataset that contains the name, location, enrollment, and founding date of most UCs and CSUs.

Exploring a new DataFrame

To extract the first or last few rows of a DataFrame, use the head or tail methods.

The shape attribute returns the DataFrame's number of rows and columns.
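
A quick sketch of head, tail, and shape on a toy stand-in for schools. The real dataset is loaded elsewhere in the notebook, so the column names and values below are assumptions made only for illustration.

```python
import pandas as pd

# Toy stand-in for the lecture's schools table; every value is an illustrative placeholder.
schools = pd.DataFrame({
    "Name": ["UC Berkeley", "UC San Diego", "UC Merced", "San Diego State"],
    "City": ["Berkeley", "La Jolla", "Merced", "San Diego"],
    "County": ["Alameda", "San Diego", "Merced", "San Diego"],
    "Enrollment": [45000, 42000, 9000, 36000],
    "Founded": [1868, 1960, 2005, 1897],
})

first_two = schools.head(2)      # first 2 rows; head() with no argument returns 5
last_one = schools.tail(1)       # last row
n_rows, n_cols = schools.shape   # shape is an attribute, so no parentheses

print(first_two)
print(last_one)
print(n_rows, n_cols)
```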

The anatomy of a DataFrame

Each row and column of a DataFrame is a Series.
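
The sketch below checks that claim on a toy table (illustrative values only). Note how a column Series is labeled by the row labels, while a row Series is labeled by the column names.

```python
import pandas as pd

# Toy table; names and values are illustrative placeholders.
schools = pd.DataFrame({
    "Name": ["UC Berkeley", "UC San Diego", "UC Merced"],
    "Enrollment": [45000, 42000, 9000],
    "Founded": [1868, 1960, 2005],
})

column = schools["Founded"]   # one column, as a Series
row = schools.iloc[0]         # the first row, also a Series

print(type(column).__name__, type(row).__name__)
print(list(column.index))     # the DataFrame's row labels
print(list(row.index))        # the DataFrame's column labels
```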

Sorting

The order of the rows in schools does not seem to be meaningful right now. To sort by a column, use the sort_values method. Like most DataFrame and Series methods, sort_values returns a new DataFrame, and doesn't modify the original.
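
As a small sketch of this behavior, the code below sorts a toy table (illustrative values, not the lecture's data) and confirms that the original is left alone.

```python
import pandas as pd

# Toy table; names and values are illustrative placeholders.
schools = pd.DataFrame({
    "Name": ["UC Berkeley", "UC San Diego", "UC Merced", "San Diego State"],
    "Founded": [1868, 1960, 2005, 1897],
})

# sort_values returns a new, sorted DataFrame.
oldest_first = schools.sort_values("Founded")
newest_first = schools.sort_values("Founded", ascending=False)

print(oldest_first)
print(newest_first)
print(schools)   # still in its original order
```

To keep a sorted version around, assign the result to a name, as done with oldest_first above.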

Setting the index

Think of each row's index as its unique identifier or name. Often, we like to set the index of a DataFrame to a unique identifier if we have one available. We can do so with the set_index method.
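
Here is a minimal sketch of set_index on a toy table (illustrative values only). After the call, 'Name' is no longer a regular column; it labels the rows instead.

```python
import pandas as pd

# Toy table; names and values are illustrative placeholders.
schools = pd.DataFrame({
    "Name": ["UC Berkeley", "UC San Diego", "UC Merced"],
    "City": ["Berkeley", "La Jolla", "Merced"],
    "Founded": [1868, 1960, 2005],
})

# set_index also returns a new DataFrame rather than modifying schools.
by_name = schools.set_index("Name")

print(by_name)
print(list(by_name.columns))      # 'Name' is gone from the columns
print(by_name.index.is_unique)    # worth checking before relying on labels
```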

Selecting columns

Selecting columns in babypandas 👶🐼

Selecting columns with []
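
A short sketch of column selection with [] on a toy table (illustrative values only). A single column name gives back a Series, while a list of names gives back a DataFrame. The get method, familiar from babypandas, is also available in pandas.

```python
import pandas as pd

# Toy table; names and values are illustrative placeholders.
schools = pd.DataFrame({
    "Name": ["UC Berkeley", "UC San Diego", "UC Merced"],
    "City": ["Berkeley", "La Jolla", "Merced"],
    "Founded": [1868, 1960, 2005],
})

cities = schools["City"]                 # string -> Series
subset = schools[["City", "Founded"]]    # list of strings -> DataFrame
one_col_df = schools[["City"]]           # one-element list -> still a DataFrame

print(type(cities).__name__, type(subset).__name__, type(one_col_df).__name__)

# babypandas-style selection gives the same Series.
same = schools.get("City").equals(cities)
print(same)
```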

Useful Series methods

There are a variety of useful methods that work on Series. The pandas documentation for Series lists all of them. As we'll see next lecture, many of these methods work on DataFrames directly, too. How?
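
A few Series methods in action, as a sketch on toy data (the enrollment figures are made-up placeholders). This is only a small sample of what is available.

```python
import pandas as pd

# Toy table; names and values are illustrative placeholders.
schools = pd.DataFrame({
    "Name": ["UC Berkeley", "UC San Diego", "UC Merced", "San Diego State"],
    "County": ["Alameda", "San Diego", "Merced", "San Diego"],
    "Enrollment": [45000, 42000, 9000, 36000],
}).set_index("Name")

enrollment = schools["Enrollment"]

average = enrollment.mean()       # arithmetic mean of the values
largest = enrollment.max()        # largest value
largest_at = enrollment.idxmax()  # index label where the largest value sits
county_counts = schools["County"].value_counts()  # how often each value appears

print(average, largest, largest_at)
print(county_counts)
```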

Selecting rows (and columns)

Using loc to select rows using row labels

If df is a DataFrame, then:
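
As a sketch of the row selectors that loc accepts, the code below passes it a single label, a list of labels, and a Boolean list. The table is a toy stand-in with illustrative values, and df here is just that toy table with 'Name' set as the index.

```python
import pandas as pd

# Toy table; names and values are illustrative placeholders.
df = pd.DataFrame({
    "Name": ["UC Berkeley", "UC San Diego", "UC Merced", "San Diego State"],
    "City": ["Berkeley", "La Jolla", "Merced", "San Diego"],
    "Founded": [1868, 1960, 2005, 1897],
}).set_index("Name")

one_row = df.loc["UC Merced"]                     # single label -> Series
two_rows = df.loc[["UC Merced", "UC Berkeley"]]   # list of labels -> DataFrame, in the order given
kept = df.loc[[True, False, False, True]]         # Boolean list -> rows where the entry is True

print(one_row)
print(two_rows)
print(kept)
```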

Boolean indexing
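
A minimal sketch of Boolean indexing on a toy table (illustrative values only): a comparison on a Series produces a Boolean Series, which then picks out the rows where it is True.

```python
import pandas as pd

# Toy table; names and values are illustrative placeholders.
schools = pd.DataFrame({
    "Name": ["UC Berkeley", "UC San Diego", "UC Merced", "San Diego State"],
    "County": ["Alameda", "San Diego", "Merced", "San Diego"],
    "Founded": [1868, 1960, 2005, 1897],
})

is_recent = schools["Founded"] > 1950   # Boolean Series, one entry per row
recent = schools.loc[is_recent]         # schools[is_recent] selects the same rows

# Combine conditions with & (and) or | (or); each condition needs its own parentheses.
recent_sd = schools.loc[(schools["Founded"] > 1950) & (schools["County"] == "San Diego")]

print(is_recent)
print(recent)
print(recent_sd)
```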

Querying

Note that because we set the index to 'Name' earlier, we can select rows based on school names without having to query.

If 'Name' were instead a column, then we'd need to query to access information about a particular school.
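
To see the difference, here is a sketch of both situations on a toy table (illustrative values only): a label lookup when 'Name' is the index, and a query when 'Name' is a regular column.

```python
import pandas as pd

# Toy table; names and values are illustrative placeholders.
schools = pd.DataFrame({
    "Name": ["UC Berkeley", "UC San Diego", "UC Merced"],
    "City": ["Berkeley", "La Jolla", "Merced"],
    "Founded": [1868, 1960, 2005],
})

# 'Name' as the index: look the school up by its label.
by_name = schools.set_index("Name")
ucsd_row = by_name.loc["UC San Diego"]              # a Series

# 'Name' as a column: build a Boolean Series and query with it.
ucsd_df = schools[schools["Name"] == "UC San Diego"]   # a one-row DataFrame

print(ucsd_row)
print(ucsd_df)
```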

Discussion Question

Write an expression that evaluates to the number of UC schools founded after 1950.
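
One possible approach, sketched on a toy table. This assumes UC names start with 'UC' and that the founding year lives in a column called 'Founded'; both are assumptions, and the real dataset may spell school names differently, in which case the string condition would need to change.

```python
import pandas as pd

# Toy table; names and values are illustrative placeholders.
schools = pd.DataFrame({
    "Name": ["UC Berkeley", "UC San Diego", "UC Merced", "San Diego State"],
    "Founded": [1868, 1960, 2005, 1897],
})

is_uc = schools["Name"].str.startswith("UC")   # assumption about how UCs are named
is_recent = schools["Founded"] > 1950

# Keep the rows satisfying both conditions, then count them.
num_recent_ucs = schools[is_uc & is_recent].shape[0]
print(num_recent_ucs)
```

Since True counts as 1 when summed, (is_uc & is_recent).sum() gives the same count.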

Selecting columns and rows simultaneously

So far, we used [] to select columns and loc to select rows.

For instance, to find the cities for all schools in San Diego county:
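
A sketch of that two-step approach on a toy table, assuming columns named 'County' and 'City' (illustrative values only): loc picks the rows first, then [] picks the column.

```python
import pandas as pd

# Toy table; names and values are illustrative placeholders.
schools = pd.DataFrame({
    "Name": ["UC Berkeley", "UC San Diego", "UC Merced", "San Diego State"],
    "City": ["Berkeley", "La Jolla", "Merced", "San Diego"],
    "County": ["Alameda", "San Diego", "Merced", "San Diego"],
}).set_index("Name")

# Step 1: rows in San Diego county. Step 2: just the 'City' column.
sd_cities = schools.loc[schools["County"] == "San Diego"]["City"]
print(sd_cities)
```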

Selecting columns and rows simultaneously

loc can also be used to select both rows and columns. The general pattern is:

df.loc[<row selector>, <column selector>]

Examples:
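
The following sketch runs a few row and column selector combinations on a toy table (illustrative values only; the column names are assumptions).

```python
import pandas as pd

# Toy table; names and values are illustrative placeholders.
schools = pd.DataFrame({
    "Name": ["UC Berkeley", "UC San Diego", "UC Merced", "San Diego State"],
    "City": ["Berkeley", "La Jolla", "Merced", "San Diego"],
    "County": ["Alameda", "San Diego", "Merced", "San Diego"],
    "Enrollment": [45000, 42000, 9000, 36000],
    "Founded": [1868, 1960, 2005, 1897],
}).set_index("Name")

# Single row label, single column label -> one value.
ucsd_founded = schools.loc["UC San Diego", "Founded"]

# Boolean Series for rows, single column label -> Series.
sd_cities = schools.loc[schools["County"] == "San Diego", "City"]

# List of row labels, list of column labels -> DataFrame.
small = schools.loc[["UC Merced", "UC Berkeley"], ["City", "Founded"]]

# ':' keeps every row; a label slice includes both endpoints.
middle_cols = schools.loc[:, "City":"Enrollment"]

print(ucsd_founded)
print(sd_cities)
print(small)
print(middle_cols)
```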

There are many, many more ways to use loc; see the pandas documentation for details.

Don't forget iloc!

iloc is often most useful when we sort first. For instance, to find the enrollment of the youngest school in the dataset:
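
A sketch of the sort-then-iloc pattern on a toy table (the enrollment figures are made-up placeholders). iloc selects by integer position, so after sorting, position 0 is whichever row came out on top.

```python
import pandas as pd

# Toy table; names and values are illustrative placeholders.
schools = pd.DataFrame({
    "Name": ["UC Berkeley", "UC San Diego", "UC Merced", "San Diego State"],
    "Enrollment": [45000, 42000, 9000, 36000],
    "Founded": [1868, 1960, 2005, 1897],
}).set_index("Name")

# The youngest school has the largest founding year, so sort in descending order.
youngest_first = schools.sort_values("Founded", ascending=False)

youngest_name = youngest_first.index[0]
youngest_enrollment = youngest_first["Enrollment"].iloc[0]   # position 0, whatever its label

print(youngest_name, youngest_enrollment)
```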

More Practice

Consider the DataFrame below.

For each of the following pieces of code, predict what the output will be. Then, uncomment the line of code and see for yourself. We may not be able to cover all of these in class; if so, make sure to try them on your own.

Summary, next time

Summary

Next time