Lecture 2 – Pandas 🐼

DSC 80, Spring 2022

Announcements

Agenda

The data science lifecycle

Recap: City of San Diego salary data

Our dataset is downloaded from Transparent California.

Question: Does gender influence pay?

Social Security Administration baby names 👶

We began compiling the baby name list in 1997, with names dating back to 1880. At the time of a child’s birth, parents supply the name to us when applying for a child’s Social Security card, thus making Social Security America’s source for the most popular baby names. Please share this with your friends and family—and help us spread the word on social media. - Social Security’s Top Baby Names for 2020

Exploring names

Data Modeling

Determining the most common gender for each name
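
As a minimal sketch of this step, the block below totals the babies per name and gender in a tiny made-up table and keeps the gender with the larger total. The column names 'firstname', 'gender', and 'count' are assumptions for illustration, as is the output name genders.

```python
import pandas as pd

# Hypothetical toy version of the names data; column names are assumptions.
names = pd.DataFrame({
    'firstname': ['Avery', 'Avery', 'Billy', 'Billy', 'Maria'],
    'gender':    ['F', 'M', 'F', 'M', 'F'],
    'count':     [500, 200, 10, 900, 700],
})

# Total babies per name, with one column per gender (0 where a pair never occurs).
counts = names.pivot_table(
    index='firstname', columns='gender', values='count',
    aggfunc='sum', fill_value=0,
)

# For each name, take the label of the gender column with the larger total.
genders = counts.idxmax(axis=1).rename('gender').reset_index()
print(genders)
```

Ties are resolved in favor of the first column, which is one of several reasonable choices.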

Adding a 'gender' column to salaries

This involves two steps:

  1. Extracting just the first name from 'Employee Name'.
  2. Merging salaries and genders.
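
The two steps above can be sketched on made-up data as follows. Only the 'Employee Name' column comes from the text; the other column names, the toy rows, and the choice of a left merge are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy tables; only 'Employee Name' is named in the text.
salaries = pd.DataFrame({
    'Employee Name': ['Maria Lopez', 'Billy Chen', 'Zyx Quill'],
    'Total Pay': [90000, 85000, 70000],
})
genders = pd.DataFrame({
    'firstname': ['Maria', 'Billy'],
    'gender': ['F', 'M'],
})

# Step 1: take the first whitespace-separated token as the first name.
salaries_with_first = salaries.assign(
    firstname=salaries['Employee Name'].str.split().str[0]
)

# Step 2: merge on the first name. A left merge (an assumption here) keeps
# employees whose names have no match; their 'gender' is NaN.
salaries_with_gender = salaries_with_first.merge(genders, on='firstname', how='left')
print(salaries_with_gender)
```

With the default inner merge, unmatched employees would instead be dropped silently.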

Predictions and Inference

Question: Does gender influence pay?

This was our original question. Let's find out!

A hypothesis test

Strategy:

Running the hypothesis test
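
The strategy itself is not spelled out here, so the following is only a sketch of one common approach, a permutation test on the difference in median pay. The test statistic, the data, and all names below are assumptions, not necessarily what was used in lecture.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Made-up data for illustration only.
df = pd.DataFrame({
    'gender': ['F'] * 4 + ['M'] * 4,
    'Total Pay': [50, 55, 60, 65, 70, 75, 80, 85],
})

def diff_in_medians(frame):
    """Median pay of group 'M' minus median pay of group 'F'."""
    medians = frame.groupby('gender')['Total Pay'].median()
    return medians['M'] - medians['F']

observed = diff_in_medians(df)

# Under the null hypothesis the labels are exchangeable, so shuffle them
# and recompute the statistic many times.
n_repetitions = 1000
simulated = np.empty(n_repetitions)
for i in range(n_repetitions):
    shuffled = df.assign(gender=rng.permutation(df['gender'].to_numpy()))
    simulated[i] = diff_in_medians(shuffled)

# Fraction of shuffles at least as extreme as what we observed.
p_value = np.mean(simulated >= observed)
print(observed, p_value)
```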

Even more questions...

While trying to answer one question, many more popped up.

Is our dataset representative of all San Diego employees?

How reliable is our join between salaries and names?

Lesson: joining to another dataset can bias your sample!
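
One way to check how much a join changes the sample is to use the indicator argument of merge. The sketch below uses made-up tables and assumed column names; it shows that the rows that survive the match can look quite different from the full dataset.

```python
import pandas as pd

# Hypothetical toy tables; column names are assumptions.
salaries = pd.DataFrame({
    'firstname': ['Maria', 'Billy', 'Zyx', 'Qwen'],
    'Total Pay': [90000, 85000, 70000, 40000],
})
genders = pd.DataFrame({'firstname': ['Maria', 'Billy'], 'gender': ['F', 'M']})

# indicator=True adds a '_merge' column saying where each row came from.
checked = salaries.merge(genders, on='firstname', how='left', indicator=True)
matched = checked['_merge'] == 'both'

match_rate = matched.mean()
mean_pay_all = salaries['Total Pay'].mean()
mean_pay_matched = checked.loc[matched, 'Total Pay'].mean()
print(match_rate, mean_pay_all, mean_pay_matched)
```

In this toy example only half of the employees match, and the matched half has a higher mean pay than the full table.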

Introduction to pandas

pandas

pandas data structures

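There are three key data structures at the core of pandas: the DataFrame (a 2-dimensional table), the Series (a 1-dimensional labeled array, typically a single column or row), and the Index (the sequence of row or column labels).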

pandas is almost always imported in conjunction with numpy. We've already run these imports at the top of the notebook, so we won't repeat them here:

import pandas as pd
import numpy as np

Series are "slices"
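
As a small illustration, pulling a single column or a single row out of a DataFrame gives back a Series. The DataFrame below is made up for this sketch.

```python
import pandas as pd

# Hypothetical toy DataFrame with string row labels.
df = pd.DataFrame(
    {'Name': ['Granger', 'Luna'], 'LVL': [3, 2]},
    index=['row0', 'row1'],
)

column_slice = df['Name']     # one column -> Series indexed by row labels
row_slice = df.loc['row0']    # one row -> Series indexed by column labels

print(type(column_slice).__name__, type(row_slice).__name__)
```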

Initializing a Series
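
A minimal sketch of a few common ways to build a Series; the values and labels are made up.

```python
import pandas as pd

# From a list: the index defaults to 0, 1, 2, ...
s_default = pd.Series([10, 20, 30])

# From a list with explicit labels and a name.
s_labeled = pd.Series([10, 20, 30], index=['a', 'b', 'c'], name='points')

# From a dictionary: keys become the index.
s_from_dict = pd.Series({'a': 10, 'b': 20, 'c': 30})

print(s_labeled)
```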

Initializing a DataFrame

Method 1: Using a list of rows

By default, the column names are set to 0, 1, 2, ...

You can change that using the columns argument.
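
For example, with a made-up list of rows:

```python
import pandas as pd

# Each inner list is one row (hypothetical data).
rows = [['Granger', 'junior'], ['Luna', 'sophomore']]

df_default = pd.DataFrame(rows)                          # columns are 0, 1
df_named = pd.DataFrame(rows, columns=['Name', 'Level']) # columns are named

print(df_named)
```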

Method 2: Using a dictionary of columns
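
A minimal sketch with made-up data: each dictionary key becomes a column label and each value holds that column's entries.

```python
import pandas as pd

# Keys are column names; values are the column contents (hypothetical data).
df_from_dict = pd.DataFrame({
    'Name': ['Granger', 'Luna'],
    'Level': ['junior', 'sophomore'],
})

print(df_from_dict)
```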

DataFrame index and column labels
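
As a short sketch on made-up data, the row labels live in df.index and the column labels in df.columns, and set_index promotes a column to be the row labels.

```python
import pandas as pd

# Hypothetical toy DataFrame.
df = pd.DataFrame({'Name': ['Granger', 'Luna'], 'Level': [3, 2]})

row_labels = df.index       # default row labels: 0, 1
col_labels = df.columns     # column labels: 'Name', 'Level'

# Use the 'Name' column as the row labels instead (returns a new DataFrame).
by_name = df.set_index('Name')
print(by_name)
```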

Axis

DataFrame methods with axis

If we specify axis=0, A.sum will "compress" along axis 0 (down the rows), returning one sum per column and keeping the column labels intact.

If we specify axis=1, A.sum will "compress" along axis 1 (across the columns), returning one sum per row and keeping the row labels (index) intact.
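
A small sketch of both cases; the contents of A here are made up for illustration.

```python
import pandas as pd

# Hypothetical 2x3 DataFrame.
A = pd.DataFrame(
    [[1, 2, 3],
     [4, 5, 6]],
    index=['r0', 'r1'],
    columns=['a', 'b', 'c'],
)

col_sums = A.sum(axis=0)   # indexed by column labels: a=5, b=7, c=9
row_sums = A.sum(axis=1)   # indexed by row labels: r0=6, r1=15

print(col_sums)
print(row_sums)
```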

Selecting rows and columns using [] and loc

Throwback to babypandas 👶

Selecting columns with []

Selecting columns with attribute notation
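
A minimal sketch of both styles of column selection. The enrollments table below is made up, and apart from 'Name' its column names are assumptions.

```python
import pandas as pd

# Hypothetical toy table.
enrollments = pd.DataFrame({
    'Name': ['Granger', 'Luna', 'Zane'],
    'LVL': [3, 2, 4],
})

name_series = enrollments['Name']        # a single label -> Series
name_frame = enrollments[['Name']]       # a list of labels -> DataFrame
both = enrollments[['Name', 'LVL']]      # several columns -> DataFrame

# Attribute notation returns the same Series as enrollments['Name'].
name_attr = enrollments.Name
```

Attribute notation only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method, so [] is the safer default.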

Selecting rows with loc

If df is a DataFrame, then:
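
As a sketch of label-based row selection on a made-up df (the labels and values are assumptions):

```python
import pandas as pd

# Hypothetical toy DataFrame with string row labels.
df = pd.DataFrame({'x': [1, 2, 3]}, index=['a', 'b', 'c'])

one_row = df.loc['a']             # single label -> Series
some_rows = df.loc[['a', 'c']]    # list of labels -> DataFrame
row_range = df.loc['a':'b']       # label slice, inclusive of both ends

print(row_range)
```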

Boolean sequence selection

Querying

When using a Boolean sequence, e.g. enrollments['Name'] < 'M', loc is not strictly necessary:
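
For instance, on a made-up enrollments table, both forms below select the same rows:

```python
import pandas as pd

# Hypothetical toy table; only the 'Name' column is named in the text.
enrollments = pd.DataFrame({
    'Name': ['Granger', 'Luna', 'Zane'],
    'LVL': [3, 2, 4],
})

# A Boolean Series: True where the name sorts before 'M'.
mask = enrollments['Name'] < 'M'

with_loc = enrollments.loc[mask]
without_loc = enrollments[mask]

print(with_loc)
```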

Selecting columns and rows simultaneously

So far, we used [] to select columns and loc to select rows.

loc can also be used to select both rows and columns. The general pattern is:

df.loc[<row selector>, <column selector>]

Examples:
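
A few sketches of the pattern on a made-up enrollments table; the 'PID' and 'LVL' columns and all values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy table.
enrollments = pd.DataFrame({
    'Name': ['Granger', 'Luna', 'Zane'],
    'PID': ['A1', 'A2', 'A3'],
    'LVL': [3, 2, 4],
})

# Boolean row selector, single column label -> Series.
upper_names = enrollments.loc[enrollments['LVL'] >= 3, 'Name']

# Boolean row selector, list of column labels -> DataFrame.
subset = enrollments.loc[enrollments['Name'] < 'M', ['Name', 'LVL']]

# Single row label, single column label -> one value.
one_value = enrollments.loc[0, 'Name']
```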

Even more ways of selecting rows and columns

In df.loc[<row selector>, <column selector>]:

There are many more; see the pandas documentation for details.
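
As a sketch, here are a few selector types that loc accepts beyond single labels and Boolean sequences. The df below is made up for illustration.

```python
import pandas as pd

# Hypothetical toy DataFrame with string row labels.
df = pd.DataFrame(
    {'Name': ['Granger', 'Luna', 'Zane'],
     'PID': ['A1', 'A2', 'A3'],
     'LVL': [3, 2, 4]},
    index=['a', 'b', 'c'],
)

# A bare colon selects everything along that axis.
all_rows_two_cols = df.loc[:, ['Name', 'LVL']]

# Label slices work for rows and columns, and include both endpoints.
block = df.loc['a':'b', 'Name':'PID']

# A callable receives the DataFrame and returns a selector.
via_callable = df.loc[lambda d: d['LVL'] > 2, 'Name']
```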

Don't forget iloc!
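
A minimal sketch on made-up data: iloc selects by integer position rather than by label, and its slices exclude the end position, unlike loc.

```python
import pandas as pd

# Hypothetical toy DataFrame with string row labels.
df = pd.DataFrame(
    {'Name': ['Granger', 'Luna', 'Zane'], 'LVL': [3, 2, 4]},
    index=['a', 'b', 'c'],
)

first_row = df.iloc[0]              # row at position 0 -> Series
first_two_names = df.iloc[0:2, 0]   # positions 0 and 1 of column 0
last_cell = df.iloc[-1, -1]         # last row, last column

print(first_two_names)
```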

Discussion Question

Let's return to the names DataFrame.

Question: How many babies were born with the name 'Billy' and gender 'M'?
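
One possible approach is sketched below on made-up data. It assumes the table has one row per name, gender, and year with a 'count' column, so the answer is a sum of counts rather than a number of rows. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical toy version of names; column names are assumptions.
names = pd.DataFrame({
    'firstname': ['Billy', 'Billy', 'Billy', 'Maria'],
    'gender':    ['M', 'M', 'F', 'F'],
    'count':     [100, 150, 5, 300],
    'year':      [1990, 1991, 1990, 1990],
})

# Combine two Boolean Series with &; each comparison needs its own parentheses.
is_billy_m = (names['firstname'] == 'Billy') & (names['gender'] == 'M')

# Sum the counts over all matching rows.
billy_m_total = names.loc[is_billy_m, 'count'].sum()
print(billy_m_total)
```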

More Practice

Consider the DataFrame below.

For each of the following pieces of code, predict what the output will be. Then, uncomment the line of code and see for yourself.

Summary, next time

Summary