Lecture 3 – More Pandas 🐼🐼

DSC 80, Spring 2022

Announcements

Agenda

loc and iloc

Selecting columns and rows simultaneously

So far, we used [] to select columns and loc to select rows.

Selecting columns and rows simultaneously

loc can also be used to select both rows and columns. The general pattern is:

df.loc[<row selector>, <column selector>]

Examples:

Even more ways of selecting rows and columns

In df.loc[<row selection>, <column selection>]:

There are many, many more – see the pandas documentation for more.

Don't forget iloc!

Practice Questions

Consider the DataFrame below.

For each of the following pieces of code, predict what the output will be. Then, uncomment the line of code and see for yourself.

Pandas and NumPy

NumPy

pandas is built upon numpy

The dangers of for-loops

Aside: generating data

Next, let's define a function that takes in a DataFrame like the one above and returns the distances between each point and the origin, using a for-loop.

The %timeit magic command can repeatedly run any snippet of code and give us its average runtime.

Now, using a vectorized approach:

Note that "µs" refers to microseconds, which are one-millionth of a second, whereas "ms" refers to milliseconds, which are one-thousandth of a second.

Takeaway: avoid for-loops whenever possible!

pandas data types

pandas data types

Pandas dtype Python type NumPy type SQL type Usage
int64 int int_, int8,...,int64, uint8,...,uint64 INT, BIGINT Integer numbers
float64 float float_, float16, float32, float64 FLOAT Floating point numbers
bool bool bool_ BOOL True/False values
datetime64 NA datetime64[ns] DATETIME Date and time values
timedelta[ns] NA NA NA Differences between two datetimes
category NA NA ENUM Finite list of text values
object str string, unicode NA Text
object NA object NA Mixed types

This article details how pandas stores different data types under the hood.

Type conversion and the underlying numpy array(s)

What do you think is happening here?

⚠️ Warning: numpy and pandas don't always make the same decisions!

numpy prefers homogenous data types to optimize memory and read/write speed. This leads to type coercion. Notice that the array created below contains only strings, even though there was an int in the argument list.

On the other hand, pandas likes correctness and ease-of-use. The Series created below is of type object, which preserves the original data types in the argument list.

You can specify the data type of an array when initializing it by using the dtype argument.

pandas does make some trade-offs for efficiency, however. For instance, a Series consisting of both ints and floats is coerced to the float64 data type.

Type conversion

You can change the data type of a Series using the .astype Series method.

Performance and memory management

As we just discovered,

To demonstrate, let's create a large array in which all of the entries are non-negative numbers less than 255, meaning that they can be represented with 8 bits (i.e. as np.uint8s, where the "u" stands for "unsigned").

When we tell pandas to use a dtype of uint8, the size of the resulting DataFrame is under a megabyte.

But by default, even though the numbers are only 8-bit, pandas uses the int64 dtype, and the resulting DataFrame is over 7 megabytes large.

Useful Series and DataFrame methods

Shared methods and attributes

Series methods

Method Name Description
count Returns the number of non-null entries in the Series
unique Returns the unique values in the Series
nunique Returns the number of unique values in the Series
value_counts Returns a Series of counts of unique values
describe Returns a Series of descriptive stats of values

DataFrame methods

The following piece of code works, but is meaningless. Why?

Even more DataFrame methods

Method Name Description
sort_values Returns a DataFrame sorted by the specified column
drop_duplicates Returns a DataFrame with duplicate values dropped
describe Returns descriptive stats of the DataFrame

Adding and modifying columns, using a copy

As an aside, you should try your best to write chained pandas code, as follows:

You can chain together several steps at a time:

You can also use assign when the desired column name has spaces, by using keyword arguments.

⚠️ Warning!

Adding and modifying columns, in-place

Example: San Diego employee salaries (again)

Note: We probably won't finish looking at all of this code in lecture, but we will leave it here for you as a reference.

Reading the data

Let's work with the same dataset that we did in Lecture 1, using our new knowledge of pandas.

Data cleaning

Current issues with the dataset:

Fixing the 'Other Pay' column

It appears that most of the values in the 'Other Pay' column are strings containing numbers. Which values are not numbers?

We can keep just the rows where the 'Other Pay' is numeric, and then convert the 'Other Pay' column to float.

The line of code above is correct, but it errors if you run it more than once. Why? 🤔

Full-time vs. part-time

What happens when we use normalize=True with value_counts?

Salary analysis

Question: Is 'Total Pay' equal to the sum of 'Base Pay', 'Overtime Pay', and 'Other Pay'?

We can answer this by summing the latter three columns and seeing if the resulting Series equals the former column.

Similarly, we might ask whether 'Total Pay & Benefits' is truly the sum of 'Total Pay' and 'Benefits'.

Visualization

Think of your own questions about the dataset, and try and answer them!

Summary, next time

Summary