Lecture 3 – More DataFrame Fundamentals

DSC 80, Spring 2023

Agenda

Recap: loc and iloc

Example: Universities in California 📚

Recall, last lecture we started working with a dataset that contains the name, location, enrollment, and founding date of most UCs and CSUs.

loc and iloc with the default index

What's the difference between the two DataFrames below?

Which of the following two expressions evaluates to the name of the youngest school in schools?
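As a minimal sketch of why the distinction matters, the example below uses a small, made-up stand-in for schools (the names and years are hypothetical, not the lecture's data). After sorting, positions and index labels no longer agree, so iloc and loc return different rows.

```python
import pandas as pd

# Hypothetical stand-in for the lecture's `schools` DataFrame.
schools = pd.DataFrame({
    'Name': ['School A', 'School B', 'School C'],
    'Founded': [1868, 2005, 1960],
})

# Sorting keeps the original index labels attached to each row.
youngest_first = schools.sort_values('Founded', ascending=False)
print(youngest_first.index.tolist())   # [1, 2, 0]

# iloc is position-based: position 0 is the first row after sorting.
print(youngest_first.iloc[0]['Name'])  # School B, the youngest

# loc is label-based: label 0 still refers to the original first row.
print(youngest_first.loc[0, 'Name'])   # School A, the oldest
```

Calling reset_index(drop=True) after sorting would make the labels match the positions again.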

Adding and modifying columns

Adding and modifying columns, using a copy

As an aside, you should try your best to write chained pandas code, as follows:

You can also use assign when the desired column name has spaces, by unpacking a dictionary into keyword arguments with **.
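A short sketch of both ideas, assuming a hypothetical three-row schools DataFrame and made-up column names ('Age', 'Years Since Founding'): assign returns a new DataFrame, which is what makes chaining possible, and a dictionary unpacked with ** handles names that are not valid Python identifiers.

```python
import pandas as pd

# Hypothetical stand-in for the lecture's `schools` DataFrame.
schools = pd.DataFrame({
    'Name': ['School A', 'School B', 'School C'],
    'Founded': [1868, 1960, 2005],
})

# Chained style: each method returns a new DataFrame, so calls can be stacked.
result = (
    schools
    .assign(Age=2023 - schools['Founded'])
    .sort_values('Age')
    .reset_index(drop=True)
)

# A name with spaces can't be written as a keyword, so unpack a dictionary.
with_spaces = schools.assign(**{'Years Since Founding': 2023 - schools['Founded']})

# The original DataFrame is untouched, since assign works on a copy.
print(schools.columns.tolist())  # ['Name', 'Founded']
```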

Adding and modifying columns, in-place

Note that we never reassigned schools_copy in the two cells above – that is, we never wrote schools_copy = ... – though it was still modified.

Mutability

DataFrames, like lists, arrays, and dictionaries, are mutable. As you learned in DSC 20, this means that they can be modified after being created.

Not only does this explain the behavior on the previous slide, but it also explains the following:

Note that schools was modified, even though we didn't reassign it! These unintended consequences can influence the behavior of test cases on labs and projects, among other things!

To avoid this, it's a good idea to include df = df.copy() as the first line in functions that take DataFrames as input.
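To illustrate, here is a minimal sketch with two hypothetical helper functions (add_age_in_place and add_age_safely are made-up names, as is the data). The first modifies the caller's DataFrame as a side effect; the second follows the df = df.copy() advice and leaves it alone.

```python
import pandas as pd

# Hypothetical stand-in for the lecture's `schools` DataFrame.
schools = pd.DataFrame({
    'Name': ['School A', 'School B'],
    'Founded': [1868, 1960],
})

def add_age_safely(df):
    # Work on a copy so the caller's DataFrame is not modified.
    df = df.copy()
    df['Age'] = 2023 - df['Founded']
    return df

def add_age_in_place(df):
    # No copy: this column assignment changes the caller's DataFrame too.
    df['Age'] = 2023 - df['Founded']
    return df

safe_result = add_age_safely(schools)
print('Age' in schools.columns)   # False: schools is unchanged

add_age_in_place(schools)
print('Age' in schools.columns)   # True: schools was modified
```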

What about rows?

You can add and modify rows using loc and iloc. There's also a function that can be used to add rows, called pd.concat; we'll see it in a few lectures.
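A small sketch of row edits, using made-up data. Note one assumption worth stating: in this sketch only loc is used to add a row (by assigning to a label that doesn't exist yet), while iloc is used to modify an existing position.

```python
import pandas as pd

# Hypothetical stand-in for the lecture's `schools` DataFrame.
schools = pd.DataFrame({
    'Name': ['School A', 'School B', 'School C'],
    'Founded': [1868, 1960, 2005],
})

# Modify an existing value by position with iloc (row 0, column 1).
schools.iloc[0, 1] = 1869

# Modify an existing value by label with loc.
schools.loc[1, 'Name'] = 'School B (renamed)'

# Add a row: assigning to a new label with loc enlarges the DataFrame.
schools.loc[3] = ['School D', 1990]

print(schools.shape)  # (4, 2)
```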

Axes

DataFrame methods with axis

Consider the DataFrame A defined below using a dictionary.

If we specify axis=0, A.sum will "compress" along axis 0, and keep the column labels intact.

If we specify axis=1, A.sum will "compress" along axis 1, and keep the row labels (index) intact.

What's the default axis?
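A minimal sketch of the three cases, assuming a small DataFrame A with made-up values and column names (the lecture's actual A is not shown here):

```python
import pandas as pd

# Hypothetical values; the column names 'x', 'y', 'z' are assumptions.
A = pd.DataFrame({
    'x': [1, 4],
    'y': [2, 5],
    'z': [3, 6],
})

# axis=0 collapses the rows: one sum per column, labeled by column name.
col_sums = A.sum(axis=0)
print(col_sums.tolist(), col_sums.index.tolist())  # [5, 7, 9] ['x', 'y', 'z']

# axis=1 collapses the columns: one sum per row, labeled by the index.
row_sums = A.sum(axis=1)
print(row_sums.tolist(), row_sums.index.tolist())  # [6, 15] [0, 1]

# With no axis given, sum behaves like axis=0.
print(A.sum().equals(col_sums))  # True
```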

DataFrame methods with axis

Discussion Question

In words, what characteristic do all schools in the following DataFrame share?

schools[schools.nunique(axis=1) != schools.nunique(axis=1).max()]

Hint: What city is SDSU in? What county is it in?

pandas and numpy

numpy

pandas is built upon numpy

Even though conv appears to be "detached" from ser, it is not:
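One way to check this, sketched under the assumption that conv was produced by calling to_numpy on ser (the values below are made up): np.shares_memory reports whether two arrays are backed by the same buffer. Whether a write through conv is allowed, and whether it shows up in ser, depends on the pandas version and its copy-on-write settings, so the sketch only inspects memory and does not write.

```python
import numpy as np
import pandas as pd

ser = pd.Series([10, 20, 30])   # hypothetical values
conv = ser.to_numpy()           # assumed way of producing `conv`

# The result is a plain numpy array, not a pandas object.
print(type(conv))

# For a numeric Series like this one, no copy is made by default,
# so the array and the Series are backed by the same memory.
print(np.shares_memory(conv, ser.to_numpy()))  # True
```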

The dangers of for-loops

Aside: Generating data

Next, let's define a function that takes in a DataFrame like coordinates and returns the distances between each point and the origin, using a for-loop.

The %timeit magic command can repeatedly run any snippet of code and give us its average runtime.

Now, using a vectorized approach:

Note that "µs" refers to microseconds, which are one-millionth of a second, whereas "ms" refers to milliseconds, which are one-thousandth of a second.

Takeaway: Avoid for-loops whenever possible!
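The comparison above can be sketched as follows. The data is randomly generated, and the column names 'x' and 'y' and both function names are assumptions made for illustration. In a Jupyter notebook, prefixing each call with %timeit would show the difference in runtime; no timings are claimed here.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 1,000 random points in the unit square.
rng = np.random.default_rng(80)
coordinates = pd.DataFrame({
    'x': rng.random(1000),
    'y': rng.random(1000),
})

def distances_loop(df):
    # Visit one row at a time and compute its distance from the origin.
    out = []
    for _, row in df.iterrows():
        out.append((row['x'] ** 2 + row['y'] ** 2) ** 0.5)
    return pd.Series(out, index=df.index)

def distances_vectorized(df):
    # Operate on whole columns at once; the looping happens inside numpy.
    return np.sqrt(df['x'] ** 2 + df['y'] ** 2)

# Both approaches produce the same distances.
print(np.allclose(distances_loop(coordinates), distances_vectorized(coordinates)))
```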

pandas data types

| pandas dtype | Python type | NumPy type | SQL type | Usage |
|---|---|---|---|---|
| int64 | int | int_, int8,...,int64, uint8,...,uint64 | INT, BIGINT | Integer numbers |
| float64 | float | float_, float16, float32, float64 | FLOAT | Floating point numbers |
| bool | bool | bool_ | BOOL | True/False values |
| datetime64 | NA | datetime64[ns] | DATETIME | Date and time values |
| timedelta64[ns] | NA | NA | NA | Differences between two datetimes |
| category | NA | NA | ENUM | Finite list of text values |
| object | str | string, unicode | NA | Text |
| object | NA | object | NA | Mixed types |

This article details how pandas stores different data types under the hood.

What do you think is happening here? 🚰

Read this article for a discussion of how numpy/pandas int64 operations differ from vanilla int operations.

⚠️ Warning: numpy and pandas don't always make the same decisions!

numpy prefers homogeneous data types to optimize memory and read/write speed. This leads to type coercion.

Notice that the array created below contains only strings, even though there was an int in the argument list.

On the other hand, pandas likes correctness and ease-of-use. The Series created below is of type object, which preserves the original data types in the argument list.

You can specify the data type of an array when initializing it by using the dtype argument.

pandas does make some trade-offs for efficiency, however. For instance, a Series consisting of both ints and floats is coerced to the float64 data type.
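The four behaviors described above can be checked with a short sketch; the values are arbitrary and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# numpy coerces a mixed list to a single type: here, everything becomes a string.
arr = np.array(['a', 1])
print(arr.dtype.kind)      # 'U', i.e. a unicode string dtype
print(arr.tolist())        # ['a', '1']

# pandas keeps the original Python objects in an object Series.
ser = pd.Series(['a', 1])
print(ser.dtype)           # object
print(type(ser.iloc[1]))   # <class 'int'>

# Passing dtype=object tells numpy to keep the original objects too.
arr_obj = np.array(['a', 1], dtype=object)
print(type(arr_obj[1]))    # <class 'int'>

# pandas still coerces when the types are all numeric: ints and floats become float64.
mixed = pd.Series([1, 2.5])
print(mixed.dtype)         # float64
```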

Type conversion

You can change the data type of a Series using the .astype Series method.

For instance, we can change the data type of the 'Enrollment' column in schools to be int64, once we remove the commas.
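A minimal sketch of that conversion, assuming 'Enrollment' holds strings with thousands separators (the numbers below are made up):

```python
import pandas as pd

# Hypothetical stand-in for the lecture's `schools` DataFrame.
schools = pd.DataFrame({
    'Name': ['School A', 'School B', 'School C'],
    'Enrollment': ['31,842', '45,057', '9,110'],
})

# Strip the commas with the .str accessor, then convert the strings to integers.
enrollment = schools['Enrollment'].str.replace(',', '').astype('int64')
schools = schools.assign(Enrollment=enrollment)

print(schools['Enrollment'].dtype)   # int64
print(schools['Enrollment'].sum())   # 86009
```

Calling astype directly on the original strings would raise an error, since '31,842' is not a valid integer literal.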

Performance and memory management

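As we just discovered, numpy optimizes for memory and speed, while pandas tends to favor correctness and ease of use, which can cost extra memory.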

To demonstrate, let's create a large array in which all of the entries are non-negative numbers less than 255, meaning that they can be represented with 8 bits (i.e. as np.uint8s, where the "u" stands for "unsigned").

When we tell pandas to use a dtype of uint8, the size of the resulting DataFrame is under a megabyte.

But by default, even though the numbers are only 8-bit, pandas uses the int64 dtype, and the resulting DataFrame is over 7 megabytes large.
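A sketch of this comparison, assuming one million randomly generated values in a single column named 'value' (both the size and the column name are assumptions). The values start as a plain Python list so that pandas, not numpy, picks the default dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one million integers between 0 and 254, as a Python list.
rng = np.random.default_rng(80)
values = rng.integers(0, 255, size=1_000_000).tolist()

# Explicitly request 8-bit unsigned integers: 1 byte per entry.
small = pd.DataFrame({'value': values}, dtype=np.uint8)

# Let pandas choose: it uses int64, which takes 8 bytes per entry.
big = pd.DataFrame({'value': values})

small_bytes = small.memory_usage(index=False).sum()
big_bytes = big.memory_usage(index=False).sum()

print(small['value'].dtype, small_bytes)  # uint8 1000000
print(big['value'].dtype, big_bytes)      # int64 8000000
```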

Aside: std

To compute the standard deviation of a Series, we can use either the Series .std method or the np.std function.

Let's try both. What do you notice?

Aside: std

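The two use different delta degrees of freedom (ddof) by default.

The pandas .std method defaults to ddof=1, which gives the sample standard deviation:

$$\text{SD} = \sqrt{\frac{\sum_{i = 1}^n (x_i - \bar{x})^2}{n - 1}}$$

np.std defaults to ddof=0, which gives the population standard deviation:

$$\text{SD} = \sqrt{\frac{\sum_{i = 1}^n (x_i - \bar{x})^2}{n}}$$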

Be careful!
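A quick check on a tiny, made-up Series shows the two defaults disagreeing, and shows that passing ddof explicitly makes them agree. The array is extracted with to_numpy first so that np.std operates on a plain numpy array.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, 2, 3, 4])   # hypothetical values; mean is 2.5

# pandas divides the sum of squared deviations (5) by n - 1 = 3.
pandas_sd = ser.std()
print(pandas_sd)                # about 1.291

# numpy divides the same sum by n = 4.
numpy_sd = np.std(ser.to_numpy())
print(numpy_sd)                 # about 1.118

# Setting ddof explicitly removes the discrepancy.
print(np.isclose(ser.std(ddof=0), numpy_sd))                 # True
print(np.isclose(np.std(ser.to_numpy(), ddof=1), pandas_sd)) # True
```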

Extra: Data cleaning and plotly

Note: We may not get to these slides in lecture; refer to them for extra examples.

Example: Universities in California 📚

Let's return to schools. Towards the end of the last section, we fixed the data type of the 'Enrollment' column to be int64, which means we can now perform calculations with it.

Enrollment vs. year founded

plotly

plotly is a plotting library that creates interactive graphs. It's not included in your dsc80 conda environment, so you'll need to pip install it.

Enrollment vs. year founded, but interactive

You can even create plotly plots by default by setting pandas' plotting backend to plotly:

Summary, next time
