Lecture 6 – Concatenating and Merging

DSC 80, Spring 2023

Agenda

Aside: Working with time series data

Time series – why now?

Datetime types

When working with time data, you will see two different kinds of "times":

The datetime module

Python has an in-built datetime module, which contains datetime and timedelta types. These are much more convenient to deal with than strings that contain times.

Unix timestamps count the number of seconds since January 1st, 1970.

Times in pandas

Timestamps have time-related attributes, e.g. dayofweek, hour, min, sec.

Subtracting timestamps yields pd.Timedelta objects.

Example: Exam speeds

Below, we have the Final Exam starting and ending times for two sections of a course.

Question: Who took the longest time to finish the exam?

Concatenating vertically

Example: Grades

Consider the students from our previous example. Suppose their grades are given to us in separate DataFrames. Note that these DataFrames contain the same attributes, but for different individuals.

Question: How do we combine both DataFrames into a single, larger DataFrame?

Concatenating vertically

Example: Grades

By default, pd.concat takes a list of DataFrames and stacks them row-wise, i.e. on top of one another.

Setting the optional argument ignore_index to True fixes the index (which .reset_index() also could do).

To keep track of which original DataFrame each row came from, we can use the keys optional argument, though if we do this, the resulting DataFrame has a MultiIndex.

Adding a single row

To add a single row to a DataFrame, create a new DataFrame that contains the single row, and use pd.concat.

The DataFrame append method does exist, though it's deprecated.

Missing columns?

If we concatenate two DataFrames that don't share the same column names, NaNs are added in the columns that aren't shared.

⚠️ Warning: No loops!

total = pd.DataFrame()
for df in dataframes:
    total = total.concat(df)

Aside: Accessing file names programmatically

The following does something similar, but in the shell.

Concatenating horizontally

Example: Grades (again)

Suppose we have two DataFrames, exams and assignments, which both contain different attributes for the same individuals.

If we try to combine these DataFrames with pd.concat, we don't quite get what we're looking for.

But that's where the axis argument becomes handy.

Remember, most pandas operations default to axis=0, but here we want to concatenate the columns of exams to the columns of assignments, so we should use axis=1.

Note that the 'Name' column appears twice!

Concatenating horizontally

Note that the call to pd.concat below combines information about each individual correctly, even though the orders of the names in exams_by_name and assignments_by_name are different.

Remember that pd.concat only looks at the index when combining rows, not at any other columns.

Summary: pd.concat

Merging

Joining

Let's work with a small example.

We'd like to combine both DataFrames, but it's not immediately clear if pd.concat would be useful.

It turns out that the right tool to use is the merge method.

The merge method

Join types: inner joins

Different join types

We can change the type of join performed by changing the how argument in merge. Let's experiment!

Note that an outer join is what pd.concat performs by default, when there are no duplicated keys in either DataFrame.

Different join types handle mismatches differently

There are four types of joins.

Symmetry

Note that a.merge(b, how='left') contains the same information as b.merge(a, how='right'), just in a different order.

Specifying join keys

This is not what we're looking for:

Instead, we need to tell pandas to look in the 'Name' column of exams and 'Student' column of overall.

If there are shared column names in the two DataFrames you are merging that you are not using as join keys, '_x' and '_y' are appended to their names by default.

Many-to-one & many-to-many joins

One-to-one joins

Many-to-one joins

Note that when merging profs and schools, the information from schools is duplicated.

Many-to-many joins

Many-to-many joins are joins where both DataFrames have duplicate values in the join key.

Before running the following cell, try predicting the number of rows in the output.

Summary, next time

Summary

Next time

Cleaning messy, real-world data.