Lecture 9 – Combining Data

DSC 80, Spring 2022

Announcements

Agenda

Good resource: pandas User Guide.

Recap: Concatenating vertically

Example: Grades

By default, pd.concat stacks DataFrames row-wise, i.e. on top of one another.

Let's use pd.concat on a list of the above two DataFrames.
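For instance, here is a minimal sketch with two made-up grade DataFrames standing in for the ones above:

import pandas as pd

# Hypothetical stand-ins for the two grade DataFrames above.
section_a = pd.DataFrame({'Name': ['Ava', 'Ben'], 'Midterm': [90, 85]})
section_b = pd.DataFrame({'Name': ['Cam', 'Dee'], 'Midterm': [78, 92]})

# By default (axis=0), pd.concat stacks the DataFrames in the list on
# top of one another.
combined = pd.concat([section_a, section_b])

# The original row labels are kept, so the result has duplicate index
# values (0, 1, 0, 1); pass ignore_index=True to renumber the rows.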

⚠️ Warning: No loops!

total = pd.DataFrame()
for df in dataframes:
    # Don't do this! Each iteration copies every row accumulated so far,
    # so building up a DataFrame in a loop is slow.
    total = pd.concat([total, df])

Concatenating horizontally

Example: Grades (again)

Suppose we have two DataFrames, exams and assignments, which both contain different attributes for the same individuals.

If we try to combine these DataFrames with pd.concat, we don't quite get what we're looking for.

But that's where the axis argument comes in handy.

Remember, most pandas operations default to axis=0, but here we want to concatenate the columns of exams to the columns of assignments, so we should use axis=1.

Note that the 'Name' column appears twice!
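As a rough sketch, assuming small stand-ins for exams and assignments:

import pandas as pd

# Hypothetical versions of exams and assignments for the same students.
exams = pd.DataFrame({'Name': ['Ava', 'Ben'], 'Final': [88, 91]})
assignments = pd.DataFrame({'Name': ['Ava', 'Ben'], 'Homework': [95, 80]})

# axis=1 places the DataFrames side-by-side, aligning rows by index.
side_by_side = pd.concat([exams, assignments], axis=1)

# side_by_side contains two 'Name' columns, one from each DataFrame.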

Concatenating horizontally

Note that the call to pd.concat below works as expected, even though the orders of the names in exams_by_name and assignments_by_name are different.

Remember that pd.concat only looks at the index when combining rows, not at any other columns.
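A sketch of that behavior, assuming exams_by_name and assignments_by_name are small DataFrames indexed by name:

import pandas as pd

exams_by_name = pd.DataFrame({'Final': [88, 91]}, index=['Ava', 'Ben'])
assignments_by_name = pd.DataFrame({'Homework': [80, 95]}, index=['Ben', 'Ava'])

# Even though the names appear in different orders, pd.concat aligns
# rows by index, so each student's scores end up in the same row.
aligned = pd.concat([exams_by_name, assignments_by_name], axis=1)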

Missing rows?

If we concatenate two DataFrames horizontally that don't share all of their row indexes, NaNs appear in the rows that exist in only one of the DataFrames.
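For example, again with made-up data:

import pandas as pd

exams_by_name = pd.DataFrame({'Final': [88, 91]}, index=['Ava', 'Ben'])
assignments_by_name = pd.DataFrame({'Homework': [95, 70]}, index=['Ava', 'Cam'])

# 'Ben' has no assignments row and 'Cam' has no exams row, so those
# cells are NaN in the result.
pd.concat([exams_by_name, assignments_by_name], axis=1)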

Summary: pd.concat

Joining and merging

Concatenating horizontally

Joins

Example

Here, the join key is 'Player Id'.

The merge method

Let's work with a small example.

We didn't specify which columns to merge on, so it defaulted to 'City'.
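A minimal sketch of this behavior, using hypothetical city DataFrames (the actual lecture data may differ):

import pandas as pd

temps = pd.DataFrame({'City': ['San Diego', 'Toronto', 'Rome'],
                      'Temperature': [76, 28, 56]})
countries = pd.DataFrame({'City': ['San Diego', 'Toronto', 'Shanghai'],
                          'Country': ['USA', 'Canada', 'China']})

# With no `on` argument, merge joins on all columns the two DataFrames
# share — here, just 'City' — and performs an inner join by default.
temps.merge(countries)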

Join types: inner joins

Different join types handle mismatches differently

There are four types of joins: inner, outer, left, and right.

Examples of join types

Let's try an outer join.

Note the NaNs in the rows for 'Rome' and 'Shanghai'.

Also note that an outer join is what pd.concat does by default, when there are no duplicated keys in either DataFrame.

Let's try left and right joins.

First, a left join.

How about a right join?

Note that a.merge(b, how='left') is the same as b.merge(a, how='right'). The only difference is the order of the columns in the result.
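To illustrate, here is a sketch using the same hypothetical city DataFrames as before:

import pandas as pd

temps = pd.DataFrame({'City': ['San Diego', 'Toronto', 'Rome'],
                      'Temperature': [76, 28, 56]})
countries = pd.DataFrame({'City': ['San Diego', 'Toronto', 'Shanghai'],
                          'Country': ['USA', 'Canada', 'China']})

# Outer join: keeps every city from both DataFrames, with NaNs where a
# city is missing from one of them ('Rome' and 'Shanghai' here).
temps.merge(countries, how='outer')

# Left join: keeps every city in temps; right join keeps every city in countries.
temps.merge(countries, how='left')
temps.merge(countries, how='right')

# temps.merge(countries, how='left') contains the same information as
# countries.merge(temps, how='right'); only the column order differs.
countries.merge(temps, how='right')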

Specifying join keys

This is not what we're looking for.

Instead, we need to tell pandas to look in the 'Name' column of exams and 'Student' column of overall.

If there are shared column names in the two DataFrames you are merging that you are not using as join keys, by default '_x' and '_y' are appended to their names.

If the desired join key is in the index, set left_index=True or right_index=True.
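A sketch of both options, with hypothetical exams and overall DataFrames:

import pandas as pd

# Hypothetical DataFrames whose join keys have different column names.
exams = pd.DataFrame({'Name': ['Ava', 'Ben'], 'Final': [88, 91]})
overall = pd.DataFrame({'Student': ['Ava', 'Ben'], 'Grade': ['A', 'B+']})

# left_on / right_on tell merge which column to use in each DataFrame.
merged = exams.merge(overall, left_on='Name', right_on='Student')

# If the key lives in the index instead, use left_index / right_index.
merged_by_index = exams.set_index('Name').merge(
    overall, left_index=True, right_on='Student'
)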

Many-to-one & many-to-many joins

One-to-one joins

Many-to-one joins

Note that when merging profs and schools, the information from schools is duplicated ('University of California, San Diego' appears twice and 'University of California, Berkeley' appears three times).
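Roughly, with made-up stand-ins for profs and schools:

import pandas as pd

# Hypothetical profs and schools DataFrames (a sketch of the idea, not
# the actual lecture data).
profs = pd.DataFrame({'Name': ['Sam', 'Janine', 'Marina', 'Justin', 'Aaron'],
                      'School': ['UCSD', 'UCSD', 'Berkeley', 'Berkeley', 'Berkeley']})
schools = pd.DataFrame({'School': ['UCSD', 'Berkeley'],
                        'Founded': [1960, 1868]})

# Many-to-one: each row of schools matches several rows of profs, so the
# school-level information is repeated once per matching professor.
profs.merge(schools, on='School')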

Many-to-many joins

Many-to-many joins are joins where both DataFrames have duplicate values in the join key.

Before running the following cell, try predicting the number of rows in the output.
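As a sketch of the counting logic, using tiny made-up DataFrames:

import pandas as pd

left = pd.DataFrame({'Key': ['a', 'a', 'b'], 'L': [1, 2, 3]})
right = pd.DataFrame({'Key': ['a', 'a', 'a', 'b'], 'R': [4, 5, 6, 7]})

# For each key, the output contains every pairing of a left row with a
# right row: 'a' contributes 2 * 3 = 6 rows and 'b' contributes 1 * 1 = 1,
# so the merge has 7 rows in total.
left.merge(right, on='Key')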

Example: SDPD vehicle stops

Aside: accessing file names programmatically

Loading in the data

We need to concatenate these two DataFrames vertically.
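One way to do both steps, assuming the stop files live in a hypothetical data/ folder with names like vehicle_stops_*.csv:

import pandas as pd
from pathlib import Path

# Collect the file paths programmatically rather than typing them out.
stop_files = sorted(Path('data').glob('vehicle_stops_*.csv'))

# Read each file and stack the resulting DataFrames on top of one another.
stops = pd.concat([pd.read_csv(path) for path in stop_files], ignore_index=True)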

Investigating races

Right now, 'subject_race' is stored as a single character. What does 'I' mean? 'H'?

Fortunately, we have access to another dataset that describes each of the race codes.

Let's join the distribution of races with the DataFrame of race codes.
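A sketch of that join, using tiny stand-ins for stops and races (the column names 'subject_race', 'Race Code', and 'Description' are taken from the data described above):

import pandas as pd

# Tiny stand-ins for the stops data and the race-codes DataFrame.
stops = pd.DataFrame({'subject_race': ['B', 'B', 'W', 'C', 'F']})
races = pd.DataFrame({'Race Code': ['B', 'C', 'F', 'W'],
                      'Description': ['BLACK', 'CHINESE', 'FILIPINO', 'WHITE']})

# Distribution of the single-character race codes among the stops.
race_counts = stops['subject_race'].value_counts().to_frame('Count')

# Attach the human-readable description for each code.
race_counts.merge(races, left_index=True, right_on='Race Code')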

Question: Is this a one-to-one join?

The level of granularity of the races in our data right now seems inconsistent. For instance, 'WHITE' and 'BLACK' are much more broad than 'FILIPINO', 'JAPANESE', and 'GUAMANIAN'.

Adjusting granularity

Let's try and adjust our race data so that we have a consistent level of granularity. Here's what we want to create:

   Race Code  Description  Race_Category
0  A          OTHER ASIAN  Asian
1  B          BLACK        Black
2  C          CHINESE      Asian
3  D          CAMBODIAN    Asian
4  F          FILIPINO     Asian

We can do this by manually defining a mapping between race codes and desired categories.

There are two ways to replace all of the 'Race Code' values in races with the above categories: by joining races with a DataFrame that maps each code to a category, or by using the replace method with a dictionary.

Joining requires sorting, whereas replacing does not. Let's go with the first option.
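A sketch of the join-based approach, using a stand-in for races based on the preview above and a hypothetical code_to_category mapping:

import pandas as pd

# races: one row per race code (stand-in based on the preview above).
races = pd.DataFrame({'Race Code': ['A', 'B', 'C', 'D', 'F'],
                      'Description': ['OTHER ASIAN', 'BLACK', 'CHINESE',
                                      'CAMBODIAN', 'FILIPINO']})

# Manually defined mapping from race codes to broader categories.
code_to_category = pd.DataFrame({'Race Code': ['A', 'B', 'C', 'D', 'F'],
                                 'Race_Category': ['Asian', 'Black', 'Asian',
                                                   'Asian', 'Asian']})

# Joining races with the mapping adds a 'Race_Category' column, matching
# the preview above. (Using .replace with a dict is the other option.)
races = races.merge(code_to_category, on='Race Code')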

Now, we need to join stops with races. An important question is, what type of join should we use (inner, outer, left, right)?

So that we don't discard the individuals whose races we don't have, we will use a left join.

Now we can compute a more meaningful distribution of races.
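Roughly, with tiny stand-ins for stops and the category-labeled races:

import pandas as pd

# Tiny stand-ins for stops and the races DataFrame with 'Race_Category'.
stops = pd.DataFrame({'stop_id': [1, 2, 3, 4],
                      'subject_race': ['B', 'C', 'F', 'Z']})  # 'Z' has no code
races = pd.DataFrame({'Race Code': ['B', 'C', 'F'],
                      'Race_Category': ['Black', 'Asian', 'Asian']})

# A left join keeps every stop, even those whose race code we can't decode.
stops = stops.merge(races, left_on='subject_race', right_on='Race Code', how='left')

# Distribution of the broader race categories.
stops['Race_Category'].value_counts(normalize=True)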

That's a bit more helpful.

Aside: Working with time series data

Time series – why now?

Datetime types

When working with time data, you will see two different kinds of "times": datetimes, which refer to particular moments in time, and timedeltas, which refer to durations (amounts of time).

The datetime module

Python has a built-in datetime module, which contains datetime and timedelta types. These are much more convenient to work with than strings that contain times.

Recall, Unix timestamps count the number of seconds since January 1st, 1970.
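For example:

from datetime import datetime, timedelta

# A Unix timestamp is the number of seconds since January 1st, 1970 (UTC).
moment = datetime.fromtimestamp(1651000000)

# timedelta objects represent durations; adding one to a datetime
# produces a new datetime.
later = moment + timedelta(days=3, hours=4)
print(later - moment)   # 3 days, 4:00:00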

Times in pandas

Timestamps have time-related attributes, e.g. dayofweek, hour, minute, and second.

Subtracting timestamps yields pd.Timedelta objects.
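A small sketch in pandas:

import pandas as pd

# pd.to_datetime parses strings (and Unix timestamps) into pd.Timestamp objects.
start = pd.to_datetime('2022-04-27 08:00:00')
end = pd.to_datetime('2022-04-27 10:30:00')

print(start.dayofweek, start.hour, start.minute, start.second)

# Subtracting Timestamps yields a pd.Timedelta.
duration = end - start
print(duration)   # 0 days 02:30:00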

Timestamps in DataFrames

Example: Exam speeds

Below, we have the Final Exam starting and ending times for two sections of a course.

Question: Who finished the exam the fastest amongst all students in the course?

Approach:

  1. Concatenate the two DataFrames.
  2. Convert the time columns to pd.Timestamp.
  3. Find the difference between 'finish_exam' and 'start_exam'.
  4. Sort.
  5. Pick the fastest exam taker.
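A rough sketch of these steps, with hypothetical section DataFrames (the 'start_exam' and 'finish_exam' column names come from the approach above; 'name' and the times are assumptions):

import pandas as pd

# Hypothetical section DataFrames; the real ones have many more students.
section_a = pd.DataFrame({'name': ['Ava', 'Ben'],
                          'start_exam': ['15:00:00', '15:02:00'],
                          'finish_exam': ['17:58:00', '17:30:00']})
section_b = pd.DataFrame({'name': ['Cam'],
                          'start_exam': ['11:30:00'],
                          'finish_exam': ['13:01:00']})

# 1. Concatenate the two sections vertically.
exams = pd.concat([section_a, section_b], ignore_index=True)

# 2. Convert the time columns to pd.Timestamp.
exams['start_exam'] = pd.to_datetime(exams['start_exam'])
exams['finish_exam'] = pd.to_datetime(exams['finish_exam'])

# 3.-5. Compute each student's duration, sort, and take the fastest.
exams['duration'] = exams['finish_exam'] - exams['start_exam']
fastest = exams.sort_values('duration').iloc[0]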

Summary, next time

Summary