In [ ]:
from dsc80_utils import *
In [ ]:
# Pandas Tutor setup
%reload_ext pandas_tutor
%set_pandas_tutor_options {"maxDisplayCols": 8, "nohover": True, "projectorMode": True}
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

Lecture 4 – Merging, Transforming¶

Merging¶

Example: Name categories¶

This New York Times article claims that certain categories of names are becoming more popular. For example:

  • Forbidden names like Lucifer, Lilith, Kali, and Danger.

  • Evangelical names like Amen, Savior, Canaan, and Creed.

  • Mythological names.

  • It also claims that baby boomer names are becoming less popular.

Let's see if we can verify these claims using data!

Loading in the data¶

Our first DataFrame, baby, has one row for every combination of 'Name', 'Sex', and 'Year'.

In [ ]:
baby_path = Path('data') / 'baby.csv'
baby = pd.read_csv(baby_path)
baby

Our second DataFrame, nyt, contains the New York Times' categorization of each of several names, based on the aforementioned article.

In [ ]:
nyt_path = Path('data') / 'nyt_names.csv'
nyt = pd.read_csv(nyt_path)
nyt

Issue: To find the number of babies born with (for example) forbidden names each year, we need to combine information from both baby and nyt.

Merging¶

  • We want to link rows from baby and nyt together whenever the names match up.
  • This is a merge (pandas term), i.e. a join (SQL term).
  • A merge is appropriate when we have two sources of information about the same individuals that are linked by a common column or columns.
  • The common column(s) are called the join key.
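The idea can be sketched with two tiny, made-up DataFrames (the names and columns here are hypothetical, just for illustration):

```python
import pandas as pd

# Two hypothetical sources of information about the same people,
# linked by the common column 'Name' (the join key).
ages = pd.DataFrame({'Name': ['Ana', 'Ben'], 'Age': [34, 29]})
cities = pd.DataFrame({'Name': ['Ben', 'Cam'], 'City': ['SD', 'LA']})

# Only 'Ben' appears in both join keys, so the default merge
# produces a single row containing his age and city.
merged = ages.merge(cities, on='Name')
```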

Example merge¶

Let's demonstrate on a small subset of baby and nyt.

In [ ]:
nyt_small = nyt.iloc[[11, 12, 14]].reset_index(drop=True)

names_to_keep = ['Julius', 'Karen', 'Noah']
baby_small = (baby
 .query("Year == 2020 and Name in @names_to_keep")
 .reset_index(drop=True)
)

dfs_side_by_side(baby_small, nyt_small)
In [ ]:
%%pt
baby_small.merge(nyt_small, left_on='Name', right_on='nyt_name')

The merge method¶

  • The merge DataFrame method joins two DataFrames by columns or indexes.

    • As mentioned before, "merge" is just the pandas word for "join."
  • When using the merge method, the DataFrame before merge is the "left" DataFrame, and the DataFrame passed into merge is the "right" DataFrame.

    • In baby_small.merge(nyt_small), baby_small is considered the "left" DataFrame and nyt_small is the "right" DataFrame; the columns from the left DataFrame appear to the left of the columns from the right DataFrame.
  • By default:

    • If join keys are not specified, all shared columns between the two DataFrames are used.
    • The "type" of join performed is an inner join.
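A quick sketch of these defaults on toy DataFrames (made up for illustration): with no join key specified, merge joins on all shared column names, and the join performed is inner.

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'y': [20, 30, 40]})

# No join key specified: pandas uses the shared column 'key',
# and performs an inner join, keeping only 'b' and 'c'.
out = left.merge(right)
```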

Join types: inner joins¶

In [ ]:
%%pt
baby_small.merge(nyt_small, left_on='Name', right_on='nyt_name')
  • Note that 'Noah' and 'Freya' do not appear in the merged DataFrame.
  • This is because there is:
    • no 'Noah' in the right DataFrame (nyt_small), and
    • no 'Freya' in the left DataFrame (baby_small).
  • The default type of join that merge performs is an inner join, which keeps the intersection of the join keys.

Different join types¶

We can change the type of join performed by changing the how argument in merge. Let's experiment!

In [ ]:
%%pt
# Note the NaNs!
baby_small.merge(nyt_small, left_on='Name', right_on='nyt_name', how='left')
In [ ]:
%%pt
baby_small.merge(nyt_small, left_on='Name', right_on='nyt_name', how='right')
In [ ]:
%%pt
baby_small.merge(nyt_small, left_on='Name', right_on='nyt_name', how='outer')

Different join types handle mismatches differently¶

There are four types of joins.

  • Inner: keep only matching keys (intersection).
  • Outer: keep all keys in both DataFrames (union).
  • Left: keep all keys in the left DataFrame, whether or not they are in the right DataFrame.
  • Right: keep all keys in the right DataFrame, whether or not they are in the left DataFrame.
    • Note that a.merge(b, how='left') contains the same information as b.merge(a, how='right'), just in a different order.
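One way to see how each join type treats mismatches is merge's indicator=True option, which adds a _merge column labeling each output row 'left_only', 'right_only', or 'both' (toy data for illustration):

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2]})
right = pd.DataFrame({'key': ['b', 'c'], 'y': [20, 30]})

# An outer join keeps every key from both sides; the _merge column
# shows which DataFrame(s) each key came from.
out = left.merge(right, how='outer', indicator=True)
```

Filtering on _merge is a handy way to find the rows an inner join would have dropped.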

Notes on the merge method¶

  • merge is flexible – you can merge using a combination of columns, or the index of the DataFrame.
  • If the two DataFrames have the same column names, pandas will add _x and _y to the duplicated column names to avoid having columns with the same name (change these with the suffixes argument).
  • There is, in fact, a join method, but it's actually a wrapper around merge with fewer options.
  • As always, the documentation is your friend!
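For instance, the suffixes argument and index-based merging look like this (toy data; a sketch of just two of merge's options):

```python
import pandas as pd

a = pd.DataFrame({'key': ['x', 'y'], 'val': [1, 2]})
b = pd.DataFrame({'key': ['x', 'y'], 'val': [10, 20]})

# Both DataFrames have a 'val' column; custom suffixes replace
# the default '_x' and '_y'.
with_suffixes = a.merge(b, on='key', suffixes=('_left', '_right'))

# Merging on the index instead of a column.
by_index = a.set_index('key').merge(b.set_index('key'),
                                    left_index=True, right_index=True)
```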

Lots of pandas operations do an implicit outer join!¶

  • pandas will almost always try to match up index values using an outer join.
  • It won't tell you that it's doing an outer join, it'll just throw NaNs in your result!
In [ ]:
df1 = pd.DataFrame({'a': [1, 2, 3]}, index=['hello', 'dsc80', 'students'])
df2 = pd.DataFrame({'b': [10, 20, 30]}, index=['dsc80', 'is', 'awesome'])
dfs_side_by_side(df1, df2)
In [ ]:
df1['a'] + df2['b']
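If you don't want this alignment behavior, one workaround (a sketch, not the only option) is to strip the index from one side, e.g. with to_numpy, so the addition happens by position:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3]}, index=['hello', 'dsc80', 'students'])
df2 = pd.DataFrame({'b': [10, 20, 30]}, index=['dsc80', 'is', 'awesome'])

# Aligned addition: only the shared label 'dsc80' matches,
# so every other entry in the result is NaN.
aligned = df1['a'] + df2['b']

# Positional addition: converting one side to a plain array
# prevents any index matching.
positional = df1['a'] + df2['b'].to_numpy()
```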

Many-to-one & many-to-many joins¶

One-to-one joins¶

  • So far in this lecture, the joins we have worked with are called one-to-one joins.
  • Neither the left DataFrame (baby_small) nor the right DataFrame (nyt_small) contained any duplicates in the join key.
  • What if there are duplicated join keys, in one or both of the DataFrames we are merging?
In [ ]:
# Run this cell to set up the next example.
profs = pd.DataFrame(
[['Sam', 'UCB', 5],
 ['Sam', 'UCSD', 5],
 ['Janine', 'UCSD', 8],
 ['Marina', 'UIC', 7],
 ['Justin', 'OSU', 5],
 ['Soohyun', 'UCSD', 2],
 ['Suraj', 'UCB', 2]],
    columns=['Name', 'School', 'Years']
)

schools = pd.DataFrame({
    'Abr': ['UCSD', 'UCLA', 'UCB', 'UIC'],
    'Full': ['University of California San Diego', 'University of California, Los Angeles', 'University of California, Berkeley', 'University of Illinois Chicago']
})

programs = pd.DataFrame({
    'uni': ['UCSD', 'UCSD', 'UCSD', 'UCB', 'OSU', 'OSU'],
    'dept': ['Math', 'HDSI', 'COGS', 'CS', 'Math', 'CS'],
    'grad_students': [205, 54, 281, 439, 304, 193]
})

Many-to-one joins¶

  • Many-to-one joins are joins where one of the DataFrames contains duplicate values in the join key.
  • The resulting DataFrame will preserve those duplicate entries as appropriate.
In [ ]:
dfs_side_by_side(profs, schools)

Note that when merging profs and schools, the information from schools is duplicated.

  • 'University of California San Diego' appears three times.
  • 'University of California, Berkeley' appears twice.
In [ ]:
%%pt
profs.merge(schools, left_on='School', right_on='Abr', how='left')

Many-to-many joins¶

Many-to-many joins are joins where both DataFrames have duplicate values in the join key.

In [ ]:
dfs_side_by_side(profs, programs)

Before running the following cell, try predicting the number of rows in the output.

In [ ]:
%%pt
profs.merge(programs, left_on='School', right_on='uni')
  • merge stitched together every UCSD row in profs with every UCSD row in programs.
  • Since there were 3 UCSD rows in profs and 3 in programs, there are $3 \cdot 3 = 9$ UCSD rows in the output. The same applies for all other schools.

Question 🤔

Fill in the blank so that the last statement evaluates to True.

df = profs.merge(programs, left_on='School', right_on='uni')
df.shape[0] == (____).sum()

Don't use merge (or join) in your solution!

In [ ]:
dfs_side_by_side(profs, programs)
In [ ]:
# Your code goes here.

Returning back to our original question¶

Let's find the popularity of baby name categories over time. To start, we'll define a DataFrame that has one row for every combination of 'category' and 'Year'.

In [ ]:
cate_counts = (
    baby
    .merge(nyt, left_on='Name', right_on='nyt_name')
    .groupby(['category', 'Year'])
    ['Count']
    .sum()
    .reset_index()
)
cate_counts
In [ ]:
# We'll talk about plotting code soon!
import plotly.express as px
fig = px.line(cate_counts, x='Year', y='Count',
              facet_col='category', facet_col_wrap=3,
              facet_row_spacing=0.15,
              width=600, height=400)
fig.update_yaxes(matches=None, showticklabels=False)

Transforming¶

Transforming values¶

  • A transformation results from performing some operation on every element in a sequence, e.g. a Series.

  • To do this, we'll typically use the apply method. apply is very flexible – it takes in a function, which itself takes in a single value as input and returns a single value.

In [ ]:
baby
In [ ]:
def number_of_vowels(string):
    return sum(c in 'aeiou' for c in string.lower())

baby['Name'].apply(number_of_vowels)
In [ ]:
# Built-in functions work with apply, too.
baby['Name'].apply(len)

The price of apply¶

Unfortunately, apply runs really slowly!

In [ ]:
%%timeit
baby['Name'].apply(number_of_vowels)
In [ ]:
%%timeit
res = []
for name in baby['Name']:
    res.append(number_of_vowels(name))

Internally, apply actually just runs a for-loop!

So, when possible – say, when applying arithmetic operations – we should work on Series objects directly and avoid apply!

The price of apply¶

In [ ]:
%%timeit
baby['Year'] // 10 * 10 # Rounds down to the nearest multiple of 10.
In [ ]:
%%timeit
baby['Year'].apply(lambda y: y // 10 * 10)

100x slower!

The .str accessor¶

For string operations, pandas provides a convenient .str accessor.

In [ ]:
%%timeit
baby['Name'].str.len()
In [ ]:
%%timeit
baby['Name'].apply(len)

It's very convenient and runs about the same speed as apply!
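A few other common .str methods, shown on a small hypothetical Series (this is just a sample; the pandas string-handling documentation lists many more):

```python
import pandas as pd

names = pd.Series(['Julius', 'Karen', 'Noah'])

lengths = names.str.len()        # length of each string
upper = names.str.upper()        # uppercase each string
has_a = names.str.contains('a')  # Boolean mask (case-sensitive)
```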

Other data representations¶

Representations of tabular data¶

  • In Python, we'll work with DataFrames in pandas.

    • When we say pandas DataFrame, we're talking about the pandas API for its DataFrame objects.
      • API stands for "application programming interface." We'll learn about these more soon.
    • When we say "DataFrame", we're referring to a general way to represent data (rows and columns, with labels for both rows and columns).
  • There are many other ways to work with data tables!

    • Examples: R data frames, SQL databases, spreadsheets, or even matrices from linear algebra.
    • If you know/learn SQL, you'll find many similarities (e.g. slicing columns, filtering rows, grouping, joining, etc.).
    • Relational algebra captures common data operations between many data table systems.
  • Why use DataFrames over something else?

DataFrames vs. spreadsheets¶

  • DataFrames give us data lineage: the code records how the data was changed. Not so in spreadsheets!
  • Using a general-purpose programming language gives us the ability to handle much larger datasets, and we can use distributed computing systems to handle massive datasets.

DataFrames vs. matrices¶

$$\mathbf{X} = \begin{bmatrix} 1 & 0 \\ 0 & 4 \\ 0 & 0 \end{bmatrix}$$
  • Matrices are mathematical objects. They only hold numbers, but have many useful properties (which you've learned about in your linear algebra class, Math 18).
  • Often, we process data from a DataFrame into matrix format for machine learning models. We'll see this more in a few weeks.

DataFrames vs. relations¶

  • Relations are the data representation for relational database systems (e.g. MySQL, PostgreSQL, etc.).
  • Database systems are much better than DataFrames at storing many large data tables and handling concurrency (many people reading and writing data at the same time).
  • Common workflow: load a subset of data in from a database system into pandas, then make a plot.
  • Or: load and clean data in pandas, then store it in a database system for others to use.

Summary¶

  • There is no "formula" to automatically resolve Simpson's paradox! Domain knowledge is important.
  • We've covered most of the primary DataFrame operations: subsetting, aggregating, joining, and transforming.

Next time¶

Data cleaning: applying what we've already learned to real-world, messy data!