# Set up packages for lecture. Don't worry about understanding this code, but
# make sure to run it if you're following along.
import numpy as np
import babypandas as bpd
import pandas as pd
from matplotlib_inline.backend_inline import set_matplotlib_formats
import matplotlib.pyplot as plt
set_matplotlib_formats("svg")
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (10, 5)
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option("display.max_rows", 7)
pd.set_option("display.max_columns", 8)
pd.set_option("display.precision", 2)
from IPython.display import display, IFrame
def show_def():
    src = "https://docs.google.com/presentation/d/e/2PACX-1vRKMMwGtrQOeLefj31fCtmbNOaJuKY32eBz1VwHi_5ui0AGYV3MoCjPUtQ_4SB1f9x4Iu6gbH0vFvmB/embed?start=false&loop=false&delayms=60000&rm=minimal"
    width = 960
    height = 569
    display(IFrame(src, width, height))
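Reminder: Use the DSC 10 Reference Sheet.
So far, we've been using existing functions (e.g. max, np.sqrt, len) and methods (e.g. .groupby, .assign, .plot).
Suppose you drive to a restaurant 🥘 in LA, located exactly 100 miles away. You drive the first 50 miles at 80 miles per hour and the remaining 50 miles at 60 miles per hour. What is your average speed over the whole trip?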
In segment 1, when you drove 50 miles at 80 miles per hour, you drove for $\frac{50}{80}$ hours:
$$\text{speed}_1 = \frac{\text{distance}_1}{\text{time}_1} \implies \text{time}_1 = \frac{\text{distance}_1}{\text{speed}_1} = \frac{50}{80} \text{ hours}$$
Similarly, in segment 2, when you drove 50 miles at 60 miles per hour, you drove for $\text{time}_2 = \frac{50}{60} \text{ hours}$.
Then,
$$\text{average speed} = \frac{50 + 50}{\frac{50}{80} + \frac{50}{60}} \text{ miles per hour}$$
Dividing the numerator and denominator by 50 gives
$$\text{average speed} = \frac{2}{\frac{1}{80} + \frac{1}{60}} \approx 68.57 \text{ miles per hour}$$
The harmonic mean ($\text{HM}$) of two positive numbers, $a$ and $b$, is defined as
$$\text{HM} = \frac{2}{\frac{1}{a} + \frac{1}{b}}$$
It is often used to find the average of multiple rates.
Finding the harmonic mean of 80 and 60 is not hard:
2 / (1 / 80 + 1 / 60)
68.57142857142857
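As a quick check, the sketch below recomputes the average speed directly as total distance divided by total time and compares it with the harmonic-mean formula. The variable names are illustrative and not part of the lecture code.

```python
# Two 50-mile segments, driven at 80 mph and 60 mph.
distance_1, speed_1 = 50, 80
distance_2, speed_2 = 50, 60

# time = distance / speed for each segment.
time_1 = distance_1 / speed_1
time_2 = distance_2 / speed_2

# Average speed over the whole trip: total distance / total time.
average_speed = (distance_1 + distance_2) / (time_1 + time_2)

# Harmonic mean of the two speeds.
hm = 2 / (1 / speed_1 + 1 / speed_2)

# The two values agree up to floating-point rounding.
print(round(average_speed, 4), round(hm, 4))
```

The agreement relies on the two segments having equal length; with unequal segments, the average speed would be a weighted version of this formula.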
But what if we want to find the harmonic mean of 80 and 70? 80 and 90? 20 and 40? This would require a lot of copy-pasting, which is prone to error.
It turns out that we can define our own "harmonic mean" function just once, and re-use it multiple times.
def harmonic_mean(a, b):
    return 2 / (1 / a + 1 / b)
harmonic_mean(80, 60)
68.57142857142857
harmonic_mean(20, 40)
26.666666666666664
Note that we only had to specify how to calculate the harmonic mean once!
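To illustrate the reuse, here is a small sketch that calls the same function on the other pairs mentioned earlier (80 and 70, 80 and 90, 20 and 40). It uses a list and a loop, which are conveniences for this example rather than something the lecture has introduced; the names pairs and results are made up.

```python
def harmonic_mean(a, b):
    return 2 / (1 / a + 1 / b)

# The pairs of rates we want to average.
pairs = [(80, 70), (80, 90), (20, 40)]

# One definition, many calls: no copy-pasting of the formula.
results = [harmonic_mean(a, b) for a, b in pairs]

for (a, b), hm in zip(pairs, results):
    print(a, b, round(hm, 2))
```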
Functions are a way to divide our code into small subparts to prevent us from writing repetitive code. Each time we define our own function in Python, we will use the following pattern.
show_def()
Functions also let us use code without knowing how it works: for example, we can call bpd.read_csv without knowing how it works.
harmonic_mean(20, 40)
26.666666666666664
harmonic_mean(79, 894)
145.17163412127442
harmonic_mean(-2, 4)
-8.0
triple has one parameter, x.
def triple(x):
    return x * 3
When we call triple with the argument 5, within the body of triple, x means 5.
triple(5)
15
We can change the argument we call triple with – we can even call it with strings!
triple(7 + 8)
45
triple('triton')
'tritontritontriton'
The names you choose for a function’s parameters are only known to that function (known as local scope). The rest of your notebook is unaffected by parameter names.
def triple(x):
    return x * 3
triple(7)
21
Since we haven't defined an x outside of the body of triple, our notebook doesn't know what x means.
x
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
/var/folders/28/vs8cp38n1r1520g8bhzr4v5h0000gn/T/ipykernel_51242/32546335.py in <module>
----> 1 x

NameError: name 'x' is not defined
We can define an x outside of the body of triple, but that doesn't change how triple works.
x = 15
# When triple(12) is called, you can pretend
# there's an invisible line inside the body of triple
# that says x = 12.
# The x = 15 above is ignored.
triple(12)
36
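The sketch below makes the scope rule concrete: the x defined in the notebook and the parameter x inside triple are separate names, so calling triple leaves the outer x untouched. The variable result is introduced only for this illustration.

```python
def triple(x):
    # Inside the body, x refers to the argument passed in.
    return x * 3

x = 15                # an x defined outside of triple

result = triple(12)   # inside this call, the parameter x is 12
print(result)         # 36
print(x)              # 15: the outer x is unchanged
```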
Functions can have any number of arguments. So far, we've created a function that takes two arguments, harmonic_mean, and a function that takes one argument, triple.
greeting takes no arguments!
def greeting():
    return 'Hi! 👋'
greeting()
'Hi! 👋'
The body of a function is not run until you use (call) the function.
Here, we can define where_is_the_error without seeing an error message.
def where_is_the_error(something):
    '''You can describe your function within triple quotes. For example, this function
    illustrates that errors don't occur until functions are executed (called).'''
    return (1 / 0) + something
It is only when we call where_is_the_error that Python gives us an error message.
where_is_the_error(5)
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
/var/folders/28/vs8cp38n1r1520g8bhzr4v5h0000gn/T/ipykernel_51242/3423408763.py in <module>
----> 1 where_is_the_error(5)

/var/folders/28/vs8cp38n1r1520g8bhzr4v5h0000gn/T/ipykernel_51242/1703529954.py in where_is_the_error(something)
      2     '''You can describe your function within triple quotes. For example, this function
      3     illustrates that errors don't occur until functions are executed (called).'''
----> 4     return (1 / 0) + something

ZeroDivisionError: division by zero
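Another way to see that a function's body runs only when the function is called is to have the body leave a visible trace. In this minimal sketch, the list calls and the function record are hypothetical names made up for the illustration.

```python
calls = []  # a log of every time the body of record actually runs

def record(value):
    calls.append(value)   # this line runs only when record is called
    return value

# Defining record did not run its body, so nothing has been logged yet.
print(len(calls))   # 0

record(5)           # calling the function runs the body once
print(len(calls))   # 1
```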
first_name
Let's create a function called first_name that takes in someone's full name and returns their first name. Example behavior is shown below.
>>> first_name('Pradeep Khosla')
'Pradeep'
Hint: Use the string method .split.
General strategy for writing functions: first get the behavior working on a single example, then wrap that code in a function.
'Pradeep Khosla'.split(' ')[0]
'Pradeep'
def first_name(full_name):
    '''Returns the first name given a full name.'''
    return full_name.split(' ')[0]
first_name('Pradeep Khosla')
'Pradeep'
# What if there are three names?
first_name('Chancellor Pradeep Khosla')
'Chancellor'
The return keyword specifies what the output of your function should be, i.e. what a call to your function will evaluate to.
Most functions we write will use return, but using return is not strictly required.
If you want to use a function's output later in your notebook, remember to return!
Be careful: print and return work differently!
def pythagorean(a, b):
    '''Computes the hypotenuse length of a triangle with legs a and b.'''
    c = (a ** 2 + b ** 2) ** 0.5
    print(c)
x = pythagorean(3, 4)
5.0
# No output – why?
x
# Errors – why?
x + 10
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/var/folders/28/vs8cp38n1r1520g8bhzr4v5h0000gn/T/ipykernel_51242/3305400239.py in <module>
      1 # Errors – why?
----> 2 x + 10

TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
def better_pythagorean(a, b):
    '''Computes the hypotenuse length of a triangle with legs a and b,
    and actually returns the result.
    '''
    c = (a ** 2 + b ** 2) ** 0.5
    return c
x = better_pythagorean(3, 4)
x
5.0
x + 10
15.0
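The difference can be summarized in a few lines. In this sketch, print_sum and return_sum are hypothetical functions written only to contrast the two behaviors: a function with no return statement evaluates to None, while a function that returns hands its result back to the caller.

```python
def print_sum(a, b):
    print(a + b)       # displays the sum, but the call evaluates to None

def return_sum(a, b):
    return a + b       # the call evaluates to the sum

printed = print_sum(3, 4)     # shows 7 on screen
returned = return_sum(3, 4)   # shows nothing, but stores 7

print(printed is None)        # True
print(returned + 10)          # 17
```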
Once a function executes a return statement, it stops running.
def motivational(quote):
    return 0
    print("Here's a motivational quote:", quote)
motivational('Fall seven times and stand up eight.')
0
The DataFrame roster contains the names and lecture sections of all students enrolled in DSC 10 this quarter. The first names are real, while the last names have been anonymized for privacy.
roster = bpd.read_csv('data/roster-anon.csv')
roster
| | name | section |
---|---|---|
0 | Kavya Fquroe | 10AM |
1 | Victoria Yppmzx | 10AM |
2 | An-Chi Tmbqlr | 8AM |
... | ... | ... |
522 | Mehri Osrvjq | 9AM |
523 | Noah Byphhr | 9AM |
524 | Emily Hchqii | 9AM |
525 rows × 2 columns
What is the most common first name among DSC 10 students? (Any guesses?)
roster
| | name | section |
---|---|---|
0 | Kavya Fquroe | 10AM |
1 | Victoria Yppmzx | 10AM |
2 | An-Chi Tmbqlr | 8AM |
... | ... | ... |
522 | Mehri Osrvjq | 9AM |
523 | Noah Byphhr | 9AM |
524 | Emily Hchqii | 9AM |
525 rows × 2 columns
The roster has no column of first names; they are embedded in the 'name' column.
Using the first_name function
Somehow, we need to call first_name on every student's 'name'.
roster
| | name | section |
---|---|---|
0 | Kavya Fquroe | 10AM |
1 | Victoria Yppmzx | 10AM |
2 | An-Chi Tmbqlr | 8AM |
... | ... | ... |
522 | Mehri Osrvjq | 9AM |
523 | Noah Byphhr | 9AM |
524 | Emily Hchqii | 9AM |
525 rows × 2 columns
roster.get('name').iloc[0]
'Kavya Fquroe'
first_name(roster.get('name').iloc[0])
'Kavya'
first_name(roster.get('name').iloc[1])
'Victoria'
Ideally, there's a better solution than doing this hundreds of times...
.apply
To apply the function func_name to every element of column 'col' in DataFrame df, use
df.get('col').apply(func_name)
The .apply method is a Series method.
We use .apply on Series, not DataFrames.
The output of .apply is also a Series.
Pass just the name of the function; don't call it! Good: .apply(first_name). Bad: .apply(first_name()).
roster.get('name')
0         Kavya Fquroe
1      Victoria Yppmzx
2        An-Chi Tmbqlr
            ...
522       Mehri Osrvjq
523        Noah Byphhr
524       Emily Hchqii
Name: name, Length: 525, dtype: object
roster.get('name').apply(first_name)
0         Kavya
1      Victoria
2        An-Chi
         ...
522       Mehri
523        Noah
524       Emily
Name: name, Length: 525, dtype: object
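Conceptually, .apply(first_name) calls first_name once on each element and collects the results. The plain-Python sketch below mimics that idea on a short list; it is not how the library is implemented, and the list names is a small stand-in built from the roster preview above.

```python
def first_name(full_name):
    '''Returns the first name given a full name.'''
    return full_name.split(' ')[0]

# A small stand-in for the 'name' column.
names = ['Kavya Fquroe', 'Victoria Yppmzx', 'An-Chi Tmbqlr']

# Call first_name once per element and collect the results.
firsts = [first_name(name) for name in names]
print(firsts)

# We hand over the function itself (no parentheses), just as with .apply.
# Writing first_name() instead would try to call it with no argument
# and raise a TypeError before any element was processed.
same = list(map(first_name, names))
print(same == firsts)
```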
roster = roster.assign(
    first=roster.get('name').apply(first_name)
)
roster
| | name | section | first |
---|---|---|---|
0 | Kavya Fquroe | 10AM | Kavya |
1 | Victoria Yppmzx | 10AM | Victoria |
2 | An-Chi Tmbqlr | 8AM | An-Chi |
... | ... | ... | ... |
522 | Mehri Osrvjq | 9AM | Mehri |
523 | Noah Byphhr | 9AM | Noah |
524 | Emily Hchqii | 9AM | Emily |
525 rows × 3 columns
Now that we have a column containing first names, we can find the distribution of first names.
name_counts = (
    roster
    .groupby('first')
    .count()
    .sort_values('name', ascending=False)
    .get(['name'])
)
name_counts
| | name |
---|---|
first | |
Kevin | 7 |
Daniel | 6 |
Ryan | 6 |
... | ... |
Hengyu | 1 |
Heeju | 1 |
Zubin | 1 |
438 rows × 1 columns
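The grouping-and-counting step above can also be sketched in plain Python with collections.Counter, which may help clarify what the distribution of first names means. The list first_names below is made-up sample data, not the real roster.

```python
from collections import Counter

# Made-up first names, standing in for the 'first' column.
first_names = ['Kevin', 'Daniel', 'Kevin', 'Ryan', 'Daniel', 'Kevin']

# Count how many times each first name appears.
counts = Counter(first_names)

# most_common() lists names from most to least frequent,
# similar in spirit to sorting the counts in descending order.
print(counts.most_common())
```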
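Below:
Create a bar chart showing the number of students with each first name, including only first names shared by at least two students.
Determine the proportion of students in DSC 10 whose first name is shared by at least two students.
Hint: Start by defining a DataFrame with only the names in name_counts that appeared at least twice. You can use this DataFrame to answer both questions.
shared_names = name_counts[name_counts.get('name') >= 2]
# Bar chart.
shared_names.sort_values('name').plot(kind='barh', y='name');
# Proportion = # students with a shared name / total # of students.
shared_names.get('name').sum() / roster.shape[0]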
.apply works with built-in functions, too!
name_counts.get('name')
first
Kevin     7
Daniel    6
Ryan      6
         ..
Hengyu    1
Heeju     1
Zubin     1
Name: name, Length: 438, dtype: int64
# Not necessarily meaningful, but doable.
name_counts.get('name').apply(np.log)
first
Kevin     1.95
Daniel    1.79
Ryan      1.79
          ...
Hengyu    0.00
Heeju     0.00
Zubin     0.00
Name: name, Length: 438, dtype: float64
In name_counts, first names are stored in the index, which is not a Series. This means we can't use .apply on it.
name_counts.index
Index(['Kevin', 'Daniel', 'Ryan', 'Matthew', 'Eric', 'Brian', 'Nathan', 'Jennifer', 'Karina', 'Andrew', ... 'Isaac', 'Ifra', 'Hyunbin', 'Huiting', 'Hongyu', 'Hien', 'Henry', 'Hengyu', 'Heeju', 'Zubin'], dtype='object', name='first', length=438)
name_counts.index.apply(max)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/var/folders/28/vs8cp38n1r1520g8bhzr4v5h0000gn/T/ipykernel_51242/1905262767.py in <module>
----> 1 name_counts.index.apply(max)

AttributeError: 'Index' object has no attribute 'apply'
To help, we can use .reset_index() to turn the index of a DataFrame into a column, and to reset the index back to the default of 0, 1, 2, 3, and so on.
# What is the max of an individual string?
name_counts.reset_index().get('first').apply(max)
0      v
1      n
2      y
      ..
435    y
436    u
437    u
Name: first, Length: 438, dtype: object
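To see where those single letters come from, it helps to call max on individual strings. Python compares characters by their Unicode code points, so max of a string is the character that sorts last, and every lowercase letter sorts after every uppercase letter. The names below are taken from the output above.

```python
# max of a string returns its "largest" character.
print(max('Kevin'))    # v
print(max('Daniel'))   # n
print(max('Ryan'))     # y
print(max('Zubin'))    # u

# Uppercase letters have smaller code points than lowercase ones,
# which is why the capital 'Z' in 'Zubin' is not the max.
print(ord('Z'), ord('u'))
```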
'Kavya Fquroe' wants to see if there's another 'Kavya' in their section. Strategy:
Which section is 'Kavya Fquroe' in?
How many students in that section have the first name 'Kavya'?
roster
| | name | section | first |
---|---|---|---|
0 | Kavya Fquroe | 10AM | Kavya |
1 | Victoria Yppmzx | 10AM | Victoria |
2 | An-Chi Tmbqlr | 8AM | An-Chi |
... | ... | ... | ... |
522 | Mehri Osrvjq | 9AM | Mehri |
523 | Noah Byphhr | 9AM | Noah |
524 | Emily Hchqii | 9AM | Emily |
525 rows × 3 columns
which_section = roster[roster.get('name') == 'Kavya Fquroe'].get('section').iloc[0]
which_section
'10AM'
first_cond = roster.get('first') == 'Kavya' # A Boolean Series!
section_cond = roster.get('section') == which_section # A Boolean Series!
how_many = roster[first_cond & section_cond].shape[0]
how_many
1
shared_first_and_section
Let's create a function named shared_first_and_section. It will take in the full name of a student and return the number of students in their section with the same first name (including them).
Note: This is the first function we're writing that involves using a DataFrame within the function – this is fine!
def shared_first_and_section(name):
    # First, find the row corresponding to that full name in roster.
    # We're assuming that full names are unique.
    row = roster[roster.get('name') == name]
    # Then, get that student's first name and section.
    first = row.get('first').iloc[0]
    section = row.get('section').iloc[0]
    # Now, find all the students with the same first name and section.
    shared_info = roster[(roster.get('first') == first) & (roster.get('section') == section)]
    # Return the number of such students.
    return shared_info.shape[0]
shared_first_and_section('Kavya Fquroe')
1
# This means that there are 4 Kevins, including Kevin Pxssmm, in whichever section Kevin Pxssmm is in!
shared_first_and_section('Kevin Pxssmm')
4
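The logic inside shared_first_and_section can also be sketched without a DataFrame, which may make the two steps (look up the student, then count the matches) easier to follow. Everything here is hypothetical: the rows are made-up students, and shared_first_and_section_plain is a stand-in name, not the lecture's function.

```python
# Made-up roster rows; each dictionary plays the role of one DataFrame row.
rows = [
    {'name': 'Kevin A', 'section': '9AM', 'first': 'Kevin'},
    {'name': 'Kevin B', 'section': '9AM', 'first': 'Kevin'},
    {'name': 'Kevin C', 'section': '1PM', 'first': 'Kevin'},
    {'name': 'Ryan D',  'section': '9AM', 'first': 'Ryan'},
]

def shared_first_and_section_plain(name, rows):
    # Step 1: find the student's own row (assuming full names are unique).
    row = [r for r in rows if r['name'] == name][0]
    # Step 2: keep rows matching BOTH the first name and the section,
    # mirroring the & of the two Boolean conditions.
    matches = [r for r in rows
               if r['first'] == row['first'] and r['section'] == row['section']]
    # Step 3: count them (the student is included).
    return len(matches)

print(shared_first_and_section_plain('Kevin A', rows))
print(shared_first_and_section_plain('Kevin C', rows))
```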
Now, let's add a column to roster that contains the values returned by shared_first_and_section.
roster = roster.assign(shared=roster.get('name').apply(shared_first_and_section))
roster
| | name | section | first | shared |
---|---|---|---|---|
0 | Kavya Fquroe | 10AM | Kavya | 1 |
1 | Victoria Yppmzx | 10AM | Victoria | 1 |
2 | An-Chi Tmbqlr | 8AM | An-Chi | 1 |
... | ... | ... | ... | ... |
522 | Mehri Osrvjq | 9AM | Mehri | 1 |
523 | Noah Byphhr | 9AM | Noah | 1 |
524 | Emily Hchqii | 9AM | Emily | 1 |
525 rows × 4 columns
Let's find all of the students who are in a section with someone that has the same first name as them.
roster[(roster.get('shared') >= 2)].sort_values('shared', ascending=False)
| | name | section | first | shared |
---|---|---|---|---|
57 | Kevin Rqqbja | 9AM | Kevin | 4 |
332 | Kevin Wvghxc | 9AM | Kevin | 4 |
106 | Kevin Pxssmm | 9AM | Kevin | 4 |
... | ... | ... | ... | ... |
301 | Andrew Fukocl | 1PM | Andrew | 2 |
303 | Brian Riqekv | 9AM | Brian | 2 |
514 | Nathan Arlyoy | 1PM | Nathan | 2 |
50 rows × 4 columns
We can narrow this down to a particular lecture section if we'd like.
one_section_only = (
    roster[(roster.get('shared') >= 2) &
           (roster.get('section') == '1PM')]
    .sort_values('shared', ascending=False)
)
one_section_only
| | name | section | first | shared |
---|---|---|---|---|
283 | Ryan Ozjjjx | 1PM | Ryan | 3 |
286 | Ryan Ohuhmg | 1PM | Ryan | 3 |
349 | Ryan Mhwpch | 1PM | Ryan | 3 |
... | ... | ... | ... | ... |
404 | Michael Qxcbll | 1PM | Michael | 2 |
472 | Jeremy Yktcwc | 1PM | Jeremy | 2 |
514 | Nathan Arlyoy | 1PM | Nathan | 2 |
15 rows × 4 columns
For instance, the above DataFrame preview is telling us that there are 3 Ryans, 2 Michaels, 2 Jeremys, and 2 Nathans in the 1PM section of DSC 10.
# All of the names shared by multiple students in the 1PM section.
one_section_only.get('first').unique()
array(['Ryan', 'Pranav', 'Andrew', 'Peter', 'Nathan', 'Jeremy', 'Michael'], dtype=object)
While the DataFrames on the previous slide contain the info we were looking for, they're not organized very conveniently. For instance, there are three rows containing the fact that there are 3 Ryans in the 1PM lecture section.
Wouldn't it be great if we could create a DataFrame like the one below? We'll see how on Wednesday!
| | section | first | name | shared |
---|---|---|---|---|
0 | 8AM | Daniel | 3 | 3 |
1 | 1PM | Danielle | 1 | 1 |
2 | 10AM | Eric | 2 | 2 |
3 | 10AM | Ethan | 3 | 3 |
4 | 10AM | Justin | 3 | 3 |
5 | 9AM | Kevin | 4 | 4 |
Find the longest first name in the class that is shared by at least two students in the same section.
Hint: You'll have to use both .assign and .apply.
with_len = roster.assign(name_len=roster.get('first').apply(len))
with_len[with_len.get('shared') >= 2].sort_values('name_len', ascending=False).get('first').iloc[0]
The .apply method allows us to call a function on every single element of a Series, which usually comes from .getting a column of a DataFrame.
Next time: More advanced DataFrame manipulations!