# Run this cell to set up packages for lecture.
from lec08_imports import *
- Discussion is today.
- The Midterm Project is coming out soon. Try to find a partner! 👯‍♀️
- Working with a partner is optional but highly recommended. You can work with a partner from any lecture section. You must follow these project partner guidelines.
- Need a partner? Come to the mixer event tonight at 6PM right outside the discussion section.
- Lab 2 is due Thursday at 11:59PM.
- Homework 2 is due Sunday at 11:59PM.
- Sign up for a quiz review appointment one-on-one with a tutor!
- Functions.
- Applying functions to DataFrames.
- Example: Student names.
*Reminder:* Use the DSC 10 Reference Sheet.
Defining functions¶
- We've learned how to do quite a bit in Python:
- Manipulate arrays, Series, and DataFrames.
- Perform operations on strings.
- Create visualizations.
- But so far, we've been restricted to using existing functions and methods (e.g. .groupby).
Suppose you drive to a restaurant 🥘 in LA, located exactly 100 miles away.
- For the first 50 miles, you drive at 80 miles per hour.
- For the last 50 miles, you drive at 60 miles per hour.
- Question: What is your average speed throughout the journey?
- 🚨 The answer is not 70 miles per hour! Remember, from Homework 1, you need to use the fact that $\text{speed} = \frac{\text{distance}}{\text{time}}$.
In segment 1, when you drove 50 miles at 80 miles per hour, you drove for $\frac{50}{80}$ hours:
$$\text{speed}_1 = \frac{\text{distance}_1}{\text{time}_1} \implies \text{time}_1 = \frac{\text{distance}_1}{\text{speed}_1} = \frac{50}{80} \text{ hours}$$
Similarly, in segment 2, when you drove 50 miles at 60 miles per hour, you drove for $\text{time}_2 = \frac{50}{60} \text{ hours}$.
The average speed is the total distance divided by the total time:
$$\text{average speed} = \frac{50 + 50}{\frac{50}{80} + \frac{50}{60}} \text{ miles per hour} $$
Example: Harmonic mean¶
The harmonic mean ($\text{HM}$) of two positive numbers, $a$ and $b$, is defined as
$$\text{HM} = \frac{2}{\frac{1}{a} + \frac{1}{b}}$$
It is often used to find the average of multiple rates.
Finding the harmonic mean of 80 and 60 is not hard:
2 / (1 / 80 + 1 / 60)
But what if we want to find the harmonic mean of 80 and 70? 80 and 90? 20 and 40? This would require a lot of copy-pasting, which is prone to error.
It turns out that we can define our own "harmonic mean" function just once, and re-use it multiple times.
def harmonic_mean(a, b):
    return 2 / (1 / a + 1 / b)
harmonic_mean(80, 60)
harmonic_mean(20, 40)
Note that we only had to specify how to calculate the harmonic mean once!
Functions are a way to divide our code into small subparts to prevent us from writing repetitive code. Each time we define our own function in Python, we will use the following pattern.
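As a minimal sketch of that pattern, the annotated example below labels each part of a function definition. The helper average_speed and its parameter names are illustrative assumptions, not part of the lecture; it recomputes the restaurant example as total distance over total time and compares the result with harmonic_mean.

```python
# Pattern: def <name>(<parameters>):  followed by an indented body that ends in a return.
def average_speed(speed_1, speed_2):
    '''Average speed over two 50-mile segments (hypothetical helper).'''
    distance = 50                                  # miles per segment, as in the restaurant example
    total_time = distance / speed_1 + distance / speed_2
    return (2 * distance) / total_time             # total distance / total time


def harmonic_mean(a, b):
    return 2 / (1 / a + 1 / b)


# Both calls should print the same value, just under 70 miles per hour.
print(average_speed(80, 60))
print(harmonic_mean(80, 60))
```

The two results agree because the 50s cancel in the average speed formula, which leaves exactly the harmonic mean of the two speeds.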
Functions are "recipes"¶
- Functions take in inputs, known as arguments, do something, and produce some outputs.
- The beauty of functions is that you don't need to know how they are implemented in order to use them!
- For instance, you've been using existing functions without knowing how they are implemented.
- This is the premise of the idea of abstraction in computer science – you'll hear a lot about this if you take DSC 20.
harmonic_mean(20, 40)
harmonic_mean(79, 894)
harmonic_mean(-2, 4)
Parameters and arguments¶
triple has one parameter, x.
def triple(x):
    return x * 3
When we call triple with the argument 5, within the body of triple, x means 5.
We can change the argument we call triple with – we can even call it with strings!
triple(7 + 8)
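A short sketch of calling triple with different kinds of arguments. The specific argument values below are chosen for illustration only.

```python
def triple(x):
    return x * 3


print(triple(5))        # 15
print(triple(7 + 8))    # the argument is evaluated first, so this is triple(15): 45
print(triple('ha'))     # multiplying a string repeats it: hahaha
```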
Scope 🩺¶
The names you choose for a function’s parameters are only known to that function (known as local scope). The rest of your notebook is unaffected by parameter names.
def triple(x):
    return x * 3
Since we haven't defined an x outside of the body of triple, our notebook doesn't know what x means.
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[57], line 1
----> 1 x

NameError: name 'x' is not defined
We can define an x outside of the body of triple, but that doesn't change how triple works.
x = 15
# When triple(12) is called, you can pretend
# there's an invisible line inside the body of triple
# that says x = 12.
# The x = 15 above is ignored.
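The following sketch checks both directions of local scope: the outer x does not affect the call, and the call does not change the outer x. The variable name result is an illustrative addition.

```python
def triple(x):
    return x * 3


x = 15
result = triple(12)   # inside the body, x means 12, not 15

print(result)         # 36
print(x)              # 15: the x outside the function is unchanged
```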
Functions can take 0 or more arguments¶
Functions can take any number of arguments. So far, we've created a function with two arguments, harmonic_mean, and a function with one argument, triple. The function greeting, defined below, takes no arguments!
def greeting():
    return 'Hi! 👋'
'Hi! 👋'
Functions don't run until you call them!¶
The body of a function is not run until you use (call) the function.
Here, we can define where_is_the_error without seeing an error message.
def where_is_the_error(something):
    '''You can describe your function within triple quotes. For example, this function
    illustrates that errors don't occur until functions are executed (called).'''
    return (1 / 0) + something
It is only when we call where_is_the_error that Python gives us an error message.
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
Cell In[71], line 1
----> 1 where_is_the_error(5)

Cell In[68], line 4, in where_is_the_error(something)
      1 def where_is_the_error(something):
      2     '''You can describe your function within triple quotes. For example, this function
      3     illustrates that errors don't occur until functions are executed (called).'''
----> 4     return (1 / 0) + something

ZeroDivisionError: division by zero
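One way to confirm this behavior without stopping the notebook is sketched below. It uses try/except, which this lecture does not cover, so treat it as an optional illustration; the variable raised is a hypothetical name.

```python
def where_is_the_error(something):
    '''Defining this function runs fine, even though its body divides by zero.'''
    return (1 / 0) + something


# The definition above produced no error. The error only appears on a call.
try:
    where_is_the_error(5)
    raised = False
except ZeroDivisionError:
    raised = True

print(raised)   # True
```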
Example: first_name
Let's create a function called first_name that takes in someone's full name and returns their first name. Example behavior is shown below.
>>> first_name('Pradeep Khosla')
'Pradeep'
Hint: Use the string method .split.
General strategy for writing functions:
- First, try to get the behavior to work on a single example.
- Then, encapsulate that behavior inside a function.
'Pradeep Khosla'.split(' ')[0]
def first_name(full_name):
    '''Returns the first name given a full name.'''
    return full_name.split(' ')[0]
first_name('Pradeep Khosla')
# What if there are three names?
first_name('Chancellor Pradeep Khosla')
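To see why the three-name case behaves the way it does, this sketch looks at the list that .split produces before [0] is taken. The variable name parts is an illustrative addition.

```python
def first_name(full_name):
    '''Returns the first name given a full name.'''
    return full_name.split(' ')[0]


parts = 'Chancellor Pradeep Khosla'.split(' ')
print(parts)                                    # ['Chancellor', 'Pradeep', 'Khosla']
print(first_name('Chancellor Pradeep Khosla'))  # Chancellor
```

The function always returns whatever comes before the first space, so a title at the front of the string is treated as the first name.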
- The return keyword specifies what the output of your function should be, i.e. what a call to your function will evaluate to.
- Most functions we write will use return, but using return is not strictly required.
- If you want to be able to save the output of your function to a variable, you must use return.
- Be careful: printing a value and returning it work differently!
def pythagorean(a, b):
    '''Computes the hypotenuse length of a right triangle with legs a and b.'''
    c = (a ** 2 + b ** 2) ** 0.5
x = pythagorean(3, 4)
# No output – why?
x
# Errors – why?
x + 10
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[90], line 2
      1 # Errors – why?
----> 2 x + 10

TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
def better_pythagorean(a, b):
    '''Computes the hypotenuse length of a right triangle with legs a and b,
    and actually returns the result.'''
    c = (a ** 2 + b ** 2) ** 0.5
    return c
x = better_pythagorean(3, 4)
x + 10
Once a function executes a return statement, it stops running.
def motivational(quote):
    return 0
    print("Here's a motivational quote:", quote)
motivational('Fall seven times and stand up eight.')
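The difference between displaying a value and returning it can be sketched with two small functions. The names print_square and return_square are hypothetical and not from the lecture.

```python
def print_square(n):
    print(n * n)      # displays the value, but the call evaluates to None


def return_square(n):
    return n * n      # the call evaluates to the value itself


a = print_square(4)   # displays 16
b = return_square(4)  # displays nothing

print(a)              # None
print(b + 1)          # 17
```

Only b can be used in later arithmetic; a + 1 would raise the same kind of TypeError seen with pythagorean.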
Applying functions to DataFrames¶
DSC 10 student data¶
The DataFrame roster contains the names and lecture sections of all students enrolled in DSC 10 this quarter. The first names are real, while the last names have been anonymized for privacy.
roster = bpd.read_csv('data/roster-anon.csv')
 | name | section |
0 | Jacob Rrrkga | 10AM |
1 | Steven Dhixba | 10AM |
2 | Rundong Qphuob | 9AM |
... | ... | ... |
439 | Shreyesh Tlbgcn | 1PM |
440 | Sally Tthnjf | 10AM |
441 | Lawrence Zdeaft | 10AM |
442 rows × 2 columns
Example: Common first names¶
What is the most common first name among DSC 10 students? (Any guesses?)
 | name | section |
0 | Jacob Rrrkga | 10AM |
1 | Steven Dhixba | 10AM |
2 | Rundong Qphuob | 9AM |
... | ... | ... |
439 | Shreyesh Tlbgcn | 1PM |
440 | Sally Tthnjf | 10AM |
441 | Lawrence Zdeaft | 10AM |
442 rows × 2 columns
- Problem: We can't answer that right now, since we don't have a column with first names. If we did, we could group by it.
- Solution: Use our function that extracts first names on every element of the 'name' column.
Using our first_name function
Somehow, we need to call first_name on every student's 'name'.
 | name | section |
0 | Jacob Rrrkga | 10AM |
1 | Steven Dhixba | 10AM |
2 | Rundong Qphuob | 9AM |
... | ... | ... |
439 | Shreyesh Tlbgcn | 1PM |
440 | Sally Tthnjf | 10AM |
441 | Lawrence Zdeaft | 10AM |
442 rows × 2 columns
'Jacob Rrrkga'
Ideally, there's a better solution than doing this hundreds of times...
- To apply a function to every element of column 'col' in DataFrame df, get the column and call .apply on it with the function's name, as in df.get('col').apply(function_name).
- The .apply method is a Series method.
- Important: We use .apply on Series, not DataFrames.
- The output of .apply is also a Series.
- Pass just the name of the function – don't call it!
- Good ✅: .apply(first_name).
- Bad ❌: .apply(first_name()).
0         Jacob Rrrkga
1        Steven Dhixba
2       Rundong Qphuob
             ...
439    Shreyesh Tlbgcn
440       Sally Tthnjf
441    Lawrence Zdeaft
Name: name, Length: 442, dtype: object

0         Jacob
1        Steven
2       Rundong
         ...
439    Shreyesh
440       Sally
441    Lawrence
Name: name, Length: 442, dtype: object
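A self-contained sketch of the same idea on a three-element Series. It assumes plain pandas as a stand-in for babypandas, and the Series names is built by hand from the first three rows of the preview above rather than read from the roster file.

```python
import pandas as pd


def first_name(full_name):
    '''Returns the first name given a full name.'''
    return full_name.split(' ')[0]


# A tiny hand-made Series standing in for roster.get('name').
names = pd.Series(['Jacob Rrrkga', 'Steven Dhixba', 'Rundong Qphuob'])

# Pass the function itself (no parentheses); .apply calls it once per element.
firsts = names.apply(first_name)

print(firsts.tolist())   # ['Jacob', 'Steven', 'Rundong']
```

Writing names.apply(first_name()) instead would fail before .apply even runs, because first_name() is a call with a missing argument.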
Example: Common first names¶
roster = roster.assign(first=roster.get('name').apply(first_name))
 | name | section | first |
0 | Jacob Rrrkga | 10AM | Jacob |
1 | Steven Dhixba | 10AM | Steven |
2 | Rundong Qphuob | 9AM | Rundong |
... | ... | ... | ... |
439 | Shreyesh Tlbgcn | 1PM | Shreyesh |
440 | Sally Tthnjf | 10AM | Sally |
441 | Lawrence Zdeaft | 10AM | Lawrence |
442 rows × 3 columns
Now that we have a column containing first names, we can find the distribution of first names.
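# Reconstructed from the output below: count students per first name, keeping one count column.
name_counts = (
    roster.groupby('first').count().get(['name'])
    .sort_values('name', ascending=False)
)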
first | name |
Ryan | 9 |
Kyle | 5 |
Matthew | 4 |
... | ... |
Isabela | 1 |
Isabel | 1 |
Zonglin | 1 |
382 rows × 1 columns
- Create a bar chart showing the number of students with each first name, but only include first names shared by at least two students.
- Determine the proportion of students in DSC 10 who have a first name that is shared by at least two students.
Hint: Start by defining a DataFrame with only the names in name_counts that appeared at least twice. You can use this DataFrame to answer both questions.
✅ Click here to see the solutions after you've tried it yourself.
shared_names = name_counts[name_counts.get('name') >= 2]
# Bar chart.
shared_names.sort_values('name').plot(kind='barh', y='name');
# Proportion = # students with a shared name / total # of students.
shared_names.get('name').sum() / roster.shape[0]
.apply works with built-in functions, too!¶
first
Ryan       9
Kyle       5
Matthew    4
          ..
Isabela    1
Isabel     1
Zonglin    1
Name: name, Length: 382, dtype: int64
# Not necessarily meaningful, but doable.
first
Ryan       2.20
Kyle       1.61
Matthew    1.39
           ...
Isabela    0.00
Isabel     0.00
Zonglin    0.00
Name: name, Length: 382, dtype: float64
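The values in the second output (2.20 for a count of 9, 1.61 for 5, 1.39 for 4, 0.00 for 1) are consistent with a natural logarithm. The sketch below assumes np.log was the function applied and uses plain pandas with a small hand-made Series; counts and lengths are illustrative names.

```python
import numpy as np
import pandas as pd

# A few of the counts from the output above, entered by hand.
counts = pd.Series([9, 5, 4, 1], index=['Ryan', 'Kyle', 'Matthew', 'Zonglin'])

# An existing function such as np.log can be passed to .apply just like our own functions.
logs = counts.apply(np.log)
print(logs.round(2))

# The built-in len works too, here on a Series of strings.
lengths = pd.Series(['Ryan', 'Matthew']).apply(len)
print(lengths.tolist())   # [4, 7]
```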
Aside: Resetting the index¶
In name_counts, first names are stored in the index, which is not a Series. This means we can't use .apply on it.
Index(['Ryan', 'Kyle', 'Matthew', 'Dylan', 'Brian', 'Gabriel', 'Calvin', 'Kevin', 'Michael', 'Jacob', ... 'Iris', 'Ivan', 'Ishita', 'Ishayu', 'Ishaan', 'Isaiah', 'Isabella', 'Isabela', 'Isabel', 'Zonglin'], dtype='object', name='first', length=382)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[149], line 1
----> 1 name_counts.index.apply(max)

AttributeError: 'Index' object has no attribute 'apply'
To help, we can use .reset_index() to turn the index of a DataFrame into a column, and to reset the index back to the default of 0, 1, 2, 3, and so on.
# What is the max of an individual string?
0      y
1      y
2      w
      ..
379    s
380    s
381    o
Name: first, Length: 382, dtype: object
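A small sketch of what .reset_index() does, again assuming plain pandas in place of babypandas. The DataFrame toy_counts is hand-made from the top three rows of name_counts, and flat and max_letters are illustrative names.

```python
import pandas as pd

# Hand-made stand-in for name_counts: first names in the index, counts in 'name'.
toy_counts = pd.DataFrame(
    {'name': [9, 5, 4]},
    index=pd.Index(['Ryan', 'Kyle', 'Matthew'], name='first'),
)

# The index becomes a regular column called 'first'; the new index is 0, 1, 2.
flat = toy_counts.reset_index()
print(list(flat.columns))   # ['first', 'name']

# Now 'first' is a column, so we can get it as a Series and use .apply.
# max of a string is its alphabetically last character (lowercase sorts after uppercase).
max_letters = flat.get('first').apply(max)
print(max_letters.tolist())  # ['y', 'y', 'w']
```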
Example: Shared first names and sections¶
- Suppose you're one of the $\approx$22% of students in DSC 10 who has a first name that is shared with at least one other student.
- Let's try to determine whether someone in your lecture section shares the same first name as you.
- For example, maybe 'Jacob Rrrkga' wants to see if there's another 'Jacob' in their section.
- Which section is 'Jacob Rrrkga' in?
- How many people in that section have a first name of 'Jacob'?
 | name | section | first |
0 | Jacob Rrrkga | 10AM | Jacob |
1 | Steven Dhixba | 10AM | Steven |
2 | Rundong Qphuob | 9AM | Rundong |
... | ... | ... | ... |
439 | Shreyesh Tlbgcn | 1PM | Shreyesh |
440 | Sally Tthnjf | 10AM | Sally |
441 | Lawrence Zdeaft | 10AM | Lawrence |
442 rows × 3 columns
which_section = roster[roster.get('name') == 'Jacob Rrrkga'].get('section').iloc[0]
first_cond = roster.get('first') == 'Jacob' # A Boolean Series!
section_cond = roster.get('section') == which_section # A Boolean Series!
how_many = roster[first_cond & section_cond].shape[0]
Another function: shared_first_and_section
Let's create a function named shared_first_and_section. It will take in the full name of a student and return the number of students in that student's section who have the same first name (including the student).
Note: This is the first function we're writing that involves using a DataFrame within the function – this is fine!
def shared_first_and_section(name):
    # First, find the row corresponding to that full name in roster.
    # We're assuming that full names are unique.
    row = roster[roster.get('name') == name]
    # Then, get that student's first name and section.
    first = row.get('first').iloc[0]
    section = row.get('section').iloc[0]
    # Now, find all the students with the same first name and section.
    shared_info = roster[(roster.get('first') == first) & (roster.get('section') == section)]
    # Return the number of such students.
    return shared_info.shape[0]
shared_first_and_section('Jacob Rrrkga')
# This means there is no other Rundong in the same section as Rundong Qphuob.
shared_first_and_section('Rundong Qphuob')
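To check the logic on data small enough to verify by eye, here is a sketch on a four-row toy roster. It assumes plain pandas in place of babypandas; the names in toy and the function name shared_in_toy are made up for illustration and do not come from the real roster.

```python
import pandas as pd


def first_name(full_name):
    return full_name.split(' ')[0]


# Made-up data: three students named Ada, two of them in the 9AM section.
toy = pd.DataFrame({
    'name': ['Ada Aaa', 'Ada Bbb', 'Ada Ccc', 'Bo Ddd'],
    'section': ['9AM', '9AM', '1PM', '9AM'],
})
toy = toy.assign(first=toy.get('name').apply(first_name))


def shared_in_toy(name):
    '''Same steps as shared_first_and_section, but reading from toy.'''
    row = toy[toy.get('name') == name]
    first = row.get('first').iloc[0]
    section = row.get('section').iloc[0]
    # Both conditions must hold, so the two Boolean Series are combined with &.
    same = toy[(toy.get('first') == first) & (toy.get('section') == section)]
    return same.shape[0]


print(shared_in_toy('Ada Aaa'))   # 2: Ada Aaa and Ada Bbb are both in 9AM
print(shared_in_toy('Ada Ccc'))   # 1: the only Ada in 1PM
print(shared_in_toy('Bo Ddd'))    # 1
```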
Now, let's add a column to roster that contains the values returned by shared_first_and_section.
roster = roster.assign(shared=roster.get('name').apply(shared_first_and_section))
 | name | section | first | shared |
0 | Jacob Rrrkga | 10AM | Jacob | 3 |
1 | Steven Dhixba | 10AM | Steven | 1 |
2 | Rundong Qphuob | 9AM | Rundong | 1 |
... | ... | ... | ... | ... |
439 | Shreyesh Tlbgcn | 1PM | Shreyesh | 1 |
440 | Sally Tthnjf | 10AM | Sally | 1 |
441 | Lawrence Zdeaft | 10AM | Lawrence | 1 |
442 rows × 4 columns
Let's find all of the students who are in a section with someone that has the same first name as them.
roster[(roster.get('shared') >= 2)].sort_values('shared', ascending=False)
 | name | section | first | shared |
177 | Ryan Pzwmaz | 10AM | Ryan | 4 |
19 | Ryan Xythay | 1PM | Ryan | 4 |
396 | Ryan Nmuqdi | 10AM | Ryan | 4 |
... | ... | ... | ... | ... |
117 | Stephanie Aeamxd | 9AM | Stephanie | 2 |
132 | Vincent Sntmqw | 9AM | Vincent | 2 |
297 | Andrew Ucbtjl | 1PM | Andrew | 2 |
52 rows × 4 columns
We can narrow this down to a particular lecture section if we'd like.
one_section_only = (
    roster[(roster.get('shared') >= 2) &
           (roster.get('section') == '9AM')]
    .sort_values('shared', ascending=False)
)
 | name | section | first | shared |
157 | Dylan Vrhpdh | 9AM | Dylan | 3 |
405 | Dylan Mkrwdd | 9AM | Dylan | 3 |
416 | Dylan Mrbwhm | 9AM | Dylan | 3 |
... | ... | ... | ... | ... |
315 | Jaden Nyqohu | 9AM | Jaden | 2 |
349 | Kevin Htltkm | 9AM | Kevin | 2 |
401 | Jaden Udzphm | 9AM | Jaden | 2 |
13 rows × 4 columns
For instance, the above DataFrame preview is telling us that there are 3 Dylans in the 9AM section.
# All of the names shared by multiple students in the 9AM section.
array(['Dylan', 'Joseph', 'Vincent', 'Stephanie', 'Kevin', 'Jaden'], dtype=object)
Sneak peek¶
While the DataFrames on the previous slide contain the info we were looking for, they're not organized very conveniently. For instance, there are three rows containing the fact that there are 3 Dylans in the 9AM lecture section.
Wouldn't it be great if we could create a DataFrame like the one below? We'll see how next time!

Find the longest first name in the class that is shared by at least two students in the same section.
Hint: You'll have to use both .assign and .apply.
✅ Click here to see the answer after you've tried it yourself.
with_len = roster.assign(name_len=roster.get('first').apply(len))
with_len[with_len.get('shared') >= 2].sort_values('name_len', ascending=False).get('first').iloc[0]
Summary, next time¶
- Functions are a way to divide our code into small subparts to prevent us from writing repetitive code.
- The .apply method allows us to call a function on every single element of a Series, which usually comes from .getting a column of a DataFrame.
Next time¶
More advanced DataFrame manipulations!