In [1]:
# Run this cell to set up packages for lecture.
from lec08_imports import *

Lecture 8 – Functions and Applying¶

DSC 10, Winter 2025¶

Agenda¶

  • Functions.
  • Applying functions to DataFrames.
    • Example: Student names.

*Reminder:* Use the DSC 10 Reference Sheet.

Functions¶

Defining functions¶

  • We've learned how to do quite a bit in Python:
    • Manipulate arrays, Series, and DataFrames.
    • Perform operations on strings.
    • Create visualizations.
  • But so far, we've been restricted to using existing functions (e.g. max, np.sqrt, len) and methods (e.g. .groupby, .assign, .plot).

Motivation¶

  • In Homework 1, you made an array containing all the multiples of 10, in ascending order, that appear on the multiplication table below.

[Image: multiplication table]
In [8]:
multiples_of_10 = np.arange(10, 130, 10)
multiples_of_10
Out[8]:
array([ 10,  20,  30,  40,  50,  60,  70,  80,  90, 100, 110, 120])
  • Question: How would you make an array containing all the multiples of 8, in increasing order, that appear on the multiplication table?
In [10]:
multiples_of_8 = np.arange(8, 13*8, 8)
multiples_of_8
Out[10]:
array([ 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96])

More generally¶

What if we want to find the multiples of some other number, k? We can copy-paste and change some numbers, but that is prone to error.

In [12]:
multiples_of_5 = ...
multiples_of_5
Out[12]:
Ellipsis

It turns out that we can define our own "multiples" function just once, and re-use it many times for different values of k. 🔁

In [14]:
def multiples(k):
    '''This function returns the 
    first twelve multiples of k.'''
    return np.arange(k, 13*k, k)
In [15]:
multiples(8)
Out[15]:
array([ 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96])
In [16]:
multiples(5)
Out[16]:
array([ 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60])

Note that we only had to specify how to calculate multiples a single time!
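The 13*k in the body of multiples can look odd at first. As a minimal sketch, assuming np.arange treats integer start and stop values the way Python's built-in range does (start included, stop excluded), the plain-list version below shows why the stop value has to lie just past the twelfth multiple. The name multiples_as_list is hypothetical and is not part of the lecture code.

```python
def multiples_as_list(k):
    '''Plain-Python analogue of multiples, for illustration only.'''
    # range(start, stop, step) includes start but excludes stop,
    # so stopping at 13*k keeps 12*k and leaves out 13*k.
    return list(range(k, 13 * k, k))

eights = multiples_as_list(8)
print(eights[-1])   # the last element is 12 * 8
print(len(eights))  # twelve elements in total
```

Using 12*k as the stop value instead would leave only eleven multiples.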

Functions¶

Functions are a way to divide our code into small subparts to prevent us from writing repetitive code. Each time we define our own function in Python, we will use the following pattern.

In [19]:
show_def()
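As a rough sketch of that pattern: a definition starts with the def keyword, followed by the function's name, its parameters in parentheses, and a colon. The indented lines underneath form the body, which usually ends with return. The names add_and_double, first_number, and second_number are made up for illustration.

```python
def add_and_double(first_number, second_number):
    '''A docstring: a short description of what the function does.'''
    # The indented lines below form the body of the function.
    total = first_number + second_number
    return total * 2

result = add_and_double(3, 4)
print(result)
```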

Functions are "recipes"¶

  • Functions take in inputs, known as arguments, do something, and produce some outputs.
  • The beauty of functions is that you don't need to know how they are implemented in order to use them!
    • For instance, you've been using the function bpd.read_csv without knowing how it works.
    • This is the premise of the idea of abstraction in computer science – you'll hear a lot about this if you take DSC 20.
In [21]:
multiples(7)
Out[21]:
array([ 7, 14, 21, 28, 35, 42, 49, 56, 63, 70, 77, 84])
In [22]:
multiples(-2)
Out[22]:
array([ -2,  -4,  -6,  -8, -10, -12, -14, -16, -18, -20, -22, -24])

Parameters and arguments¶

triple has one parameter, x.

In [24]:
def triple(x):
    return x * 3

When we call triple with the argument 5, within the body of triple, x means 5.

In [26]:
triple(5)
Out[26]:
15

We can change the argument we call triple with – we can even call it with strings!

In [28]:
triple(7 + 8)
Out[28]:
45
In [29]:
triple('triton')
Out[29]:
'tritontritontriton'

Scope 🩺¶

The names you choose for a function’s parameters are only visible inside that function (this is called local scope). The rest of your notebook is unaffected by parameter names.

In [31]:
def triple(x):
    return x * 3
In [32]:
triple(7)
Out[32]:
21

Since we haven't defined an x outside of the body of triple, our notebook doesn't know what x means.

In [34]:
x
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[34], line 1
----> 1 x

NameError: name 'x' is not defined

We can define an x outside of the body of triple, but that doesn't change how triple works.

In [ ]:
x = 15
In [ ]:
# When triple(12) is called, you can pretend
# there's an invisible line inside the body of triple
# that says x = 12.
# The x = 15 above is ignored.
triple(12)
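One way to check this claim is to look at both values after the call. The sketch below repeats the definitions from the cells above so that it runs on its own; the variable name result is introduced only for this check.

```python
def triple(x):
    return x * 3

x = 15               # a notebook-level x, defined outside of triple
result = triple(12)  # during this call, the parameter x refers to 12

print(result)  # 36
print(x)       # 15, unchanged by the call
```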

Functions can take 0 or more arguments¶

Functions can take any number of arguments.

greeting takes no arguments.

In [ ]:
def greeting():
    return 'Hi! 👋'
In [ ]:
greeting()

custom_multiples takes two arguments!

In [ ]:
def custom_multiples(k, how_many):
    '''This function returns the 
    first how_many multiples of k.'''
    return np.arange(k, (how_many + 1)*k, k)
In [ ]:
custom_multiples(10, 7)
In [ ]:
custom_multiples(2, 100)

Functions don't run until you call them!¶

The body of a function is not run until you use (call) the function.

Here, we can define where_is_the_error without seeing an error message.

In [37]:
def where_is_the_error(something):
    '''A function to illustrate that errors don't occur 
    until functions are executed (called).'''
    return (1 / 0) + something

It is only when we call where_is_the_error that Python gives us an error message.

In [39]:
where_is_the_error(5)
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
Cell In[39], line 1
----> 1 where_is_the_error(5)

Cell In[37], line 4, in where_is_the_error(something)
      1 def where_is_the_error(something):
      2     '''A function to illustrate that errors don't occur 
      3     until functions are executed (called).'''
----> 4     return (1 / 0) + something

ZeroDivisionError: division by zero
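The sketch below shows the same behavior in a single runnable block. The try and except statements are not part of this lecture; they are used here only so that the code can continue after the error. The name error_was_raised is hypothetical.

```python
def where_is_the_error(something):
    return (1 / 0) + something

# Reaching this line means the definition itself caused no error.
print('Definition succeeded; the body has not run yet.')

error_was_raised = False
try:
    where_is_the_error(5)  # the body runs only now
except ZeroDivisionError:
    error_was_raised = True

print(error_was_raised)
```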

Example: first_name¶

Let's create a function called first_name that takes in someone's full name and returns their first name. Example behavior is shown below.

>>> first_name('Pradeep Khosla')
'Pradeep'

Hint: Use the string method .split.

General strategy for writing functions:

  1. First, try to get the behavior to work on a single example.
  2. Then, encapsulate that behavior inside a function.
In [42]:
'Pradeep Khosla'.split(' ')[0]
Out[42]:
'Pradeep'
In [43]:
def first_name(full_name):
    '''Returns the first name given a full name.'''
    return full_name.split(' ')[0]
In [44]:
first_name('Pradeep Khosla')
Out[44]:
'Pradeep'
In [45]:
# What if there are three names?
first_name('Chancellor Pradeep Khosla')
Out[45]:
'Chancellor'
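To see why 'Chancellor' comes back, it helps to look at the intermediate list that .split produces. In this short sketch, parts and first_piece are names chosen only for illustration.

```python
# .split(' ') cuts the string at every space and returns a list of pieces.
parts = 'Chancellor Pradeep Khosla'.split(' ')
print(parts)

# first_name returns the piece at position 0, whatever it happens to be.
first_piece = parts[0]
print(first_piece)
```

The function does not know what a first name is. It only returns whatever appears before the first space.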

Returning¶

  • The return keyword specifies what the output of your function should be, i.e. what a call to your function will evaluate to.
  • Most functions we write will use return, but using return is not strictly required.
    • If you want to be able to save the output of your function to a variable, you must use return!
  • Be careful: print and return work differently!
In [47]:
def pythagorean(a, b):
    '''Computes the hypotenuse length of a right triangle with legs a and b.'''
    c = (a ** 2 + b ** 2) ** 0.5
    print(c)
In [48]:
x = pythagorean(3, 4)
5.0
In [49]:
# No output – why?
x
In [50]:
# Errors – why?
x + 10
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[50], line 2
      1 # Errors – why?
----> 2 x + 10

TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
In [51]:
def better_pythagorean(a, b):
    '''Computes the hypotenuse length of a right triangle with legs a and b, 
       and actually returns the result.
    '''
    c = (a ** 2 + b ** 2) ** 0.5
    return c
In [52]:
x = better_pythagorean(3, 4)
x
Out[52]:
5.0
In [53]:
x + 10
Out[53]:
15.0
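The two "why?" questions above have the same answer: a function that never reaches a return statement evaluates to None. The sketch below puts the two versions side by side. The names prints_only, returns_value, printed, and returned are hypothetical stand-ins for the lecture's functions.

```python
def prints_only(a, b):
    c = (a ** 2 + b ** 2) ** 0.5
    print(c)   # displays c, but gives nothing back to the caller

def returns_value(a, b):
    c = (a ** 2 + b ** 2) ** 0.5
    return c   # gives c back, so it can be saved in a variable

printed = prints_only(3, 4)
returned = returns_value(3, 4)

print(printed is None)  # True: there is nothing to add 10 to
print(returned + 10)
```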

Returning¶

Once a function executes a return statement, it stops running.

In [55]:
def motivational(quote):
    return 0
    print("Here's a motivational quote:", quote)
In [56]:
motivational('Fall seven times and stand up eight.')
Out[56]:
0

Applying functions to DataFrames¶

DSC 10 student data¶

The DataFrame roster contains the names and lecture sections of all students enrolled in DSC 10 this quarter. The first names are real, while the last names have been anonymized for privacy.

In [59]:
roster = bpd.read_csv('data/roster-anon.csv')
roster
Out[59]:
name section
0 Shawn Hhnxoq 10AM
1 Tom Egzuaz 11AM
2 Jiahao Zvwwyb 11AM
... ... ...
237 Jason Eglntp 11AM
238 Renee Fhlaos 11AM
239 Vivek Tbedny 11AM

240 rows × 2 columns

Example: Common first names¶

What is the most common first name among DSC 10 students? (Any guesses?)

In [61]:
roster
Out[61]:
name section
0 Shawn Hhnxoq 10AM
1 Tom Egzuaz 11AM
2 Jiahao Zvwwyb 11AM
... ... ...
237 Jason Eglntp 11AM
238 Renee Fhlaos 11AM
239 Vivek Tbedny 11AM

240 rows × 2 columns

  • Problem: We can't answer that right now, since we don't have a column with first names. If we did, we could group by it.
  • Solution: Use our function that extracts first names on every element of the 'name' column.

Using our first_name function¶

Somehow, we need to call first_name on every student's 'name'.

In [65]:
roster
Out[65]:
name section
0 Shawn Hhnxoq 10AM
1 Tom Egzuaz 11AM
2 Jiahao Zvwwyb 11AM
... ... ...
237 Jason Eglntp 11AM
238 Renee Fhlaos 11AM
239 Vivek Tbedny 11AM

240 rows × 2 columns

In [66]:
roster.get('name').iloc[0]
Out[66]:
'Shawn Hhnxoq'
In [67]:
first_name(roster.get('name').iloc[0])
Out[67]:
'Shawn'
In [68]:
first_name(roster.get('name').iloc[1])
Out[68]:
'Tom'

Ideally, there's a better solution than doing this hundreds of times...

.apply¶

  • To apply the function func_name to every element of column 'col' in DataFrame df, use
df.get('col').apply(func_name)
  • The .apply method is a Series method.
    • Important: We use .apply on Series, not DataFrames.
    • The output of .apply is also a Series.
  • Pass just the name of the function – don't call it!
    • Good ✅: .apply(first_name).
    • Bad ❌: .apply(first_name()).
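To see why the function is passed without parentheses, here is a minimal sketch of the idea behind .apply, using a plain list in place of a Series. The helper apply_to_each is hypothetical; it is not how .apply is actually implemented.

```python
def first_name(full_name):
    '''Returns the first name given a full name.'''
    return full_name.split(' ')[0]

def apply_to_each(values, func):
    '''Hypothetical stand-in for .apply that works on a plain list.'''
    results = []
    for value in values:
        # func is called here, once per element.
        results.append(func(value))
    return results

names = ['Shawn Hhnxoq', 'Tom Egzuaz', 'Jiahao Zvwwyb']
firsts = apply_to_each(names, first_name)  # no parentheses after first_name
print(firsts)
```

Writing first_name() would instead try to call the function immediately with no argument, which raises a TypeError before any element is processed.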
In [74]:
roster.get('name')
Out[74]:
0       Shawn Hhnxoq
1         Tom Egzuaz
2      Jiahao Zvwwyb
           ...      
237     Jason Eglntp
238     Renee Fhlaos
239     Vivek Tbedny
Name: name, Length: 240, dtype: object
In [75]:
roster.get('name').apply(first_name)
Out[75]:
0       Shawn
1         Tom
2      Jiahao
        ...  
237     Jason
238     Renee
239     Vivek
Name: name, Length: 240, dtype: object

Example: Common first names¶

In [77]:
roster = roster.assign(
    first=roster.get('name').apply(first_name)
)
roster
Out[77]:
name section first
0 Shawn Hhnxoq 10AM Shawn
1 Tom Egzuaz 11AM Tom
2 Jiahao Zvwwyb 11AM Jiahao
... ... ... ...
237 Jason Eglntp 11AM Jason
238 Renee Fhlaos 11AM Renee
239 Vivek Tbedny 11AM Vivek

240 rows × 3 columns

Now that we have a column containing first names, we can find the distribution of first names.

In [79]:
name_counts = (
    roster
    .groupby('first')
    .count()
    .sort_values('name', ascending=False)
    .get(['name'])
)
name_counts
Out[79]:
name
first
Ryan 4
Andrew 4
Nathan 3
... ...
Jiahao 1
Jimbo 1
Zora 1

212 rows × 1 columns
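For intuition about what grouping and counting produces, the sketch below does the same kind of tally on a plain list with Python's collections.Counter. This is an analogue, not what .groupby does internally, and sample_firsts is a made-up list rather than the real roster.

```python
from collections import Counter

# A made-up sample of first names, not the real roster.
sample_firsts = ['Ryan', 'Andy', 'Ryan', 'Zora', 'Andy', 'Ryan']

# Tally how many times each distinct name appears.
counts = Counter(sample_firsts)

# most_common() lists the names from most to least frequent.
for name, count in counts.most_common():
    print(name, count)
```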

Activity¶

Below:

  • Create a bar chart showing the number of students with each first name, but only include first names shared by at least two students.
  • Determine the proportion of students in DSC 10 who have a first name that is shared by at least two students.

Hint: Start by defining a DataFrame with only the names in name_counts that appeared at least twice. You can use this DataFrame to answer both questions.


✅ Solutions (try it yourself first):

shared_names = name_counts[name_counts.get('name') >= 2]

# Bar chart.
shared_names.sort_values('name').plot(kind='barh', y='name');

# Proportion = # students with a shared name / total # of students.
shared_names.get('name').sum() / roster.shape[0]

In [81]:
...
Out[81]:
Ellipsis
In [82]:
...
Out[82]:
Ellipsis

.apply works with built-in functions, too!¶

In [84]:
name_counts.get('name')
Out[84]:
first
Ryan      4
Andrew    4
Nathan    3
         ..
Jiahao    1
Jimbo     1
Zora      1
Name: name, Length: 212, dtype: int64
In [85]:
# Not necessarily meaningful, but doable.
name_counts.get('name').apply(np.log)
Out[85]:
first
Ryan      1.39
Andrew    1.39
Nathan    1.10
          ... 
Jiahao    0.00
Jimbo     0.00
Zora      0.00
Name: name, Length: 212, dtype: float64

Aside: Resetting the index¶

In name_counts, first names are stored in the index, which is not a Series. This means we can't use .apply on it.

In [87]:
name_counts.index
Out[87]:
Index(['Ryan', 'Andrew', 'Nathan', 'Vanessa', 'Anthony', 'Andy', 'David',
       'Katherine', 'Noah', 'William',
       ...
       'Ishaan', 'Izabella', 'Jaden', 'Janelle', 'Jared', 'Jennifer',
       'Jeremiah', 'Jiahao', 'Jimbo', 'Zora'],
      dtype='object', name='first', length=212)
In [88]:
name_counts.index.apply(max)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[88], line 1
----> 1 name_counts.index.apply(max)

AttributeError: 'Index' object has no attribute 'apply'

To help, we can use .reset_index() to turn the index of a DataFrame into a column, and to reset the index back to the default of 0, 1, 2, 3, and so on.

In [90]:
# What is the max of an individual string?
name_counts.reset_index().get('first').apply(max)
Out[90]:
0      y
1      w
2      t
      ..
209    o
210    o
211    r
Name: first, Length: 212, dtype: object
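The results above may look surprising. When max is given a string, it compares the individual characters by their Unicode code points, and lowercase letters have larger code points than uppercase ones. The short sketch below checks this on the first three names; largest and ryan_codes are names chosen for illustration.

```python
for name in ['Ryan', 'Andrew', 'Nathan']:
    largest = max(name)  # the character with the largest code point
    print(name, largest, ord(largest))

# The code points of each character in 'Ryan'.
ryan_codes = [ord(ch) for ch in 'Ryan']
print(ryan_codes)
```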

Example: Shared first names and sections¶

  • Suppose you're one of the roughly 20% of students in DSC 10 who has a first name that is shared with at least one other student.
  • Let's try to determine whether someone in your lecture section shares the same first name as you.
    • For example, maybe 'Jason Eglntp' wants to see if there's another 'Jason' in their section.

Strategy:

  1. Which section is 'Jason Eglntp' in?
  2. How many people in that section have a first name of 'Jason'?
In [93]:
roster
Out[93]:
name section first
0 Shawn Hhnxoq 10AM Shawn
1 Tom Egzuaz 11AM Tom
2 Jiahao Zvwwyb 11AM Jiahao
... ... ... ...
237 Jason Eglntp 11AM Jason
238 Renee Fhlaos 11AM Renee
239 Vivek Tbedny 11AM Vivek

240 rows × 3 columns

In [94]:
which_section = roster[roster.get('name') == 'Jason Eglntp'].get('section').iloc[0]
which_section
Out[94]:
'11AM'
In [95]:
first_cond = roster.get('first') == 'Jason' # A Boolean Series!
section_cond = roster.get('section') == which_section # A Boolean Series!
how_many = roster[first_cond & section_cond].shape[0]
how_many
Out[95]:
1

Another function: shared_first_and_section¶

Let's create a function named shared_first_and_section. It will take in the full name of a student and return the number of students in their section who share their first name (including the student themselves).

Note: This is the first function we're writing that involves using a DataFrame within the function – this is fine!

In [97]:
def shared_first_and_section(name):
    # First, find the row corresponding to that full name in roster.
    # We're assuming that full names are unique.
    row = roster[roster.get('name') == name]
    
    # Then, get that student's first name and section.
    first = row.get('first').iloc[0]
    section = row.get('section').iloc[0]
    
    # Now, find all the students with the same first name and section.
    shared_info = roster[(roster.get('first') == first) & (roster.get('section') == section)]
    
    # Return the number of such students.
    return shared_info.shape[0]
In [98]:
shared_first_and_section('Jason Eglntp')
Out[98]:
1
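The same two-step logic can be sketched without a DataFrame, which may make the steps easier to follow. The list mini_roster contains a few (name, section) pairs copied from the roster previews in this lecture, and shared_in_mini_roster is a hypothetical list-based analogue of the function above, not a replacement for it.

```python
# A few (name, section) pairs copied from the roster previews.
mini_roster = [
    ('Andy Vnalqe', '10AM'),
    ('Andy Caktll', '10AM'),
    ('Andy Puihyq', '10AM'),
    ('Jason Eglntp', '11AM'),
    ('Tom Egzuaz', '11AM'),
]

def first_name(full_name):
    return full_name.split(' ')[0]

def shared_in_mini_roster(name):
    '''Hypothetical list-based analogue of shared_first_and_section.'''
    # Step 1: find the student's section (assumes full names are unique).
    section = [sec for full, sec in mini_roster if full == name][0]
    first = first_name(name)
    # Step 2: count students matching both the first name and the section.
    matches = [
        full for full, sec in mini_roster
        if first_name(full) == first and sec == section
    ]
    return len(matches)

print(shared_in_mini_roster('Andy Caktll'))
print(shared_in_mini_roster('Jason Eglntp'))
```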

Now, let's add a column to roster that contains the values returned by shared_first_and_section.

In [100]:
roster = roster.assign(shared=roster.get('name').apply(shared_first_and_section))
roster
Out[100]:
name section first shared
0 Shawn Hhnxoq 10AM Shawn 1
1 Tom Egzuaz 11AM Tom 1
2 Jiahao Zvwwyb 11AM Jiahao 1
... ... ... ... ...
237 Jason Eglntp 11AM Jason 1
238 Renee Fhlaos 11AM Renee 1
239 Vivek Tbedny 11AM Vivek 1

240 rows × 4 columns

Let's find all of the students who are in a section with someone who has the same first name as them.

In [102]:
roster[(roster.get('shared') >= 2)].sort_values('shared', ascending=False)
Out[102]:
name section first shared
88 Ryan Nvrosl 11AM Ryan 3
133 Andy Vnalqe 10AM Andy 3
55 Andy Caktll 10AM Andy 3
... ... ... ... ...
39 Vanessa Mqyyub 10AM Vanessa 2
34 Amelia Grfclp 10AM Amelia 2
220 Nathan Sbyzyi 10AM Nathan 2

26 rows × 4 columns

We can narrow this down to a particular lecture section if we'd like.

In [104]:
one_section_only = (
    roster[(roster.get('shared') >= 2) & 
           (roster.get('section') == '10AM')]
    .sort_values('shared', ascending=False)
)
one_section_only
Out[104]:
name section first shared
133 Andy Vnalqe 10AM Andy 3
55 Andy Caktll 10AM Andy 3
203 Andy Puihyq 10AM Andy 3
... ... ... ... ...
39 Vanessa Mqyyub 10AM Vanessa 2
34 Amelia Grfclp 10AM Amelia 2
220 Nathan Sbyzyi 10AM Nathan 2

19 rows × 4 columns

For instance, the above DataFrame preview is telling us that there are 3 Andys in the 10AM section.

In [106]:
# All of the names shared by multiple students in the 10AM section.
one_section_only.get('first').unique()
Out[106]:
array(['Andy', 'Anthony', 'Nathan', 'Sophie', 'Noah', 'Andrew', 'Vanessa',
       'Amelia', 'Daniel'], dtype=object)

Sneak peek¶

While the DataFrames on the previous slide contain the info we were looking for, they're not organized very conveniently. For instance, there are three rows containing the fact that there are 3 Andys in the 10AM lecture section.

Wouldn't it be great if we could create a DataFrame like the one below? We'll see how next time!

[Image: the desired DataFrame]

Activity¶

Find the longest first name in the class that is shared by at least two students in the same section.

Hint: You'll have to use both .assign and .apply.


✅ Answer (try it yourself first):

with_len = roster.assign(name_len=roster.get('first').apply(len))
with_len[with_len.get('shared') >= 2].sort_values('name_len', ascending=False).get('first').iloc[0]

In [110]:
...
Out[110]:
Ellipsis

Summary, next time¶

Summary¶

  • Functions are a way to divide our code into small subparts to prevent us from writing repetitive code.
  • The .apply method allows us to call a function on every single element of a Series, which usually comes from .getting a column of a DataFrame.

Next time¶

More advanced DataFrame manipulations!