Lecture 3 – Strings and Arrays¶

DSC 10, Spring 2023¶

Announcements¶

  • Lab 0 is out and is due on Tuesday, April 11th at 11:59PM.
    • It contains a video 🎥 towards the end: Navigating DataHub and Jupyter Notebooks. Watching it should be a worthwhile investment of your time!
  • Please fill out the Welcome Survey!
  • You must be present when attendance is taken in discussion to get credit, even if you have a conflicting class.

Resources 🤝¶

  • We're covering a lot of content very quickly. If you're overwhelmed, just know that we're here to support you!
    • Ed and office hours are your friends! 🫂
  • Remember to check the Resources tab of the course website for programming resources.

Agenda¶

  • Recap: Data types.
  • Strings. 🧶
  • Lists.
  • Arrays.
  • Ranges.

Recap: Data types¶

int and float¶

  • Every value in Python has a type.
  • There are two numeric data types:
    • int: An integer of any size.
    • float: A number with a decimal point.
In [1]:
# int.
15 - 4
Out[1]:
11
In [2]:
# float.
6 * 0.2
Out[2]:
1.2000000000000002

Converting between int and float¶

  • If you mix ints and floats in an expression, the result will always be a float.
    • Note that when you divide two ints, you get a float back.
  • A value can be explicitly coerced (i.e. converted) using the int and float functions.
In [3]:
2.0 + 3
Out[3]:
5.0
In [4]:
12 / 2
Out[4]:
6.0
In [5]:
# Want an integer back.
int(12 / 2)
Out[5]:
6
In [6]:
# int chops off everything after the decimal point; it doesn't round!
int(-2.9)
Out[6]:
-2
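
To make the "chopping" behavior concrete, here is a small sketch comparing int with the built-in round function. Note that round was not used in the cells above; it is included only as a point of comparison.

```python
# int() drops the fractional part (moving toward zero); it does not round.
truncated_pos = int(2.9)
truncated_neg = int(-2.9)

# round() goes to the nearest integer instead.
rounded_pos = round(2.9)
rounded_neg = round(-2.9)

print(truncated_pos, truncated_neg)  # 2 -2
print(rounded_pos, rounded_neg)      # 3 -3
```

If you want the nearest whole number rather than the truncated one, round is probably the better choice.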

Strings 🧶¶

Strings 🧶¶

  • A string is a snippet of text of any length.
  • In Python, strings are enclosed by either single quotes or double quotes.
In [7]:
'woof'
Out[7]:
'woof'
In [8]:
type('woof')
Out[8]:
str
In [9]:
"woof"
Out[9]:
'woof'
In [10]:
# A string, not an int!
"1998"
Out[10]:
'1998'

String arithmetic¶

When using the + symbol between two strings, the operation is called "concatenation".

In [11]:
s1 = 'baby'
s2 = '🐼'
In [12]:
s1 + s2
Out[12]:
'baby🐼'
In [13]:
s1 + ' ' + s2
Out[13]:
'baby 🐼'
In [14]:
s2 * 3
Out[14]:
'🐼🐼🐼'

String methods¶

  • Associated with strings are special functions, called string methods.
  • Access string methods with a . after the string ("dot notation").
    • For instance, to use the upper method on string s, we write s.upper().
  • Examples include upper, title, and replace.
In [15]:
my_cool_string = 'data science is super cool!'
In [16]:
my_cool_string.title()
Out[16]:
'Data Science Is Super Cool!'
In [17]:
my_cool_string.upper()
Out[17]:
'DATA SCIENCE IS SUPER COOL!'
In [18]:
my_cool_string.replace('super cool', '💯' * 3)
Out[18]:
'data science is 💯💯💯!'
In [19]:
# len is not a method, since it doesn't use dot notation.
len(my_cool_string)
Out[19]:
27
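
One detail worth seeing in code: string methods return a new string and leave the original untouched. The sketch below reuses my_cool_string from above; the names shouted and chained, and the replacement text 'fun', are made up for illustration.

```python
my_cool_string = 'data science is super cool!'

# upper() gives back a new string; my_cool_string itself is unchanged.
shouted = my_cool_string.upper()
print(shouted)         # DATA SCIENCE IS SUPER COOL!
print(my_cool_string)  # data science is super cool!

# Because each method returns a string, method calls can be chained.
chained = my_cool_string.replace('super cool', 'fun').title()
print(chained)         # Data Science Is Fun!
```

To keep the result of a method, assign it to a variable, as done with shouted above.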

Aside: print¶

  • By default, Jupyter Notebooks display the "raw" value of the expression of the last line in a cell.
  • The print function displays the value as human-readable text when it's evaluated.
In [20]:
12 # 12 won't be displayed, since Jupyter only shows the value of the last expression in a cell.
23
Out[20]:
23
In [21]:
# Note, there is no Out[number] to the left! That only appears when displaying a non-printed value.
# But both 12 and 23 are displayed.
print(12)
print(23)
12
23
In [22]:
# '\n' inserts a new line.
my_newline_str = 'Here is a string with two lines.\nHere is the second line!'  
my_newline_str
Out[22]:
'Here is a string with two lines.\nHere is the second line!'
In [23]:
# The quotes disappeared and the newline is rendered!
print(my_newline_str)  
Here is a string with two lines.
Here is the second line!

Type conversion to and from strings¶

  • Any value can be converted to a string using str.
  • Some strings can be converted to int and float.
In [24]:
str(3)
Out[24]:
'3'
In [25]:
float('3')
Out[25]:
3.0
In [26]:
int('4')
Out[26]:
4
In [27]:
int('baby panda')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/ch/hyjw6whx3g9gshnp58738jc80000gp/T/ipykernel_58527/455936715.py in <module>
----> 1 int('baby panda')

ValueError: invalid literal for int() with base 10: 'baby panda'
In [28]:
int('4.3')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/ch/hyjw6whx3g9gshnp58738jc80000gp/T/ipykernel_58527/756068685.py in <module>
----> 1 int('4.3')

ValueError: invalid literal for int() with base 10: '4.3'
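
One possible workaround for the error above, sketched under the assumption that truncating the decimal part is acceptable: convert the string to a float first, then to an int. The variable names here are made up for illustration.

```python
text = '4.3'

# int('4.3') raises a ValueError, but float('4.3') works.
as_float = float(text)

# int() then chops off everything after the decimal point.
as_int = int(as_float)

print(as_float)  # 4.3
print(as_int)    # 4
```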

Concept Check ✅ – Answer at cc.dsc10.com¶

Assume you have run the following statements:

x = 3
y = '4'
z = '5.6'

Choose the expression that will be evaluated without an error.

A. x + y

B. x + int(y + z)

C. str(x) + int(y)

D. str(x) + z

E. All of them have errors

Lists¶

Motivation¶

How would we store today's high temperature in several different cities?

Our best solution right now is to create a separate variable for each city.

In [29]:
temp_sandiego = 68
temp_losangeles = 73
temp_sanfrancisco = 60
temp_chicago = 50
temp_newyorkcity = 76
temp_boston = 50

This technically allows us to do things like compute the average temperature:

avg_temperature = 1/6 * (
    temp_sandiego
    + temp_losangeles
    + temp_sanfrancisco
    + ...)

Imagine we had 10 or 100 cities – there must be a better way!

Lists in Python¶

In Python, a list is used to store multiple values within a single value. To create a new list from scratch, we use [square brackets].

In [30]:
temperature_list = [68, 73, 60, 50, 76, 50]
In [31]:
len(temperature_list)
Out[31]:
6

Notice that the elements in a list don't need to be unique!

Lists make working with sequences easy!¶

To find the average temperature, we just need to divide the sum of the temperatures by the number of temperatures recorded:

In [32]:
temperature_list
Out[32]:
[68, 73, 60, 50, 76, 50]
In [33]:
sum(temperature_list) / len(temperature_list)
Out[33]:
62.833333333333336

Types¶

The type of a list is... list.

In [34]:
temperature_list
Out[34]:
[68, 73, 60, 50, 76, 50]
In [35]:
type(temperature_list)
Out[35]:
list

Within a list, you can store elements of different types.

In [36]:
mixed_list = [-2, 2.5, 'ucsd', [1, 3]]
mixed_list
Out[36]:
[-2, 2.5, 'ucsd', [1, 3]]

There's a problem...¶

  • Lists are slow for numerical computations.
  • This is not a big deal when there aren't many entries, but it's a big problem when there are millions or billions of entries.

Arrays¶

NumPy¶

  • NumPy (pronounced "num pie") is a Python library (module) that provides support for arrays and operations on them.
    • NumPy is used heavily in the real world.
  • The babypandas library, which you will learn about next week, goes hand-in-hand with NumPy.
  • To use NumPy, we need to import it. It's usually imported as np (but doesn't have to be!).

In [37]:
import numpy as np

Arrays¶

Think of NumPy arrays (just "arrays" from now on) as fancy, faster lists.

To create an array, we pass a list as input to the np.array function.

In [38]:
np.array([4, 9, 1, 2])
Out[38]:
array([4, 9, 1, 2])
In [39]:
temperature_array = np.array([68, 73, 60, 50, 76, 50])
temperature_array
Out[39]:
array([68, 73, 60, 50, 76, 50])
In [40]:
temperature_list
Out[40]:
[68, 73, 60, 50, 76, 50]
In [41]:
# No square brackets, because temperature_list is already a list!
np.array(temperature_list)
Out[41]:
array([68, 73, 60, 50, 76, 50])

Positions¶

When people stand in a line, each person has a position.

Similarly, each element of an array (and list) has a position.

Accessing elements by position¶

  • Python, like most programming languages, is "0-indexed."
    • This means that the position of the first element in an array is 0, not 1.
    • One interpretation is that an element's position represents the number of elements in front of it.
  • To access the element in array arr_name at position pos, we use the syntax arr_name[pos].
In [42]:
temperature_array
Out[42]:
array([68, 73, 60, 50, 76, 50])
In [43]:
temperature_array[0]
Out[43]:
68
In [44]:
temperature_array[1]
Out[44]:
73
In [45]:
temperature_array[3]
Out[45]:
50
In [46]:
# Access the last element.
temperature_array[5]
Out[46]:
50
In [47]:
# Doesn't work!
temperature_array[6]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
/var/folders/ch/hyjw6whx3g9gshnp58738jc80000gp/T/ipykernel_58527/3166117.py in <module>
      1 # Doesn't work!
----> 2 temperature_array[6]

IndexError: index 6 is out of bounds for axis 0 with size 6
In [48]:
# If a position is negative, count from the end!
temperature_array[-1]
Out[48]:
50
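
To connect negative positions to the 0-indexing rule, here is a short sketch using the same temperatures as above. The names n, last_by_count, last_by_negative, and second_to_last are made up for illustration.

```python
import numpy as np

temperature_array = np.array([68, 73, 60, 50, 76, 50])
n = len(temperature_array)

# With 0-indexing, the last element sits at position n - 1.
last_by_count = temperature_array[n - 1]

# Position -1 refers to the same element, counting from the end.
last_by_negative = temperature_array[-1]

# Position -2 is the second-to-last element.
second_to_last = temperature_array[-2]

print(last_by_count, last_by_negative, second_to_last)  # 50 50 76
```

Negative positions are handy when you don't know (or don't want to compute) the length of the array.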

Types¶

Earlier in the lecture, we saw that lists can store elements of multiple types.

In [49]:
nums_and_strings_lst = ['uc', 'sd', 1961, 3.14]
nums_and_strings_lst
Out[49]:
['uc', 'sd', 1961, 3.14]

This is not true of arrays – all elements in an array must be of the same type.

In [50]:
# All elements are converted to strings!
np.array(nums_and_strings_lst)
Out[50]:
array(['uc', 'sd', '1961', '3.14'], dtype='<U32')
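
The same rule applies when mixing ints and floats. As a small sketch (the array name is made up, and the dtype attribute was not covered above), NumPy converts every element to a float:

```python
import numpy as np

# Two ints and one float: all elements end up as floats.
ints_and_floats = np.array([1, 2, 3.5])

print(ints_and_floats)
print(ints_and_floats.dtype)  # float64
```

The dtype attribute reports the single type shared by all elements of the array.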

Array-number arithmetic¶

Arrays make it easy to perform the same operation on every element. This behavior is formally known as "broadcasting".

In [51]:
temperature_array
Out[51]:
array([68, 73, 60, 50, 76, 50])
In [52]:
# Increase all temperatures by 3 degrees.
temperature_array + 3
Out[52]:
array([71, 76, 63, 53, 79, 53])
In [53]:
# Halve all temperatures.
temperature_array / 2
Out[53]:
array([34. , 36.5, 30. , 25. , 38. , 25. ])
In [54]:
# Convert all temperatures to Celsius.
(5 / 9) * (temperature_array - 32)
Out[54]:
array([20.        , 22.77777778, 15.55555556, 10.        , 24.44444444,
       10.        ])

Note: In none of the above cells did we actually modify temperature_array! Each of those expressions created a new array.

In [55]:
temperature_array
Out[55]:
array([68, 73, 60, 50, 76, 50])

To actually change temperature_array, we need to reassign it to a new array.

In [56]:
temperature_array = (5 / 9) * (temperature_array - 32)
In [57]:
# Now in Celsius!
temperature_array
Out[57]:
array([20.        , 22.77777778, 15.55555556, 10.        , 24.44444444,
       10.        ])

Element-wise arithmetic¶

  • We can apply arithmetic operations to multiple arrays, provided they have the same length.
  • The result is computed element-wise, which means that the arithmetic operation is applied to one pair of elements from each array at a time.
  • For example, a + b is an array whose first element is the sum of the first element of a and first element of b.
In [58]:
a = np.array([4, 5, -1])
b = np.array([2, 3, 2])
In [59]:
a + b
Out[59]:
array([6, 8, 1])
In [60]:
a / b
Out[60]:
array([ 2.        ,  1.66666667, -0.5       ])
In [61]:
a ** 2 + b ** 2
Out[61]:
array([20, 34,  5])
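
Element-wise arithmetic is specific to arrays. As a contrast, this sketch applies + to two lists and then to two arrays holding the same numbers as a and b above; the names a_list, b_list, list_result, and array_result are made up for illustration.

```python
import numpy as np

a_list = [4, 5, -1]
b_list = [2, 3, 2]

# With lists, + concatenates: the result has 6 elements.
list_result = a_list + b_list
print(list_result)   # [4, 5, -1, 2, 3, 2]

# With arrays, + adds element by element: the result has 3 elements.
array_result = np.array(a_list) + np.array(b_list)
print(array_result)  # [6 8 1]
```

This mirrors how + concatenates strings, and it's one reason to convert lists to arrays before doing arithmetic.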

Example: TikTok views 🎬¶

We decided to make a series of TikToks called "A Day in the Life of a Data Scientist". The number of views each video has received is stored in the array views below.

In [62]:
views = np.array([158, 352, 195, 1423916, 46])

Some questions:

What was our average view count?

In [63]:
views
Out[63]:
array([    158,     352,     195, 1423916,      46])
In [64]:
sum(views) / len(views)
Out[64]:
284933.4
In [65]:
# The mean method exists for arrays (but not for lists).
views.mean()
Out[65]:
284933.4

How many views did our most and least popular videos receive?

In [66]:
views
Out[66]:
array([    158,     352,     195, 1423916,      46])
In [67]:
views.max()
Out[67]:
1423916
In [68]:
views.min()
Out[68]:
46

How many views above average did each of our videos receive? How many views above average did our most viewed video receive?

In [69]:
views
Out[69]:
array([    158,     352,     195, 1423916,      46])
In [70]:
views - views.mean()
Out[70]:
array([-284775.4, -284581.4, -284738.4, 1138982.6, -284887.4])
In [71]:
(views - views.mean()).max()
Out[71]:
1138982.6

It has been estimated that TikTok pays their creators $0.03 per 1000 views. If this is true, how many dollars did we earn on our most viewed video? 💸

In [72]:
views
Out[72]:
array([    158,     352,     195, 1423916,      46])
In [73]:
views.max() * 0.03 / 1000
Out[73]:
42.717479999999995
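
Since arithmetic applies to every element of an array, the same calculation can be done for all videos at once. This is a sketch that assumes the estimated rate of $0.03 per 1000 views; the names dollars_per_view, earnings, and total_earnings are made up for illustration.

```python
import numpy as np

views = np.array([158, 352, 195, 1423916, 46])

# Estimated rate from above: $0.03 per 1000 views.
dollars_per_view = 0.03 / 1000

# One multiplication gives the earnings for every video.
earnings = views * dollars_per_view

# Add up the earnings across all five videos.
total_earnings = earnings.sum()
print(round(total_earnings, 2))  # 42.74
```

Almost all of the total comes from the one viral video.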

Ranges¶

Motivation¶

We often find ourselves needing to make arrays like this:

In [74]:
months_in_year = np.array([
    1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 
])

There needs to be an easier way to do this!

Ranges¶

  • A range is an array of evenly spaced numbers. We create ranges using np.arange.
  • The most general way to create a range is np.arange(start, end, step). This returns an array such that:
    • The first number is start. By default, start is 0.
    • All subsequent numbers are spaced out by step, until (but excluding) end. By default, step is 1.
In [75]:
# Start at 0, end before 8, step by 1.
# This will be our most common use-case!
np.arange(8)
Out[75]:
array([0, 1, 2, 3, 4, 5, 6, 7])
In [76]:
# Start at 5, end before 10, step by 1.
np.arange(5, 10)
Out[76]:
array([5, 6, 7, 8, 9])
In [77]:
# Start at 3, end before 32, step by 5.
np.arange(3, 32, 5)
Out[77]:
array([ 3,  8, 13, 18, 23, 28])
In [78]:
# Steps can be fractional!
np.arange(-3, 2, 0.5)
Out[78]:
array([-3. , -2.5, -2. , -1.5, -1. , -0.5,  0. ,  0.5,  1. ,  1.5])
In [79]:
# If step is negative, we count backwards.
np.arange(1, -10, -3)
Out[79]:
array([ 1, -2, -5, -8])
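
Returning to the motivating example, here is one way to build months_in_year with np.arange instead of typing out all twelve numbers. Since end is excluded, we stop at 13 rather than 12.

```python
import numpy as np

# Start at 1, end before 13, step by 1.
months_in_year = np.arange(1, 13)

print(months_in_year)       # [ 1  2  3  4  5  6  7  8  9 10 11 12]
print(len(months_in_year))  # 12
```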

Activity¶

🎉 Congrats! 🎉 You won the lottery 💰. Here's how your payout works: on the first day of January, you are paid $0.01. Every day thereafter, your pay doubles, so on the second day you're paid $0.02, on the third day you're paid $0.04, on the fourth day you're paid $0.08, and so on.

January has 31 days.

Write a one-line expression that uses the numbers 2 and 31, along with the function np.arange and the method .sum(), that computes the total amount in dollars you will be paid in January.

In [80]:
...
Out[80]:
Ellipsis

Summary, next time¶

Summary¶

  • Strings are used to store text. Enclose them in single or double quotes.
  • Lists and arrays are used to store sequences.
    • Arrays are faster and more convenient for numerical operations.
    • You can easily perform numerical operations on all elements of an array and perform operations on multiple arrays.
  • Ranges are arrays of equally-spaced numbers.
  • Remember to refer to the resources from the start of lecture!

Next time¶

We'll learn how to use Python to work with real-world tabular data.