import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (10, 5)
### Agenda

- loc and iloc.
- pandas and numpy.

### loc and iloc

enrollments = pd.DataFrame({
'Name': ['Granger, Hermione', 'Potter, Harry', 'Weasley, Ron', 'Longbottom, Neville'],
'PID': ['A13245986', 'A17645384', 'A32438694', 'A52342436'],
'LVL': [1, 1, 1, 1]
})
enrollments
| | Name | PID | LVL |
|---|---|---|---|
| 0 | Granger, Hermione | A13245986 | 1 |
| 1 | Potter, Harry | A17645384 | 1 |
| 2 | Weasley, Ron | A32438694 | 1 |
| 3 | Longbottom, Neville | A52342436 | 1 |
So far, we've used [] to select columns and loc to select rows.
enrollments.loc[enrollments['Name'] < 'M']['PID']
0    A13245986
3    A52342436
Name: PID, dtype: object
loc can also be used to select both rows and columns. The general pattern is:
df.loc[<row selector>, <column selector>]
Examples:
- df.loc[idx_list, col_list] returns a DataFrame containing the rows in idx_list and columns in col_list.
- df.loc[bool_arr, col_list] returns a DataFrame containing the rows for which bool_arr is True and columns in col_list.
- If : is used as the first input, all rows are kept. If : is used as the second input, all columns are kept.

enrollments
| | Name | PID | LVL |
|---|---|---|---|
| 0 | Granger, Hermione | A13245986 | 1 |
| 1 | Potter, Harry | A17645384 | 1 |
| 2 | Weasley, Ron | A32438694 | 1 |
| 3 | Longbottom, Neville | A52342436 | 1 |
enrollments.loc[enrollments['Name'] < 'M', 'PID']
0    A13245986
3    A52342436
Name: PID, dtype: object
enrollments.loc[enrollments['Name'] < 'M', ['PID']]
| | PID |
|---|---|
| 0 | A13245986 |
| 3 | A52342436 |
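Rounding out the examples above, here are the : cases as quick sketches, using the same enrollments DataFrame:

# All rows, a list of columns.
enrollments.loc[:, ['Name', 'LVL']]

# A boolean row selector, keeping all columns.
enrollments.loc[enrollments['LVL'] == 1, :]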
In df.loc[<row selection>, <column selection>], both selections can also use slice (:) syntax (e.g. 0:2 or 'Name':'PID'). Unlike standard Python slices, loc slices include both endpoints. There are many, many more ways to select rows and columns – see the pandas documentation for more.
enrollments
| | Name | PID | LVL |
|---|---|---|---|
| 0 | Granger, Hermione | A13245986 | 1 |
| 1 | Potter, Harry | A17645384 | 1 |
| 2 | Weasley, Ron | A32438694 | 1 |
| 3 | Longbottom, Neville | A52342436 | 1 |
enrollments.loc[2, 'LVL']
1
enrollments.loc[0:2, 'Name':'PID']
| | Name | PID |
|---|---|---|
| 0 | Granger, Hermione | A13245986 |
| 1 | Potter, Harry | A17645384 |
| 2 | Weasley, Ron | A32438694 |
### iloc!

- iloc stands for "integer location".
- iloc is like loc, but it selects rows and columns based on integer positions only.
- Note that like standard Python slicing (and unlike loc), iloc slices exclude the right endpoint.

enrollments
| | Name | PID | LVL |
|---|---|---|---|
| 0 | Granger, Hermione | A13245986 | 1 |
| 1 | Potter, Harry | A17645384 | 1 |
| 2 | Weasley, Ron | A32438694 | 1 |
| 3 | Longbottom, Neville | A52342436 | 1 |
enrollments.iloc[2:4, 0:2]
| | Name | PID |
|---|---|---|
| 2 | Weasley, Ron | A32438694 |
| 3 | Longbottom, Neville | A52342436 |
other = enrollments.set_index('Name')
other
| | PID | LVL |
|---|---|---|
| Name | | |
| Granger, Hermione | A13245986 | 1 |
| Potter, Harry | A17645384 | 1 |
| Weasley, Ron | A32438694 | 1 |
| Longbottom, Neville | A52342436 | 1 |
other.iloc[2]
PID    A32438694
LVL            1
Name: Weasley, Ron, dtype: object
other.loc[2]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/var/folders/pd/w73mdrsj2836_7gp0brr2q7r0000gn/T/ipykernel_57577/776803957.py in <module>
----> 1 other.loc[2]

(...several pandas-internal frames omitted...)

KeyError: 2
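Since other's index consists of names, the label-based equivalent of other.iloc[2] uses the row's label:

other.loc['Weasley, Ron']

PID    A32438694
LVL            1
Name: Weasley, Ron, dtype: object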
Consider the DataFrame below.
jack = pd.DataFrame({1: ['fee', 'fi'], '1': ['fo', 'fum']})
jack
| | 1 | 1 |
|---|---|---|
| 0 | fee | fo |
| 1 | fi | fum |
For each of the following pieces of code, predict what the output will be. Then, uncomment the line of code and see for yourself.
# jack[1]
# jack[[1]]
# jack['1']
# jack[[1, 1]]
# jack.loc[1]
# jack.loc[jack[1] == 'fo']
# jack[1, ['1', 1]]
# jack.loc[1, 1]

### pandas and numpy

numpy's main object is the array. In numpy, arrays are homogeneous – all elements share a single data type. numpy is fast because numpy arrays are stored more efficiently in memory than, say, Python lists, and because operations are vectorized across entire numpy arrays.

### pandas is built upon numpy

- A Series in pandas is a numpy array with an index.
- Each column of a DataFrame is a numpy array.
- Operations in pandas are fast because they use numpy's implementations.
- To access the numpy array underlying a Series, use the to_numpy method.
- ⚠️ to_numpy returns a view of the original object, not a copy! Read more in the course notes.
- .values is a soon-to-be-deprecated version of .to_numpy().

arr = np.array([4, 2, 9, 15, -1])
arr
array([ 4, 2, 9, 15, -1])
ser = pd.Series(arr, index='a b c d e'.split(' '))
ser
a     4
b     2
c     9
d    15
e    -1
dtype: int64
conv = ser.to_numpy()
conv
array([ 4, 2, 9, 15, -1])
conv[2] = 100
conv
array([ 4, 2, 100, 15, -1])
ser
a      4
b      2
c    100
d     15
e     -1
dtype: int64
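If you need an independent array, to_numpy accepts a copy argument. A quick sketch – copy=True guarantees the result doesn't share memory with the Series:

safe = ser.to_numpy(copy=True)
safe[0] = -999  # modifies only the copy...
ser             # ...so ser is unchanged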
### for-loops

- for-loops are slow when processing large datasets.
- To see how much faster numpy arithmetic is than using a for-loop, let's compute the distances between the origin $(0, 0)$ and 2000 random points $(x, y)$ in $\mathbb{R}^2$:
    - once using a for-loop, and
    - once using vectorized arithmetic (via numpy).
- np.random.random(N) returns an array containing N numbers selected uniformly at random from the interval $[0, 1)$.

N = 2000
x_arr = np.random.random(N)
y_arr = np.random.random(N)
coordinates = pd.DataFrame({"x": x_arr, "y": y_arr})
coordinates.head()
| | x | y |
|---|---|---|
| 0 | 0.155402 | 0.346735 |
| 1 | 0.922027 | 0.746329 |
| 2 | 0.855037 | 0.630200 |
| 3 | 0.755672 | 0.849689 |
| 4 | 0.518506 | 0.177432 |
Next, let's define a function that takes in a DataFrame like the one above and returns the distances between each point and the origin, using a for-loop.
def distances(df):
hyp_list = []
for i in df.index:
dist = (df.loc[i, 'x'] ** 2 + df.loc[i, 'y'] ** 2) ** 0.5
hyp_list.append(dist)
return hyp_list
The %timeit magic command can repeatedly run any snippet of code and give us its average runtime.
%timeit distances(coordinates)
20.3 ms ± 197 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Now, using a vectorized approach:
%timeit (coordinates['x'] ** 2 + coordinates['y'] ** 2) ** 0.5
136 µs ± 752 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Note that "µs" refers to microseconds, which are one-millionth of a second, whereas "ms" refers to milliseconds, which are one-thousandth of a second.
Takeaway: avoid for-loops whenever possible!
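For reference, here's the vectorized approach packaged as a function – a sketch equivalent to distances, except that it returns a Series instead of a list and runs roughly 150x faster given the timings above:

def distances_vectorized(df):
    # Squares and sums entire columns at once – no Python-level loop.
    return (df['x'] ** 2 + df['y'] ** 2) ** 0.5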
### pandas data types

- A data type in pandas refers to the type of values in a column.
- pandas tries to guess the correct data type for a given DataFrame, and is often wrong.

### pandas data types

| Pandas dtype | Python type | NumPy type | SQL type | Usage |
|---|---|---|---|---|
| int64 | int | int_, int8,...,int64, uint8,...,uint64 | INT, BIGINT | Integer numbers |
| float64 | float | float_, float16, float32, float64 | FLOAT | Floating point numbers |
| bool | bool | bool_ | BOOL | True/False values |
| datetime64 | NA | datetime64[ns] | DATETIME | Date and time values |
| timedelta[ns] | NA | NA | NA | Differences between two datetimes |
| category | NA | NA | ENUM | Finite list of text values |
| object | str | string, unicode | NA | Text |
| object | NA | object | NA | Mixed types |
This article details how pandas stores different data types under the hood.
### numpy array(s)

- The dtypes attribute (of both Series and DataFrames) describes the data type of each column.
- The to_numpy method, when used on a Series, returns an array in which all values are of the data type specified by dtypes.
- The to_numpy method, when used on a DataFrame, returns a multi-dimensional array of type object, unless all columns in the DataFrame are homogeneous.

# Read in file
elections_fp = os.path.join('data', 'elections.csv')
elections = pd.read_csv(elections_fp)
elections.head()
| | Candidate | Party | % | Year | Result |
|---|---|---|---|---|---|
| 0 | Reagan | Republican | 50.7 | 1980 | win |
| 1 | Carter | Democratic | 41.0 | 1980 | loss |
| 2 | Anderson | Independent | 6.6 | 1980 | loss |
| 3 | Reagan | Republican | 58.8 | 1984 | win |
| 4 | Mondale | Democratic | 37.6 | 1984 | loss |
elections.dtypes
Candidate     object
Party         object
%            float64
Year           int64
Result        object
dtype: object
elections['Year'].dtypes
dtype('int64')
elections['Year'].to_numpy().dtype
dtype('int64')
elections.to_numpy()
array([['Reagan', 'Republican', 50.7, 1980, 'win'],
['Carter', 'Democratic', 41.0, 1980, 'loss'],
['Anderson', 'Independent', 6.6, 1980, 'loss'],
['Reagan', 'Republican', 58.8, 1984, 'win'],
['Mondale', 'Democratic', 37.6, 1984, 'loss'],
['Bush', 'Republican', 53.4, 1988, 'win'],
['Dukakis', 'Democratic', 45.6, 1988, 'loss'],
['Clinton', 'Democratic', 43.0, 1992, 'win'],
['Bush', 'Republican', 37.4, 1992, 'loss'],
['Perot', 'Independent', 18.9, 1992, 'loss'],
['Clinton', 'Democratic', 49.2, 1996, 'win'],
['Dole', 'Republican', 40.7, 1996, 'loss'],
['Perot', 'Independent', 8.4, 1996, 'loss'],
['Gore', 'Democratic', 48.4, 2000, 'loss'],
['Bush', 'Republican', 47.9, 2000, 'win'],
['Kerry', 'Democratic', 48.3, 2004, 'loss'],
['Bush', 'Republican', 50.7, 2004, 'win'],
['Obama', 'Democratic', 52.9, 2008, 'win'],
['McCain', 'Republican', 45.7, 2008, 'loss'],
['Obama', 'Democratic', 51.1, 2012, 'win'],
['Romney', 'Republican', 47.2, 2012, 'loss'],
['Clinton', 'Democratic', 48.2, 2016, 'loss'],
['Trump', 'Republican', 46.1, 2016, 'win'],
['Biden', 'Democratic', 51.3, 2020, 'win'],
['Trump', 'Republican', 46.9, 2020, 'loss']], dtype=object)
What do you think is happening here?
elections['Year'] ** 7
0    -9176136658659852288
1    -9176136658659852288
2    -9176136658659852288
3    -8884621300430012416
4    -8884621300430012416
5    -6382217169766957056
6    -6382217169766957056
7    -1459841187188834304
8    -1459841187188834304
9    -1459841187188834304
10    6093279273289170944
11    6093279273289170944
12    6093279273289170944
13   -1957127470578663424
14   -1957127470578663424
15   -6950134928041361408
16   -6950134928041361408
17   -8669840355176218624
18   -8669840355176218624
19   -6898610338587885568
20   -6898610338587885568
21   -1417070411446747136
22   -1417070411446747136
23    7995905371925528576
24    7995905371925528576
Name: Year, dtype: int64
### numpy and pandas don't always make the same decisions!

numpy prefers homogeneous data types to optimize memory and read/write speed. This leads to type coercion. Notice that the array created below contains only strings, even though there was an int in the argument list.
np.array(['a', 1])
array(['a', '1'], dtype='<U21')
On the other hand, pandas prioritizes correctness and ease of use. The Series created below is of type object, which preserves the original data types in the argument list.
pd.Series(['a', 1])
0    a
1    1
dtype: object
pd.Series(['a', 1]).values
array(['a', 1], dtype=object)
You can specify the data type of an array when initializing it by using the dtype argument.
np.array(['a', 1], dtype=object)
array(['a', 1], dtype=object)
pandas does make some trade-offs for efficiency, however. For instance, a Series consisting of both ints and floats is coerced to the float64 data type.
pd.Series([1, 1.0])
0    1.0
1    1.0
dtype: float64
You can change the data type of a Series using the .astype Series method.
ser = pd.Series(['1', '2', '3', '4'])
ser
0    1
1    2
2    3
3    4
dtype: object
ser.astype(int)
0    1
1    2
2    3
3    4
dtype: int64
ser.astype(float)
0    1.0
1    2.0
2    3.0
3    4.0
dtype: float64
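Note that astype raises an error if any value can't be converted. When a column might contain stray non-numeric values, pd.to_numeric with errors='coerce' is a gentler alternative that turns unparseable entries into NaN. A sketch:

pd.to_numeric(pd.Series(['1', '2', 'three']), errors='coerce')

0    1.0
1    2.0
2    NaN
dtype: float64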
As we just discovered:

- numpy is optimized for speed and memory consumption.
- pandas makes implementation choices that prioritize correctness and ease of use.

To demonstrate, let's create a large array in which all of the entries are non-negative numbers less than 255, meaning that they can be represented with 8 bits (i.e. as np.uint8s, where the "u" stands for "unsigned").
data = np.random.choice(np.arange(8), 10 ** 6)
When we tell pandas to use a dtype of uint8, the size of the resulting DataFrame is under a megabyte.
ser1 = pd.Series(data, dtype=np.uint8).to_frame()
ser1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 1 columns):
 #   Column  Non-Null Count    Dtype
---  ------  --------------    -----
 0   0       1000000 non-null  uint8
dtypes: uint8(1)
memory usage: 976.7 KB
But by default, even though the numbers only need 8 bits, pandas uses the int64 dtype, and the resulting DataFrame takes up over 7 megabytes.
ser2 = pd.Series(data).to_frame()
ser2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 1 columns):
 #   Column  Non-Null Count    Dtype
---  ------  --------------    -----
 0   0       1000000 non-null  int64
dtypes: int64(1)
memory usage: 7.6 MB
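If a DataFrame has already been loaded with the default int64, you can downcast it after the fact with astype. A sketch – the result should match ser1's memory footprint:

ser2.astype(np.uint8).info()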
- The head/tail methods return the first/last few rows (the default is 5).
- The shape attribute returns the number of rows and columns.
- The size attribute returns the number of entries (rows times columns).

elections.head()
| | Candidate | Party | % | Year | Result |
|---|---|---|---|---|---|
| 0 | Reagan | Republican | 50.7 | 1980 | win |
| 1 | Carter | Democratic | 41.0 | 1980 | loss |
| 2 | Anderson | Independent | 6.6 | 1980 | loss |
| 3 | Reagan | Republican | 58.8 | 1984 | win |
| 4 | Mondale | Democratic | 37.6 | 1984 | loss |
elections.shape
(25, 5)
elections.size
125
| Method Name | Description |
|---|---|
| count | Returns the number of non-null entries in the Series |
| unique | Returns the unique values in the Series |
| nunique | Returns the number of unique values in the Series |
| value_counts | Returns a Series of counts of unique values |
| describe | Returns a Series of descriptive stats of values |
elections.head()
| | Candidate | Party | % | Year | Result |
|---|---|---|---|---|---|
| 0 | Reagan | Republican | 50.7 | 1980 | win |
| 1 | Carter | Democratic | 41.0 | 1980 | loss |
| 2 | Anderson | Independent | 6.6 | 1980 | loss |
| 3 | Reagan | Republican | 58.8 | 1984 | win |
| 4 | Mondale | Democratic | 37.6 | 1984 | loss |
# Distinct candidates
elections['Candidate'].unique()
array(['Reagan', 'Carter', 'Anderson', 'Mondale', 'Bush', 'Dukakis',
'Clinton', 'Perot', 'Dole', 'Gore', 'Kerry', 'Obama', 'McCain',
'Romney', 'Trump', 'Biden'], dtype=object)
# Number of distinct candidates
elections['Candidate'].nunique()
16
# Total number of candidates
elections['Candidate'].count()
25
# 🤔
republicans = elections.loc[elections['Party'] == 'Republican']
republicans['Result'].value_counts()
win     6
loss    5
Name: Result, dtype: int64
republicans['%'].describe()
count    11.000000
mean     47.772727
std       5.798464
min      37.400000
25%      45.900000
50%      47.200000
75%      50.700000
max      58.800000
Name: %, dtype: float64
Many DataFrame methods accept an axis keyword argument:

- If axis=0, the method is applied across the rows (i.e. to each column).
- If axis=1, the method is applied across the columns (i.e. to each row).
- The default is axis=0.

elections.head()
| | Candidate | Party | % | Year | Result |
|---|---|---|---|---|---|
| 0 | Reagan | Republican | 50.7 | 1980 | win |
| 1 | Carter | Democratic | 41.0 | 1980 | loss |
| 2 | Anderson | Independent | 6.6 | 1980 | loss |
| 3 | Reagan | Republican | 58.8 | 1984 | win |
| 4 | Mondale | Democratic | 37.6 | 1984 | loss |
elections[['%', 'Year']].mean()
%         43.04
Year    1998.72
dtype: float64
The following piece of code works, but is meaningless. Why?
elections[['%', 'Year']].mean(axis=1)
0     1015.35
1     1010.50
2      993.30
3     1021.40
4     1010.80
5     1020.70
6     1016.80
7     1017.50
8     1014.70
9     1005.45
10    1022.60
11    1018.35
12    1002.20
13    1024.20
14    1023.95
15    1026.15
16    1027.35
17    1030.45
18    1026.85
19    1031.55
20    1029.60
21    1032.10
22    1031.05
23    1035.65
24    1033.45
dtype: float64
| Method Name | Description |
|---|---|
| sort_values | Returns a DataFrame sorted by the specified column |
| drop_duplicates | Returns a DataFrame with duplicate values dropped |
| describe | Returns descriptive stats of the DataFrame |
elections.sort_values('%', ascending=False).head(4)
| | Candidate | Party | % | Year | Result |
|---|---|---|---|---|---|
| 3 | Reagan | Republican | 58.8 | 1984 | win |
| 5 | Bush | Republican | 53.4 | 1988 | win |
| 17 | Obama | Democratic | 52.9 | 2008 | win |
| 23 | Biden | Democratic | 51.3 | 2020 | win |
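sort_values also accepts a list of columns, along with a matching list for ascending. For instance, this sketch sorts chronologically and breaks ties within a year by vote share:

elections.sort_values(['Year', '%'], ascending=[True, False]).head()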
# By default, drop_duplicates looks for entirely duplicated rows, which elections doesn't have.
# Using subset=['Candidate'] keeps only the first row for each candidate.
elections.drop_duplicates(subset=['Candidate'])
| | Candidate | Party | % | Year | Result |
|---|---|---|---|---|---|
| 0 | Reagan | Republican | 50.7 | 1980 | win |
| 1 | Carter | Democratic | 41.0 | 1980 | loss |
| 2 | Anderson | Independent | 6.6 | 1980 | loss |
| 4 | Mondale | Democratic | 37.6 | 1984 | loss |
| 5 | Bush | Republican | 53.4 | 1988 | win |
| 6 | Dukakis | Democratic | 45.6 | 1988 | loss |
| 7 | Clinton | Democratic | 43.0 | 1992 | win |
| 9 | Perot | Independent | 18.9 | 1992 | loss |
| 11 | Dole | Republican | 40.7 | 1996 | loss |
| 13 | Gore | Democratic | 48.4 | 2000 | loss |
| 15 | Kerry | Democratic | 48.3 | 2004 | loss |
| 17 | Obama | Democratic | 52.9 | 2008 | win |
| 18 | McCain | Republican | 45.7 | 2008 | loss |
| 20 | Romney | Republican | 47.2 | 2012 | loss |
| 22 | Trump | Republican | 46.1 | 2016 | win |
| 23 | Biden | Democratic | 51.3 | 2020 | win |
elections.describe()
| | % | Year |
|---|---|---|
| count | 25.000000 | 25.000000 |
| mean | 43.040000 | 1998.720000 |
| std | 13.046136 | 12.843675 |
| min | 6.600000 | 1980.000000 |
| 25% | 41.000000 | 1988.000000 |
| 50% | 47.200000 | 1996.000000 |
| 75% | 50.700000 | 2008.000000 |
| max | 58.800000 | 2020.000000 |
- To add new columns (or overwrite existing ones) without modifying the original DataFrame, use the assign method.
- To add new rows, use the append method.
- Both assign and append return a copy of the DataFrame, which is a great feature!

As an aside, you should try your best to write chained pandas code, as follows:
(
elections
.assign(proportion_of_vote=(elections['%'] / 100))
.head()
)
| | Candidate | Party | % | Year | Result | proportion_of_vote |
|---|---|---|---|---|---|---|
| 0 | Reagan | Republican | 50.7 | 1980 | win | 0.507 |
| 1 | Carter | Democratic | 41.0 | 1980 | loss | 0.410 |
| 2 | Anderson | Independent | 6.6 | 1980 | loss | 0.066 |
| 3 | Reagan | Republican | 58.8 | 1984 | win | 0.588 |
| 4 | Mondale | Democratic | 37.6 | 1984 | loss | 0.376 |
You can chain together several steps at a time:
(
elections
.assign(proportion_of_vote=(elections['%'] / 100))
.assign(Result=elections['Result'].str.upper())
.head()
)
| | Candidate | Party | % | Year | Result | proportion_of_vote |
|---|---|---|---|---|---|---|
| 0 | Reagan | Republican | 50.7 | 1980 | WIN | 0.507 |
| 1 | Carter | Democratic | 41.0 | 1980 | LOSS | 0.410 |
| 2 | Anderson | Independent | 6.6 | 1980 | LOSS | 0.066 |
| 3 | Reagan | Republican | 58.8 | 1984 | WIN | 0.588 |
| 4 | Mondale | Democratic | 37.6 | 1984 | LOSS | 0.376 |
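One subtlety: each assign in the chain above still references the original elections DataFrame. If a later step needs the result of an earlier step, pass assign a function, and pandas will call it on the intermediate DataFrame. A sketch (the rounded column name is ours, purely for illustration):

(
    elections
    .assign(proportion_of_vote=lambda df: df['%'] / 100)
    .assign(rounded=lambda df: df['proportion_of_vote'].round(2))
    .head()
)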
You can also use assign when the desired column name has spaces, by using keyword arguments.
(
elections
.assign(**{'Proportion of Vote': elections['%'] / 100})
.head()
)
| | Candidate | Party | % | Year | Result | Proportion of Vote |
|---|---|---|---|---|---|---|
| 0 | Reagan | Republican | 50.7 | 1980 | win | 0.507 |
| 1 | Carter | Democratic | 41.0 | 1980 | loss | 0.410 |
| 2 | Anderson | Independent | 6.6 | 1980 | loss | 0.066 |
| 3 | Reagan | Republican | 58.8 | 1984 | win | 0.588 |
| 4 | Mondale | Democratic | 37.6 | 1984 | loss | 0.376 |
- ⚠️ append has terrible time complexity – avoid calling append in a loop!
- You can also add and modify columns using loc or []. Unlike assign, this modifies the underlying DataFrame rather than a copy of it.

# By default, .copy() returns a deep copy of the object it is called on,
# meaning that if you change the copy, the original remains unmodified.
mod_elec = elections.copy()
mod_elec.head()
| | Candidate | Party | % | Year | Result |
|---|---|---|---|---|---|
| 0 | Reagan | Republican | 50.7 | 1980 | win |
| 1 | Carter | Democratic | 41.0 | 1980 | loss |
| 2 | Anderson | Independent | 6.6 | 1980 | loss |
| 3 | Reagan | Republican | 58.8 | 1984 | win |
| 4 | Mondale | Democratic | 37.6 | 1984 | loss |
mod_elec['Proportion of Vote'] = mod_elec['%'] / 100
mod_elec.head()
| | Candidate | Party | % | Year | Result | Proportion of Vote |
|---|---|---|---|---|---|---|
| 0 | Reagan | Republican | 50.7 | 1980 | win | 0.507 |
| 1 | Carter | Democratic | 41.0 | 1980 | loss | 0.410 |
| 2 | Anderson | Independent | 6.6 | 1980 | loss | 0.066 |
| 3 | Reagan | Republican | 58.8 | 1984 | win | 0.588 |
| 4 | Mondale | Democratic | 37.6 | 1984 | loss | 0.376 |
mod_elec['Result'] = mod_elec['Result'].str.upper()
mod_elec.head()
| | Candidate | Party | % | Year | Result | Proportion of Vote |
|---|---|---|---|---|---|---|
| 0 | Reagan | Republican | 50.7 | 1980 | WIN | 0.507 |
| 1 | Carter | Democratic | 41.0 | 1980 | LOSS | 0.410 |
| 2 | Anderson | Independent | 6.6 | 1980 | LOSS | 0.066 |
| 3 | Reagan | Republican | 58.8 | 1984 | WIN | 0.588 |
| 4 | Mondale | Democratic | 37.6 | 1984 | LOSS | 0.376 |
# 🤔
mod_elec.loc[-1, :] = ['Carter', 'Democratic', 50.1, 1976, 'WIN', 0.501]
mod_elec.loc[-2, :] = ['Ford', 'Republican', 48.0, 1976, 'LOSS', 0.48]
mod_elec
| | Candidate | Party | % | Year | Result | Proportion of Vote |
|---|---|---|---|---|---|---|
| 0 | Reagan | Republican | 50.7 | 1980.0 | WIN | 0.507 |
| 1 | Carter | Democratic | 41.0 | 1980.0 | LOSS | 0.410 |
| 2 | Anderson | Independent | 6.6 | 1980.0 | LOSS | 0.066 |
| 3 | Reagan | Republican | 58.8 | 1984.0 | WIN | 0.588 |
| 4 | Mondale | Democratic | 37.6 | 1984.0 | LOSS | 0.376 |
| 5 | Bush | Republican | 53.4 | 1988.0 | WIN | 0.534 |
| 6 | Dukakis | Democratic | 45.6 | 1988.0 | LOSS | 0.456 |
| 7 | Clinton | Democratic | 43.0 | 1992.0 | WIN | 0.430 |
| 8 | Bush | Republican | 37.4 | 1992.0 | LOSS | 0.374 |
| 9 | Perot | Independent | 18.9 | 1992.0 | LOSS | 0.189 |
| 10 | Clinton | Democratic | 49.2 | 1996.0 | WIN | 0.492 |
| 11 | Dole | Republican | 40.7 | 1996.0 | LOSS | 0.407 |
| 12 | Perot | Independent | 8.4 | 1996.0 | LOSS | 0.084 |
| 13 | Gore | Democratic | 48.4 | 2000.0 | LOSS | 0.484 |
| 14 | Bush | Republican | 47.9 | 2000.0 | WIN | 0.479 |
| 15 | Kerry | Democratic | 48.3 | 2004.0 | LOSS | 0.483 |
| 16 | Bush | Republican | 50.7 | 2004.0 | WIN | 0.507 |
| 17 | Obama | Democratic | 52.9 | 2008.0 | WIN | 0.529 |
| 18 | McCain | Republican | 45.7 | 2008.0 | LOSS | 0.457 |
| 19 | Obama | Democratic | 51.1 | 2012.0 | WIN | 0.511 |
| 20 | Romney | Republican | 47.2 | 2012.0 | LOSS | 0.472 |
| 21 | Clinton | Democratic | 48.2 | 2016.0 | LOSS | 0.482 |
| 22 | Trump | Republican | 46.1 | 2016.0 | WIN | 0.461 |
| 23 | Biden | Democratic | 51.3 | 2020.0 | WIN | 0.513 |
| 24 | Trump | Republican | 46.9 | 2020.0 | LOSS | 0.469 |
| -1 | Carter | Democratic | 50.1 | 1976.0 | WIN | 0.501 |
| -2 | Ford | Republican | 48.0 | 1976.0 | LOSS | 0.480 |
mod_elec = mod_elec.sort_index()
mod_elec.head()
| | Candidate | Party | % | Year | Result | Proportion of Vote |
|---|---|---|---|---|---|---|
| -2 | Ford | Republican | 48.0 | 1976.0 | LOSS | 0.480 |
| -1 | Carter | Democratic | 50.1 | 1976.0 | WIN | 0.501 |
| 0 | Reagan | Republican | 50.7 | 1980.0 | WIN | 0.507 |
| 1 | Carter | Democratic | 41.0 | 1980.0 | LOSS | 0.410 |
| 2 | Anderson | Independent | 6.6 | 1980.0 | LOSS | 0.066 |
# df.reset_index(drop=True) drops the current index
# of the DataFrame and replaces it with an index of increasing integers
mod_elec.reset_index(drop=True)
| | Candidate | Party | % | Year | Result | Proportion of Vote |
|---|---|---|---|---|---|---|
| 0 | Ford | Republican | 48.0 | 1976.0 | LOSS | 0.480 |
| 1 | Carter | Democratic | 50.1 | 1976.0 | WIN | 0.501 |
| 2 | Reagan | Republican | 50.7 | 1980.0 | WIN | 0.507 |
| 3 | Carter | Democratic | 41.0 | 1980.0 | LOSS | 0.410 |
| 4 | Anderson | Independent | 6.6 | 1980.0 | LOSS | 0.066 |
| 5 | Reagan | Republican | 58.8 | 1984.0 | WIN | 0.588 |
| 6 | Mondale | Democratic | 37.6 | 1984.0 | LOSS | 0.376 |
| 7 | Bush | Republican | 53.4 | 1988.0 | WIN | 0.534 |
| 8 | Dukakis | Democratic | 45.6 | 1988.0 | LOSS | 0.456 |
| 9 | Clinton | Democratic | 43.0 | 1992.0 | WIN | 0.430 |
| 10 | Bush | Republican | 37.4 | 1992.0 | LOSS | 0.374 |
| 11 | Perot | Independent | 18.9 | 1992.0 | LOSS | 0.189 |
| 12 | Clinton | Democratic | 49.2 | 1996.0 | WIN | 0.492 |
| 13 | Dole | Republican | 40.7 | 1996.0 | LOSS | 0.407 |
| 14 | Perot | Independent | 8.4 | 1996.0 | LOSS | 0.084 |
| 15 | Gore | Democratic | 48.4 | 2000.0 | LOSS | 0.484 |
| 16 | Bush | Republican | 47.9 | 2000.0 | WIN | 0.479 |
| 17 | Kerry | Democratic | 48.3 | 2004.0 | LOSS | 0.483 |
| 18 | Bush | Republican | 50.7 | 2004.0 | WIN | 0.507 |
| 19 | Obama | Democratic | 52.9 | 2008.0 | WIN | 0.529 |
| 20 | McCain | Republican | 45.7 | 2008.0 | LOSS | 0.457 |
| 21 | Obama | Democratic | 51.1 | 2012.0 | WIN | 0.511 |
| 22 | Romney | Republican | 47.2 | 2012.0 | LOSS | 0.472 |
| 23 | Clinton | Democratic | 48.2 | 2016.0 | LOSS | 0.482 |
| 24 | Trump | Republican | 46.1 | 2016.0 | WIN | 0.461 |
| 25 | Biden | Democratic | 51.3 | 2020.0 | WIN | 0.513 |
| 26 | Trump | Republican | 46.9 | 2020.0 | LOSS | 0.469 |
Note: We probably won't finish looking at all of this code in lecture, but we will leave it here for you as a reference.
Let's work with the same dataset that we did in Lecture 1, using our new knowledge of pandas.
salaries = pd.read_csv('https://transcal.s3.amazonaws.com/public/export/san-diego-2020.csv')
salaries['Employee Name'] = salaries['Employee Name'].str.split().str[0] + ' xxxxx'
salaries.head()
| | Employee Name | Job Title | Base Pay | Overtime Pay | Other Pay | Benefits | Total Pay | Total Pay & Benefits | Year | Notes | Agency | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Michael xxxxx | Police Officer | 117691.0 | 187290.0 | 13331.00 | 36380.0 | 318312.0 | 354692.0 | 2020 | NaN | San Diego | FT |
| 1 | Gary xxxxx | Police Officer | 117691.0 | 160062.0 | 42946.00 | 31795.0 | 320699.0 | 352494.0 | 2020 | NaN | San Diego | FT |
| 2 | Eric xxxxx | Fire Engineer | 35698.0 | 204462.0 | 69121.00 | 38362.0 | 309281.0 | 347643.0 | 2020 | NaN | San Diego | PT |
| 3 | Gregg xxxxx | Retirement Administrator | 305000.0 | 0.0 | 12814.00 | 24792.0 | 317814.0 | 342606.0 | 2020 | NaN | San Diego | FT |
| 4 | Joseph xxxxx | Fire Battalion Chief | 94451.0 | 157778.0 | 48151.00 | 42096.0 | 300380.0 | 342476.0 | 2020 | NaN | San Diego | FT |
salaries.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12605 entries, 0 to 12604
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   Employee Name         12605 non-null  object
 1   Job Title             12605 non-null  object
 2   Base Pay              12605 non-null  float64
 3   Overtime Pay          12605 non-null  float64
 4   Other Pay             12605 non-null  object
 5   Benefits              12605 non-null  float64
 6   Total Pay             12605 non-null  float64
 7   Total Pay & Benefits  12605 non-null  float64
 8   Year                  12605 non-null  int64
 9   Notes                 0 non-null      float64
 10  Agency                12605 non-null  object
 11  Status                12605 non-null  object
dtypes: float64(6), int64(1), object(5)
memory usage: 1.2+ MB
Current issues with the dataset:

- Some columns contain no information at all ('Notes') or the same value in all rows ('Year', 'Agency') – let's drop them.
- The values in the 'Other Pay' column should be numeric, but currently they're not.

# Dropping useless columns
salaries = salaries.drop(['Year', 'Notes', 'Agency'], axis=1)
salaries.head()
| | Employee Name | Job Title | Base Pay | Overtime Pay | Other Pay | Benefits | Total Pay | Total Pay & Benefits | Status |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Michael xxxxx | Police Officer | 117691.0 | 187290.0 | 13331.00 | 36380.0 | 318312.0 | 354692.0 | FT |
| 1 | Gary xxxxx | Police Officer | 117691.0 | 160062.0 | 42946.00 | 31795.0 | 320699.0 | 352494.0 | FT |
| 2 | Eric xxxxx | Fire Engineer | 35698.0 | 204462.0 | 69121.00 | 38362.0 | 309281.0 | 347643.0 | PT |
| 3 | Gregg xxxxx | Retirement Administrator | 305000.0 | 0.0 | 12814.00 | 24792.0 | 317814.0 | 342606.0 | FT |
| 4 | Joseph xxxxx | Fire Battalion Chief | 94451.0 | 157778.0 | 48151.00 | 42096.0 | 300380.0 | 342476.0 | FT |
### The 'Other Pay' column

salaries['Other Pay'].dtype
dtype('O')
salaries['Other Pay'].unique()
array(['13331.00', '42946.00', '69121.00', ..., '119.00', '47.00', '4.00'],
dtype=object)
It appears that most of the values in the 'Other Pay' column are strings containing numbers. Which values are not numbers?
# ~ negates the boolean mask. regex=False makes .str.contains match the literal
# substring '.00' (by default, '.' is treated as a regex wildcard).
salaries.loc[~salaries['Other Pay'].str.contains('.00', regex=False)]
| | Employee Name | Job Title | Base Pay | Overtime Pay | Other Pay | Benefits | Total Pay | Total Pay & Benefits | Status |
|---|---|---|---|---|---|---|---|---|---|
| 10347 | Elmo xxxxx | Laborer | 27288.0 | 489.0 | Aggregate | 1103.0 | 27774.0 | 28877.0 | PT |
| 10624 | John xxxxx | Golf Operations Assistant | 19278.0 | 0.0 | Aggregate | 1157.0 | 19275.0 | 20432.0 | PT |
| 11988 | Benjamin xxxxx | Stock Clerk(Auto Parts Stock Clrk) | 1234.0 | 54.0 | Aggregate | 681.0 | 1285.0 | 1966.0 | PT |
| 12079 | Eric xxxxx | Traffic Signal Technician 1 | 0.0 | 1290.0 | Aggregate | 119.0 | 1287.0 | 1406.0 | PT |
| 12088 | Teshome xxxxx | Instrumentation & Control Tech | 0.0 | 1248.0 | Aggregate | 115.0 | 1245.0 | 1360.0 | PT |
| 12093 | Nicholas xxxxx | Deputy Chief Oper Ofcr | 1231.0 | 0.0 | Aggregate | 113.0 | 1228.0 | 1341.0 | PT |
| 12097 | Jonathan xxxxx | Laborer | 1198.0 | 0.0 | Aggregate | 110.0 | 1195.0 | 1305.0 | PT |
| 12098 | Gilberto xxxxx | Laborer | 1198.0 | 0.0 | Aggregate | 110.0 | 1195.0 | 1305.0 | PT |
| 12101 | Jason xxxxx | Plant Procs Cntrl Electrician | 0.0 | 1150.0 | Aggregate | 106.0 | 1147.0 | 1253.0 | PT |
| 12115 | Raynaldo xxxxx | Asoc Eng-Electrical | 0.0 | 1113.0 | Aggregate | 102.0 | 1110.0 | 1212.0 | PT |
| 12124 | Mario xxxxx | Heavy Truck Drvr 2 | 0.0 | 1068.0 | Aggregate | 98.0 | 1065.0 | 1163.0 | PT |
| 12142 | Ivan xxxxx | Utility Worker 1 | 0.0 | 1002.0 | Aggregate | 92.0 | 999.0 | 1091.0 | PT |
We can keep just the rows where the 'Other Pay' is numeric, and then convert the 'Other Pay' column to float.
# == True is defensive: any non-True result is mapped to False.
# .copy() ensures we get a new DataFrame, not a view of the original.
salaries = salaries.loc[salaries['Other Pay'].str.contains('.00', regex=False) == True].copy()
salaries['Other Pay'] = salaries['Other Pay'].astype(float)
salaries.head()
| | Employee Name | Job Title | Base Pay | Overtime Pay | Other Pay | Benefits | Total Pay | Total Pay & Benefits | Status |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Michael xxxxx | Police Officer | 117691.0 | 187290.0 | 13331.0 | 36380.0 | 318312.0 | 354692.0 | FT |
| 1 | Gary xxxxx | Police Officer | 117691.0 | 160062.0 | 42946.0 | 31795.0 | 320699.0 | 352494.0 | FT |
| 2 | Eric xxxxx | Fire Engineer | 35698.0 | 204462.0 | 69121.0 | 38362.0 | 309281.0 | 347643.0 | PT |
| 3 | Gregg xxxxx | Retirement Administrator | 305000.0 | 0.0 | 12814.0 | 24792.0 | 317814.0 | 342606.0 | FT |
| 4 | Joseph xxxxx | Fire Battalion Chief | 94451.0 | 157778.0 | 48151.0 | 42096.0 | 300380.0 | 342476.0 | FT |
The code above is correct, but it errors if you run it more than once. Why? 🤔
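Hint: after the first run, the 'Other Pay' column holds floats, and the .str accessor only works on strings. One way to make the cell safe to re-run is to guard on the column's type – a sketch, assuming salaries hasn't been modified elsewhere:

if salaries['Other Pay'].dtype == object:
    salaries = salaries.loc[salaries['Other Pay'].str.contains('.00', regex=False) == True].copy()
    salaries['Other Pay'] = salaries['Other Pay'].astype(float)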
What happens when we use normalize=True with value_counts?
salaries['Status'].value_counts()
FT    8352
PT    4241
Name: Status, dtype: int64
salaries['Status'].value_counts(normalize=True)
FT    0.663226
PT    0.336774
Name: Status, dtype: float64
# Salary statistics
salaries.describe()
| | Base Pay | Overtime Pay | Other Pay | Benefits | Total Pay | Total Pay & Benefits |
|---|---|---|---|---|---|---|
| count | 12593.000000 | 12593.000000 | 12593.000000 | 12593.000000 | 12593.000000 | 12593.000000 |
| mean | 56116.049948 | 8796.886683 | 10335.356388 | 15471.286905 | 75248.293020 | 90719.579925 |
| std | 35754.156457 | 18514.818694 | 13689.953848 | 11179.525879 | 49609.659842 | 58801.814650 |
| min | 0.000000 | -293.000000 | -25236.000000 | -39.000000 | 0.000000 | 1.000000 |
| 25% | 32830.000000 | 0.000000 | 1249.000000 | 6461.000000 | 41266.000000 | 52567.000000 |
| 50% | 54749.000000 | 588.000000 | 6791.000000 | 14582.000000 | 71421.000000 | 86301.000000 |
| 75% | 77512.000000 | 9190.000000 | 13796.000000 | 22748.000000 | 107039.000000 | 128250.000000 |
| max | 305000.000000 | 204462.000000 | 144201.000000 | 86066.000000 | 320699.000000 | 354692.000000 |
Question: Is 'Total Pay' equal to the sum of 'Base Pay', 'Overtime Pay', and 'Other Pay'?
We can answer this by summing the latter three columns and seeing if the resulting Series equals the former column.
salaries.loc[:, ['Base Pay', 'Overtime Pay', 'Other Pay']].sum(axis=1)
0 318312.0
1 320699.0
2 309281.0
3 317814.0
4 300380.0
...
12600 2.0
12601 2.0
12602 1.0
12603 1.0
12604 1.0
Length: 12593, dtype: float64
salaries['Total Pay']
0 318312.0
1 320699.0
2 309281.0
3 317814.0
4 300380.0
...
12600 2.0
12601 2.0
12602 1.0
12603 1.0
12604 1.0
Name: Total Pay, Length: 12593, dtype: float64
(salaries.loc[:, ['Base Pay', 'Overtime Pay', 'Other Pay']].sum(axis=1) == salaries['Total Pay']).all()
True
Similarly, we might ask whether 'Total Pay & Benefits' is truly the sum of 'Total Pay' and 'Benefits'.
(salaries.loc[:, ['Total Pay', 'Benefits']].sum(axis=1) == salaries.loc[:, 'Total Pay & Benefits']).all()
True
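The exact == comparisons happened to work here, but equality checks on derived floats can be brittle. np.isclose is the safer test – a sketch using numpy's default tolerances:

np.isclose(
    salaries[['Total Pay', 'Benefits']].sum(axis=1),
    salaries['Total Pay & Benefits']
).all()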
salaries['Total Pay & Benefits'].plot(kind='hist', density=False, bins=20, ec='w');
salaries.plot(kind='scatter', x='Base Pay', y='Overtime Pay');
pd.plotting.scatter_matrix(salaries[['Base Pay', 'Overtime Pay', 'Total Pay']], figsize=(8, 8));
Think of your own questions about the dataset, and try to answer them!
### Summary

- pandas relies heavily on numpy. An understanding of how data types work in both will allow you to write more efficient and bug-free code (see the pandas documentation for more details).
- Most pandas methods return copies of Series/DataFrames. Be careful when using techniques that modify values in-place.