import pandas as pd
import numpy as np
import os
pd.set_option('display.max_rows', 9)
pandas.get vs. [].loc and iloc.Remember, we are not going to cover every single detail! The pandas documentation will be your friend.
pandas 🐼¶

pandas¶
pandas is the Python library for tabular data manipulation.pandas was developed, the standard data science workflow involved using multiple languages (Python, R, Java) in a single project.pandas, wanted a library which would allow everything to be done in Python.pandas data structures¶There are three key data structures at the core of pandas:

pandas and related libraries¶We've already run this at the top of the notebook, so we won't repeat it here. But pandas is almost always imported in conjunction with numpy:
import pandas as pd
import numpy as np
To refresh our memory on the basics of pandas, let's work with a dataset that contains the name, location, enrollment, and founding date of most UCs and CSUs.
pd.read_csv to load a DataFrame from file.os.path.join('x', 'y.csv') evaluates to 'x/y.csv' on Unix machines and 'x\y.csv' on Windows.schools_path = os.path.join('data', 'california_universities.csv')
schools = pd.read_csv(schools_path)
schools
| Name | City | County | Enrollment | Founded | |
|---|---|---|---|---|---|
| 0 | Humboldt State University | Arcata | Humboldt | 7,774 | 1913 |
| 1 | California State University, Bakersfield | Bakersfield | Kern | 10,493 | 1965 |
| 2 | University of California, Berkeley | Berkeley | Alameda | 42,519 | 1869 |
| 3 | California State University Channel Islands | Camarillo | Ventura | 7,095 | 2002 |
| ... | ... | ... | ... | ... | ... |
| 28 | University of California, Santa Cruz | Santa Cruz | Santa Cruz | 19,700 | 1965 |
| 29 | California State University, Monterey Bay | Seaside-Marina | Monterey | 7,079 | 1994 |
| 30 | California State University, Stanislaus | Turlock | Stanislaus | 10,214 | 1957 |
| 31 | California State University Maritime Academy | Vallejo | Solano | 1,017 | 1929 |
32 rows × 5 columns
To extract the first or last few rows of a DataFrame, use the head or tail methods.
schools.head()
| Name | City | County | Enrollment | Founded | |
|---|---|---|---|---|---|
| 0 | Humboldt State University | Arcata | Humboldt | 7,774 | 1913 |
| 1 | California State University, Bakersfield | Bakersfield | Kern | 10,493 | 1965 |
| 2 | University of California, Berkeley | Berkeley | Alameda | 42,519 | 1869 |
| 3 | California State University Channel Islands | Camarillo | Ventura | 7,095 | 2002 |
| 4 | California State University, Dominguez Hills | Carson | Los Angeles | 15,741 | 1960 |
schools.tail(2)
| Name | City | County | Enrollment | Founded | |
|---|---|---|---|---|---|
| 30 | California State University, Stanislaus | Turlock | Stanislaus | 10,214 | 1957 |
| 31 | California State University Maritime Academy | Vallejo | Solano | 1,017 | 1929 |
The shape attribute returns the DataFrame's number of rows and columns.
schools.shape
(32, 5)
Each row and column of a DataFrame is a Series.
Index types.schools
| Name | City | County | Enrollment | Founded | |
|---|---|---|---|---|---|
| 0 | Humboldt State University | Arcata | Humboldt | 7,774 | 1913 |
| 1 | California State University, Bakersfield | Bakersfield | Kern | 10,493 | 1965 |
| 2 | University of California, Berkeley | Berkeley | Alameda | 42,519 | 1869 |
| 3 | California State University Channel Islands | Camarillo | Ventura | 7,095 | 2002 |
| ... | ... | ... | ... | ... | ... |
| 28 | University of California, Santa Cruz | Santa Cruz | Santa Cruz | 19,700 | 1965 |
| 29 | California State University, Monterey Bay | Seaside-Marina | Monterey | 7,079 | 1994 |
| 30 | California State University, Stanislaus | Turlock | Stanislaus | 10,214 | 1957 |
| 31 | California State University Maritime Academy | Vallejo | Solano | 1,017 | 1929 |
32 rows × 5 columns
# Index is 0, 1, 2, ..., 31
schools['City']
0 Arcata
1 Bakersfield
2 Berkeley
3 Camarillo
...
28 Santa Cruz
29 Seaside-Marina
30 Turlock
31 Vallejo
Name: City, Length: 32, dtype: object
# The default index of a DataFrame is 0, 1, 2, 3, ...
schools.index
RangeIndex(start=0, stop=32, step=1)
# Index is 'Name', 'City', 'County', ...
schools.iloc[-5]
Name University of California, Santa Barbara City Santa Barbara County Santa Barbara Enrollment 24,346 Founded 1891 Name: 27, dtype: object
schools.columns
Index(['Name', 'City', 'County', 'Enrollment', 'Founded'], dtype='object')
The order of the rows in schools does not seem to be meaningful right now. To sort by a column, use the sort_values method. Like most DataFrame and Series methods, sort_values returns a new DataFrame, and doesn't modify the original.
schools
| Name | City | County | Enrollment | Founded | |
|---|---|---|---|---|---|
| 0 | Humboldt State University | Arcata | Humboldt | 7,774 | 1913 |
| 1 | California State University, Bakersfield | Bakersfield | Kern | 10,493 | 1965 |
| 2 | University of California, Berkeley | Berkeley | Alameda | 42,519 | 1869 |
| 3 | California State University Channel Islands | Camarillo | Ventura | 7,095 | 2002 |
| ... | ... | ... | ... | ... | ... |
| 28 | University of California, Santa Cruz | Santa Cruz | Santa Cruz | 19,700 | 1965 |
| 29 | California State University, Monterey Bay | Seaside-Marina | Monterey | 7,079 | 1994 |
| 30 | California State University, Stanislaus | Turlock | Stanislaus | 10,214 | 1957 |
| 31 | California State University Maritime Academy | Vallejo | Solano | 1,017 | 1929 |
32 rows × 5 columns
schools.sort_values('Founded')
| Name | City | County | Enrollment | Founded | |
|---|---|---|---|---|---|
| 24 | San Jose State University | San Jose | Santa Clara | 32,828 | 1857 |
| 2 | University of California, Berkeley | Berkeley | Alameda | 42,519 | 1869 |
| 13 | University of California, Los Angeles | Los Angeles | Los Angeles | 45,428 | 1882 |
| 5 | California State University, Chico | Chico | Butte | 17,488 | 1887 |
| ... | ... | ... | ... | ... | ... |
| 26 | California State University San Marcos | San Marcos | San Diego | 14,511 | 1988 |
| 29 | California State University, Monterey Bay | Seaside-Marina | Monterey | 7,079 | 1994 |
| 3 | California State University Channel Islands | Camarillo | Ventura | 7,095 | 2002 |
| 14 | University of California, Merced | Merced | Merced | 8,544 | 2005 |
32 rows × 5 columns
schools.sort_values('Name', ascending=False)
| Name | City | County | Enrollment | Founded | |
|---|---|---|---|---|---|
| 28 | University of California, Santa Cruz | Santa Cruz | Santa Cruz | 19,700 | 1965 |
| 27 | University of California, Santa Barbara | Santa Barbara | Santa Barbara | 24,346 | 1891 |
| 21 | University of California, San Diego | San Diego | San Diego | 38,798 | 1960 |
| 17 | University of California, Riverside | Riverside | Riverside | 23,278 | 1954 |
| ... | ... | ... | ... | ... | ... |
| 31 | California State University Maritime Academy | Vallejo | Solano | 1,017 | 1929 |
| 3 | California State University Channel Islands | Camarillo | Ventura | 7,095 | 2002 |
| 16 | California State Polytechnic University, Pomona | Pomona | Los Angeles | 26,443 | 1938 |
| 25 | California Polytechnic State University | San Luis Obispo | San Luis Obispo | 21,812 | 1901 |
32 rows × 5 columns
# Why isn't this sorting correctly?
schools.sort_values('Enrollment')
| Name | City | County | Enrollment | Founded | |
|---|---|---|---|---|---|
| 31 | California State University Maritime Academy | Vallejo | Solano | 1,017 | 1929 |
| 30 | California State University, Stanislaus | Turlock | Stanislaus | 10,214 | 1957 |
| 1 | California State University, Bakersfield | Bakersfield | Kern | 10,493 | 1965 |
| 26 | California State University San Marcos | San Marcos | San Diego | 14,511 | 1988 |
| ... | ... | ... | ... | ... | ... |
| 3 | California State University Channel Islands | Camarillo | Ventura | 7,095 | 2002 |
| 0 | Humboldt State University | Arcata | Humboldt | 7,774 | 1913 |
| 14 | University of California, Merced | Merced | Merced | 8,544 | 2005 |
| 18 | Sonoma State University | Rohnert Park | Sonoma | 9,201 | 1960 |
32 rows × 5 columns
Think of each row's index as its unique identifier or name. Often, we like to set the index of a DataFrame to a unique identifier if we have one available. We can do so with the set_index method.
# By reassigning schools, our changes will persist.
schools = schools.set_index('Name')
schools
| City | County | Enrollment | Founded | |
|---|---|---|---|---|
| Name | ||||
| Humboldt State University | Arcata | Humboldt | 7,774 | 1913 |
| California State University, Bakersfield | Bakersfield | Kern | 10,493 | 1965 |
| University of California, Berkeley | Berkeley | Alameda | 42,519 | 1869 |
| California State University Channel Islands | Camarillo | Ventura | 7,095 | 2002 |
| ... | ... | ... | ... | ... |
| University of California, Santa Cruz | Santa Cruz | Santa Cruz | 19,700 | 1965 |
| California State University, Monterey Bay | Seaside-Marina | Monterey | 7,079 | 1994 |
| California State University, Stanislaus | Turlock | Stanislaus | 10,214 | 1957 |
| California State University Maritime Academy | Vallejo | Solano | 1,017 | 1929 |
32 rows × 4 columns
# Only 4 columns now!
schools.shape
(32, 4)
babypandas 👶🐼¶babypandas, you selected columns using the .get method..get also works in pandas, but it is not idiomatic – people don't usually use it.schools
| City | County | Enrollment | Founded | |
|---|---|---|---|---|
| Name | ||||
| Humboldt State University | Arcata | Humboldt | 7,774 | 1913 |
| California State University, Bakersfield | Bakersfield | Kern | 10,493 | 1965 |
| University of California, Berkeley | Berkeley | Alameda | 42,519 | 1869 |
| California State University Channel Islands | Camarillo | Ventura | 7,095 | 2002 |
| ... | ... | ... | ... | ... |
| University of California, Santa Cruz | Santa Cruz | Santa Cruz | 19,700 | 1965 |
| California State University, Monterey Bay | Seaside-Marina | Monterey | 7,079 | 1994 |
| California State University, Stanislaus | Turlock | Stanislaus | 10,214 | 1957 |
| California State University Maritime Academy | Vallejo | Solano | 1,017 | 1929 |
32 rows × 4 columns
schools.get('County')
Name
Humboldt State University Humboldt
California State University, Bakersfield Kern
University of California, Berkeley Alameda
California State University Channel Islands Ventura
...
University of California, Santa Cruz Santa Cruz
California State University, Monterey Bay Monterey
California State University, Stanislaus Stanislaus
California State University Maritime Academy Solano
Name: County, Length: 32, dtype: object
# This doesn't error, but sometimes we'd like it to.
schools.get('State')
[]¶pandas is by using the [] operator.schools
| City | County | Enrollment | Founded | |
|---|---|---|---|---|
| Name | ||||
| Humboldt State University | Arcata | Humboldt | 7,774 | 1913 |
| California State University, Bakersfield | Bakersfield | Kern | 10,493 | 1965 |
| University of California, Berkeley | Berkeley | Alameda | 42,519 | 1869 |
| California State University Channel Islands | Camarillo | Ventura | 7,095 | 2002 |
| ... | ... | ... | ... | ... |
| University of California, Santa Cruz | Santa Cruz | Santa Cruz | 19,700 | 1965 |
| California State University, Monterey Bay | Seaside-Marina | Monterey | 7,079 | 1994 |
| California State University, Stanislaus | Turlock | Stanislaus | 10,214 | 1957 |
| California State University Maritime Academy | Vallejo | Solano | 1,017 | 1929 |
32 rows × 4 columns
# Returns a Series.
schools['City']
Name
Humboldt State University Arcata
California State University, Bakersfield Bakersfield
University of California, Berkeley Berkeley
California State University Channel Islands Camarillo
...
University of California, Santa Cruz Santa Cruz
California State University, Monterey Bay Seaside-Marina
California State University, Stanislaus Turlock
California State University Maritime Academy Vallejo
Name: City, Length: 32, dtype: object
# Returns a DataFrame.
schools[['Founded', 'County']]
| Founded | County | |
|---|---|---|
| Name | ||
| Humboldt State University | 1913 | Humboldt |
| California State University, Bakersfield | 1965 | Kern |
| University of California, Berkeley | 1869 | Alameda |
| California State University Channel Islands | 2002 | Ventura |
| ... | ... | ... |
| University of California, Santa Cruz | 1965 | Santa Cruz |
| California State University, Monterey Bay | 1994 | Monterey |
| California State University, Stanislaus | 1957 | Stanislaus |
| California State University Maritime Academy | 1929 | Solano |
32 rows × 2 columns
# 🤔
schools[['Founded']]
| Founded | |
|---|---|
| Name | |
| Humboldt State University | 1913 |
| California State University, Bakersfield | 1965 |
| University of California, Berkeley | 1869 |
| California State University Channel Islands | 2002 |
| ... | ... |
| University of California, Santa Cruz | 1965 |
| California State University, Monterey Bay | 1994 |
| California State University, Stanislaus | 1957 |
| California State University Maritime Academy | 1929 |
32 rows × 1 columns
# Names are stored in the index, which is not a column!
schools['Name']
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) /nix/store/87mah7hf4cl4q9m9skkl8h6jd3j32s1n-python3-3.9.6-env/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance) 3360 try: -> 3361 return self._engine.get_loc(casted_key) 3362 except KeyError as err: /nix/store/87mah7hf4cl4q9m9skkl8h6jd3j32s1n-python3-3.9.6-env/lib/python3.9/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc() /nix/store/87mah7hf4cl4q9m9skkl8h6jd3j32s1n-python3-3.9.6-env/lib/python3.9/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc() pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item() pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item() KeyError: 'Name' The above exception was the direct cause of the following exception: KeyError Traceback (most recent call last) /tmp/nix-shell.zQepNE/ipykernel_7072/3195502009.py in <module> 1 # Names are stored in the index, which is not a column! ----> 2 schools['Name'] /nix/store/87mah7hf4cl4q9m9skkl8h6jd3j32s1n-python3-3.9.6-env/lib/python3.9/site-packages/pandas/core/frame.py in __getitem__(self, key) 3456 if self.columns.nlevels > 1: 3457 return self._getitem_multilevel(key) -> 3458 indexer = self.columns.get_loc(key) 3459 if is_integer(indexer): 3460 indexer = [indexer] /nix/store/87mah7hf4cl4q9m9skkl8h6jd3j32s1n-python3-3.9.6-env/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance) 3361 return self._engine.get_loc(casted_key) 3362 except KeyError as err: -> 3363 raise KeyError(key) from err 3364 3365 if is_scalar(key) and isna(key) and not self.hasnans: KeyError: 'Name'
schools.index
Index(['Humboldt State University', 'California State University, Bakersfield',
'University of California, Berkeley',
'California State University Channel Islands',
'California State University, Dominguez Hills',
'California State University, Chico', 'University of California, Davis',
'California State University, Fresno',
'California State University, Fullerton',
'California State University, East Bay',
'University of California, Irvine',
'California State University, Long Beach',
'California State University, Los Angeles',
'University of California, Los Angeles',
'University of California, Merced',
'California State University, Northridge',
'California State Polytechnic University, Pomona',
'University of California, Riverside', 'Sonoma State University',
'California State University, Sacramento',
'California State University, San Bernardino',
'University of California, San Diego', 'San Diego State University',
'San Francisco State University', 'San Jose State University',
'California Polytechnic State University',
'California State University San Marcos',
'University of California, Santa Barbara',
'University of California, Santa Cruz',
'California State University, Monterey Bay',
'California State University, Stanislaus',
'California State University Maritime Academy'],
dtype='object', name='Name')
schools
| City | County | Enrollment | Founded | |
|---|---|---|---|---|
| Name | ||||
| Humboldt State University | Arcata | Humboldt | 7,774 | 1913 |
| California State University, Bakersfield | Bakersfield | Kern | 10,493 | 1965 |
| University of California, Berkeley | Berkeley | Alameda | 42,519 | 1869 |
| California State University Channel Islands | Camarillo | Ventura | 7,095 | 2002 |
| ... | ... | ... | ... | ... |
| University of California, Santa Cruz | Santa Cruz | Santa Cruz | 19,700 | 1965 |
| California State University, Monterey Bay | Seaside-Marina | Monterey | 7,079 | 1994 |
| California State University, Stanislaus | Turlock | Stanislaus | 10,214 | 1957 |
| California State University Maritime Academy | Vallejo | Solano | 1,017 | 1929 |
32 rows × 4 columns
# What's the most common county?
schools['County'].mode()
0 Los Angeles dtype: object
# How many unique counties are represented?
schools['County'].nunique()
23
# What's the distribution of counties?
schools['County'].value_counts()
Los Angeles 6
San Diego 3
Alameda 2
Orange 2
..
Yolo 1
Butte 1
Ventura 1
Solano 1
Name: County, Length: 23, dtype: int64
# What's the mean of the 'Founded' column?
schools['Founded'].mean()
1938.625
# Tell me more about the 'Founded' column.
schools['Founded'].describe()
count 32.000000 mean 1938.625000 std 38.954688 min 1857.000000 25% 1904.000000 50% 1951.500000 75% 1961.250000 max 2005.000000 Name: Founded, dtype: float64
# Sort the 'Founded' column. Note that here we're using sort_values on a Series, not a DataFrame!
schools['Founded'].sort_values()
Name
San Jose State University 1857
University of California, Berkeley 1869
University of California, Los Angeles 1882
California State University, Chico 1887
...
California State University San Marcos 1988
California State University, Monterey Bay 1994
California State University Channel Islands 2002
University of California, Merced 2005
Name: Founded, Length: 32, dtype: int64
loc to select rows using row labels¶If df is a DataFrame, then:
df.loc[idx] returns the Series whose index is idx.df.loc[idx_list] returns a DataFrame containing the rows whose indexes are in idx_list.schools
| City | County | Enrollment | Founded | |
|---|---|---|---|---|
| Name | ||||
| Humboldt State University | Arcata | Humboldt | 7,774 | 1913 |
| California State University, Bakersfield | Bakersfield | Kern | 10,493 | 1965 |
| University of California, Berkeley | Berkeley | Alameda | 42,519 | 1869 |
| California State University Channel Islands | Camarillo | Ventura | 7,095 | 2002 |
| ... | ... | ... | ... | ... |
| University of California, Santa Cruz | Santa Cruz | Santa Cruz | 19,700 | 1965 |
| California State University, Monterey Bay | Seaside-Marina | Monterey | 7,079 | 1994 |
| California State University, Stanislaus | Turlock | Stanislaus | 10,214 | 1957 |
| California State University Maritime Academy | Vallejo | Solano | 1,017 | 1929 |
32 rows × 4 columns
schools.loc['University of California, San Diego']
City San Diego County San Diego Enrollment 38,798 Founded 1960 Name: University of California, San Diego, dtype: object
schools.loc[['University of California, San Diego', 'San Diego State University']]
| City | County | Enrollment | Founded | |
|---|---|---|---|---|
| Name | ||||
| University of California, San Diego | San Diego | San Diego | 38,798 | 1960 |
| San Diego State University | San Diego | San Diego | 34,881 | 1897 |
loc operator also supports Boolean sequences (lists, arrays, Series) as input. True.schools
| City | County | Enrollment | Founded | |
|---|---|---|---|---|
| Name | ||||
| Humboldt State University | Arcata | Humboldt | 7,774 | 1913 |
| California State University, Bakersfield | Bakersfield | Kern | 10,493 | 1965 |
| University of California, Berkeley | Berkeley | Alameda | 42,519 | 1869 |
| California State University Channel Islands | Camarillo | Ventura | 7,095 | 2002 |
| ... | ... | ... | ... | ... |
| University of California, Santa Cruz | Santa Cruz | Santa Cruz | 19,700 | 1965 |
| California State University, Monterey Bay | Seaside-Marina | Monterey | 7,079 | 1994 |
| California State University, Stanislaus | Turlock | Stanislaus | 10,214 | 1957 |
| California State University Maritime Academy | Vallejo | Solano | 1,017 | 1929 |
32 rows × 4 columns
random_bools = np.random.choice([True, False], 32)
random_bools
array([ True, True, True, True, False, True, True, False, True,
True, False, True, False, True, True, False, True, False,
True, False, True, False, True, True, False, True, False,
True, False, True, True, True])
schools.loc[random_bools]
| City | County | Enrollment | Founded | |
|---|---|---|---|---|
| Name | ||||
| Humboldt State University | Arcata | Humboldt | 7,774 | 1913 |
| California State University, Bakersfield | Bakersfield | Kern | 10,493 | 1965 |
| University of California, Berkeley | Berkeley | Alameda | 42,519 | 1869 |
| California State University Channel Islands | Camarillo | Ventura | 7,095 | 2002 |
| ... | ... | ... | ... | ... |
| University of California, Santa Barbara | Santa Barbara | Santa Barbara | 24,346 | 1891 |
| California State University, Monterey Bay | Seaside-Marina | Monterey | 7,079 | 1994 |
| California State University, Stanislaus | Turlock | Stanislaus | 10,214 | 1957 |
| California State University Maritime Academy | Vallejo | Solano | 1,017 | 1929 |
21 rows × 4 columns
loc operator to query a DataFrame.schools
| City | County | Enrollment | Founded | |
|---|---|---|---|---|
| Name | ||||
| Humboldt State University | Arcata | Humboldt | 7,774 | 1913 |
| California State University, Bakersfield | Bakersfield | Kern | 10,493 | 1965 |
| University of California, Berkeley | Berkeley | Alameda | 42,519 | 1869 |
| California State University Channel Islands | Camarillo | Ventura | 7,095 | 2002 |
| ... | ... | ... | ... | ... |
| University of California, Santa Cruz | Santa Cruz | Santa Cruz | 19,700 | 1965 |
| California State University, Monterey Bay | Seaside-Marina | Monterey | 7,079 | 1994 |
| California State University, Stanislaus | Turlock | Stanislaus | 10,214 | 1957 |
| California State University Maritime Academy | Vallejo | Solano | 1,017 | 1929 |
32 rows × 4 columns
schools['Founded'] > 1998
Name
Humboldt State University False
California State University, Bakersfield False
University of California, Berkeley False
California State University Channel Islands True
...
University of California, Santa Cruz False
California State University, Monterey Bay False
California State University, Stanislaus False
California State University Maritime Academy False
Name: Founded, Length: 32, dtype: bool
schools.loc[schools['Founded'] > 1998]
| City | County | Enrollment | Founded | |
|---|---|---|---|---|
| Name | ||||
| California State University Channel Islands | Camarillo | Ventura | 7,095 | 2002 |
| University of California, Merced | Merced | Merced | 8,544 | 2005 |
schools.loc[schools.index.str.contains('University of California')]
| City | County | Enrollment | Founded | |
|---|---|---|---|---|
| Name | ||||
| University of California, Berkeley | Berkeley | Alameda | 42,519 | 1869 |
| University of California, Davis | Davis | Yolo | 39,152 | 1905 |
| University of California, Irvine | Irvine | Orange | 35,220 | 1965 |
| University of California, Los Angeles | Los Angeles | Los Angeles | 45,428 | 1882 |
| University of California, Merced | Merced | Merced | 8,544 | 2005 |
| University of California, Riverside | Riverside | Riverside | 23,278 | 1954 |
| University of California, San Diego | San Diego | San Diego | 38,798 | 1960 |
| University of California, Santa Barbara | Santa Barbara | Santa Barbara | 24,346 | 1891 |
| University of California, Santa Cruz | Santa Cruz | Santa Cruz | 19,700 | 1965 |
# Using loc is not strictly necessary when indexing with Boolean sequences.
schools[schools.index.str.contains('University of California')]
| City | County | Enrollment | Founded | |
|---|---|---|---|---|
| Name | ||||
| University of California, Berkeley | Berkeley | Alameda | 42,519 | 1869 |
| University of California, Davis | Davis | Yolo | 39,152 | 1905 |
| University of California, Irvine | Irvine | Orange | 35,220 | 1965 |
| University of California, Los Angeles | Los Angeles | Los Angeles | 45,428 | 1882 |
| University of California, Merced | Merced | Merced | 8,544 | 2005 |
| University of California, Riverside | Riverside | Riverside | 23,278 | 1954 |
| University of California, San Diego | San Diego | San Diego | 38,798 | 1960 |
| University of California, Santa Barbara | Santa Barbara | Santa Barbara | 24,346 | 1891 |
| University of California, Santa Cruz | Santa Cruz | Santa Cruz | 19,700 | 1965 |
Note that because we set the index to 'Name' earlier, we can select rows based on school names without having to query.
schools
| City | County | Enrollment | Founded | |
|---|---|---|---|---|
| Name | ||||
| Humboldt State University | Arcata | Humboldt | 7,774 | 1913 |
| California State University, Bakersfield | Bakersfield | Kern | 10,493 | 1965 |
| University of California, Berkeley | Berkeley | Alameda | 42,519 | 1869 |
| California State University Channel Islands | Camarillo | Ventura | 7,095 | 2002 |
| ... | ... | ... | ... | ... |
| University of California, Santa Cruz | Santa Cruz | Santa Cruz | 19,700 | 1965 |
| California State University, Monterey Bay | Seaside-Marina | Monterey | 7,079 | 1994 |
| California State University, Stanislaus | Turlock | Stanislaus | 10,214 | 1957 |
| California State University Maritime Academy | Vallejo | Solano | 1,017 | 1929 |
32 rows × 4 columns
# Series!
schools.loc['University of California, San Diego']
City San Diego County San Diego Enrollment 38,798 Founded 1960 Name: University of California, San Diego, dtype: object
If 'Name' was instead a column, then we'd need to query to access information about a particular school.
schools_reset = schools.reset_index()
schools_reset
| Name | City | County | Enrollment | Founded | |
|---|---|---|---|---|---|
| 0 | Humboldt State University | Arcata | Humboldt | 7,774 | 1913 |
| 1 | California State University, Bakersfield | Bakersfield | Kern | 10,493 | 1965 |
| 2 | University of California, Berkeley | Berkeley | Alameda | 42,519 | 1869 |
| 3 | California State University Channel Islands | Camarillo | Ventura | 7,095 | 2002 |
| ... | ... | ... | ... | ... | ... |
| 28 | University of California, Santa Cruz | Santa Cruz | Santa Cruz | 19,700 | 1965 |
| 29 | California State University, Monterey Bay | Seaside-Marina | Monterey | 7,079 | 1994 |
| 30 | California State University, Stanislaus | Turlock | Stanislaus | 10,214 | 1957 |
| 31 | California State University Maritime Academy | Vallejo | Solano | 1,017 | 1929 |
32 rows × 5 columns
# DataFrame!
schools_reset[schools_reset['Name'] == 'University of California, San Diego']
| Name | City | County | Enrollment | Founded | |
|---|---|---|---|---|---|
| 21 | University of California, San Diego | San Diego | San Diego | 38,798 | 1960 |
Write an expression that evaluates to the number of UC schools founded after 1950.
schools.loc[(schools.index.str.contains('University of California')) & (schools['Founded']>1950)].shape[0]
5
So far, we used [] to select columns and loc to select rows.
For instance, to find the cities for all schools in San Diego county:
schools
| City | County | Enrollment | Founded | |
|---|---|---|---|---|
| Name | ||||
| Humboldt State University | Arcata | Humboldt | 7,774 | 1913 |
| California State University, Bakersfield | Bakersfield | Kern | 10,493 | 1965 |
| University of California, Berkeley | Berkeley | Alameda | 42,519 | 1869 |
| California State University Channel Islands | Camarillo | Ventura | 7,095 | 2002 |
| ... | ... | ... | ... | ... |
| University of California, Santa Cruz | Santa Cruz | Santa Cruz | 19,700 | 1965 |
| California State University, Monterey Bay | Seaside-Marina | Monterey | 7,079 | 1994 |
| California State University, Stanislaus | Turlock | Stanislaus | 10,214 | 1957 |
| California State University Maritime Academy | Vallejo | Solano | 1,017 | 1929 |
32 rows × 4 columns
schools.loc[schools['County'] == 'San Diego']['City']
Name University of California, San Diego San Diego San Diego State University San Diego California State University San Marcos San Marcos Name: City, dtype: object
loc can also be used to select both rows and columns. The general pattern is:
df.loc[<row selector>, <column selector>]
Examples:
df.loc[idx_list, col_list] returns a DataFrame containing the rows in idx_list and columns in col_list.df.loc[bool_arr, col_list] returns a DataFrame contaning the rows for which bool_arr is True and columns in col_list.: syntax (e.g. 'City': 'Enrollment').There are many, many more – see the pandas documentation for more.
schools
| City | County | Enrollment | Founded | |
|---|---|---|---|---|
| Name | ||||
| Humboldt State University | Arcata | Humboldt | 7,774 | 1913 |
| California State University, Bakersfield | Bakersfield | Kern | 10,493 | 1965 |
| University of California, Berkeley | Berkeley | Alameda | 42,519 | 1869 |
| California State University Channel Islands | Camarillo | Ventura | 7,095 | 2002 |
| ... | ... | ... | ... | ... |
| University of California, Santa Cruz | Santa Cruz | Santa Cruz | 19,700 | 1965 |
| California State University, Monterey Bay | Seaside-Marina | Monterey | 7,079 | 1994 |
| California State University, Stanislaus | Turlock | Stanislaus | 10,214 | 1957 |
| California State University Maritime Academy | Vallejo | Solano | 1,017 | 1929 |
32 rows × 4 columns
# Find the city and enrollment for all schools in San Diego county.
schools.loc[schools['County'] == 'San Diego', ['City', 'Enrollment']]
| City | Enrollment | |
|---|---|---|
| Name | ||
| University of California, San Diego | San Diego | 38,798 |
| San Diego State University | San Diego | 34,881 |
| California State University San Marcos | San Marcos | 14,511 |
# Find the county, enrollment, and year founded for all schools founded after 1950.
schools.loc[schools['Founded'] > 1950, 'County':]
| County | Enrollment | Founded | |
|---|---|---|---|
| Name | |||
| California State University, Bakersfield | Kern | 10,493 | 1965 |
| California State University Channel Islands | Ventura | 7,095 | 2002 |
| California State University, Dominguez Hills | Los Angeles | 15,741 | 1960 |
| California State University, Fullerton | Orange | 39,774 | 1957 |
| ... | ... | ... | ... |
| California State University San Marcos | San Diego | 14,511 | 1988 |
| University of California, Santa Cruz | Santa Cruz | 19,700 | 1965 |
| California State University, Monterey Bay | Monterey | 7,079 | 1994 |
| California State University, Stanislaus | Stanislaus | 10,214 | 1957 |
16 rows × 3 columns
iloc!¶iloc stands for "integer location".iloc is like loc, but it selects rows and columns based off of integer positions only.schools
| City | County | Enrollment | Founded | |
|---|---|---|---|---|
| Name | ||||
| Humboldt State University | Arcata | Humboldt | 7,774 | 1913 |
| California State University, Bakersfield | Bakersfield | Kern | 10,493 | 1965 |
| University of California, Berkeley | Berkeley | Alameda | 42,519 | 1869 |
| California State University Channel Islands | Camarillo | Ventura | 7,095 | 2002 |
| ... | ... | ... | ... | ... |
| University of California, Santa Cruz | Santa Cruz | Santa Cruz | 19,700 | 1965 |
| California State University, Monterey Bay | Seaside-Marina | Monterey | 7,079 | 1994 |
| California State University, Stanislaus | Turlock | Stanislaus | 10,214 | 1957 |
| California State University Maritime Academy | Vallejo | Solano | 1,017 | 1929 |
32 rows × 4 columns
schools.iloc[3:7, :-1]
| City | County | Enrollment | |
|---|---|---|---|
| Name | |||
| California State University Channel Islands | Camarillo | Ventura | 7,095 |
| California State University, Dominguez Hills | Carson | Los Angeles | 15,741 |
| California State University, Chico | Chico | Butte | 17,488 |
| University of California, Davis | Davis | Yolo | 39,152 |
iloc is often most useful when we sort first. For instance, to find the enrollment of the youngest school in the dataset:
schools.sort_values('Founded', ascending=False)['Enrollment'].iloc[0]
'8,544'
# Finding the name involves sorting, but not iloc.
schools.sort_values('Founded', ascending=False).index[0]
'University of California, Merced'
Consider the DataFrame below.
jack = pd.DataFrame({1: ['fee', 'fi'],
'1': ['fo', 'fum']})
jack
| 1 | 1 | |
|---|---|---|
| 0 | fee | fo |
| 1 | fi | fum |
For each of the following pieces of code, predict what the output will be. Then, uncomment the line of code and see for yourself. We may not be able to cover these all in class; if so, make sure to try them on your own.
# jack[1]
# jack[[1]]
# jack['1']
# jack[[1, 1]]
# jack.loc[1]
# jack.loc[jack[1] == 'fo']
# jack[1, ['1', 1]]
# jack.loc[1,1]
pandas is the library for tabular data manipulation in Python.pandas: DataFrame, Series, and Index.pandas documentation for tips.pandas and numpy work together (and when they disagree).