import pandas as pd
import numpy as np
import os
pd.set_option('display.max_rows', 9)
pandas
.get
vs. []
.loc
and iloc
.Remember, we are not going to cover every single detail! The pandas
documentation will be your friend.
pandas
🐼¶pandas
¶pandas
is the Python library for tabular data manipulation.pandas
was developed, the standard data science workflow involved using multiple languages (Python, R, Java) in a single project.pandas
, wanted a library which would allow everything to be done in Python.pandas
data structures¶There are three key data structures at the core of pandas
:
pandas
and related libraries¶We've already run this at the top of the notebook, so we won't repeat it here. But pandas
is almost always imported in conjunction with numpy
:
import pandas as pd
import numpy as np
To refresh our memory on the basics of pandas
, let's work with a dataset that contains the name, location, enrollment, and founding date of most UCs and CSUs.
pd.read_csv
to load a DataFrame from file.os.path.join('x', 'y.csv')
evaluates to 'x/y.csv'
on Unix machines and 'x\y.csv'
on Windows.schools_path = os.path.join('data', 'california_universities.csv')
schools = pd.read_csv(schools_path)
schools
Name | City | County | Enrollment | Founded | |
---|---|---|---|---|---|
0 | Humboldt State University | Arcata | Humboldt | 7,774 | 1913 |
1 | California State University, Bakersfield | Bakersfield | Kern | 10,493 | 1965 |
2 | University of California, Berkeley | Berkeley | Alameda | 42,519 | 1869 |
3 | California State University Channel Islands | Camarillo | Ventura | 7,095 | 2002 |
... | ... | ... | ... | ... | ... |
28 | University of California, Santa Cruz | Santa Cruz | Santa Cruz | 19,700 | 1965 |
29 | California State University, Monterey Bay | Seaside-Marina | Monterey | 7,079 | 1994 |
30 | California State University, Stanislaus | Turlock | Stanislaus | 10,214 | 1957 |
31 | California State University Maritime Academy | Vallejo | Solano | 1,017 | 1929 |
32 rows × 5 columns
To extract the first or last few rows of a DataFrame, use the head
or tail
methods.
schools.head()
Name | City | County | Enrollment | Founded | |
---|---|---|---|---|---|
0 | Humboldt State University | Arcata | Humboldt | 7,774 | 1913 |
1 | California State University, Bakersfield | Bakersfield | Kern | 10,493 | 1965 |
2 | University of California, Berkeley | Berkeley | Alameda | 42,519 | 1869 |
3 | California State University Channel Islands | Camarillo | Ventura | 7,095 | 2002 |
4 | California State University, Dominguez Hills | Carson | Los Angeles | 15,741 | 1960 |
schools.tail(2)
Name | City | County | Enrollment | Founded | |
---|---|---|---|---|---|
30 | California State University, Stanislaus | Turlock | Stanislaus | 10,214 | 1957 |
31 | California State University Maritime Academy | Vallejo | Solano | 1,017 | 1929 |
The shape
attribute returns the DataFrame's number of rows and columns.
schools.shape
(32, 5)
Each row and column of a DataFrame is a Series.
Index
types.schools
Name | City | County | Enrollment | Founded | |
---|---|---|---|---|---|
0 | Humboldt State University | Arcata | Humboldt | 7,774 | 1913 |
1 | California State University, Bakersfield | Bakersfield | Kern | 10,493 | 1965 |
2 | University of California, Berkeley | Berkeley | Alameda | 42,519 | 1869 |
3 | California State University Channel Islands | Camarillo | Ventura | 7,095 | 2002 |
... | ... | ... | ... | ... | ... |
28 | University of California, Santa Cruz | Santa Cruz | Santa Cruz | 19,700 | 1965 |
29 | California State University, Monterey Bay | Seaside-Marina | Monterey | 7,079 | 1994 |
30 | California State University, Stanislaus | Turlock | Stanislaus | 10,214 | 1957 |
31 | California State University Maritime Academy | Vallejo | Solano | 1,017 | 1929 |
32 rows × 5 columns
# Index is 0, 1, 2, ..., 31
schools['City']
0 Arcata 1 Bakersfield 2 Berkeley 3 Camarillo ... 28 Santa Cruz 29 Seaside-Marina 30 Turlock 31 Vallejo Name: City, Length: 32, dtype: object
# The default index of a DataFrame is 0, 1, 2, 3, ...
schools.index
RangeIndex(start=0, stop=32, step=1)
# Index is 'Name', 'City', 'County', ...
schools.iloc[-5]
Name University of California, Santa Barbara City Santa Barbara County Santa Barbara Enrollment 24,346 Founded 1891 Name: 27, dtype: object
schools.columns
Index(['Name', 'City', 'County', 'Enrollment', 'Founded'], dtype='object')
The order of the rows in schools
does not seem to be meaningful right now. To sort by a column, use the sort_values
method. Like most DataFrame and Series methods, sort_values
returns a new DataFrame, and doesn't modify the original.
schools
Name | City | County | Enrollment | Founded | |
---|---|---|---|---|---|
0 | Humboldt State University | Arcata | Humboldt | 7,774 | 1913 |
1 | California State University, Bakersfield | Bakersfield | Kern | 10,493 | 1965 |
2 | University of California, Berkeley | Berkeley | Alameda | 42,519 | 1869 |
3 | California State University Channel Islands | Camarillo | Ventura | 7,095 | 2002 |
... | ... | ... | ... | ... | ... |
28 | University of California, Santa Cruz | Santa Cruz | Santa Cruz | 19,700 | 1965 |
29 | California State University, Monterey Bay | Seaside-Marina | Monterey | 7,079 | 1994 |
30 | California State University, Stanislaus | Turlock | Stanislaus | 10,214 | 1957 |
31 | California State University Maritime Academy | Vallejo | Solano | 1,017 | 1929 |
32 rows × 5 columns
schools.sort_values('Founded')
Name | City | County | Enrollment | Founded | |
---|---|---|---|---|---|
24 | San Jose State University | San Jose | Santa Clara | 32,828 | 1857 |
2 | University of California, Berkeley | Berkeley | Alameda | 42,519 | 1869 |
13 | University of California, Los Angeles | Los Angeles | Los Angeles | 45,428 | 1882 |
5 | California State University, Chico | Chico | Butte | 17,488 | 1887 |
... | ... | ... | ... | ... | ... |
26 | California State University San Marcos | San Marcos | San Diego | 14,511 | 1988 |
29 | California State University, Monterey Bay | Seaside-Marina | Monterey | 7,079 | 1994 |
3 | California State University Channel Islands | Camarillo | Ventura | 7,095 | 2002 |
14 | University of California, Merced | Merced | Merced | 8,544 | 2005 |
32 rows × 5 columns
schools.sort_values('Name', ascending=False)
Name | City | County | Enrollment | Founded | |
---|---|---|---|---|---|
28 | University of California, Santa Cruz | Santa Cruz | Santa Cruz | 19,700 | 1965 |
27 | University of California, Santa Barbara | Santa Barbara | Santa Barbara | 24,346 | 1891 |
21 | University of California, San Diego | San Diego | San Diego | 38,798 | 1960 |
17 | University of California, Riverside | Riverside | Riverside | 23,278 | 1954 |
... | ... | ... | ... | ... | ... |
31 | California State University Maritime Academy | Vallejo | Solano | 1,017 | 1929 |
3 | California State University Channel Islands | Camarillo | Ventura | 7,095 | 2002 |
16 | California State Polytechnic University, Pomona | Pomona | Los Angeles | 26,443 | 1938 |
25 | California Polytechnic State University | San Luis Obispo | San Luis Obispo | 21,812 | 1901 |
32 rows × 5 columns
# Why isn't this sorting correctly?
schools.sort_values('Enrollment')
Name | City | County | Enrollment | Founded | |
---|---|---|---|---|---|
31 | California State University Maritime Academy | Vallejo | Solano | 1,017 | 1929 |
30 | California State University, Stanislaus | Turlock | Stanislaus | 10,214 | 1957 |
1 | California State University, Bakersfield | Bakersfield | Kern | 10,493 | 1965 |
26 | California State University San Marcos | San Marcos | San Diego | 14,511 | 1988 |
... | ... | ... | ... | ... | ... |
3 | California State University Channel Islands | Camarillo | Ventura | 7,095 | 2002 |
0 | Humboldt State University | Arcata | Humboldt | 7,774 | 1913 |
14 | University of California, Merced | Merced | Merced | 8,544 | 2005 |
18 | Sonoma State University | Rohnert Park | Sonoma | 9,201 | 1960 |
32 rows × 5 columns
Think of each row's index as its unique identifier or name. Often, we like to set the index of a DataFrame to a unique identifier if we have one available. We can do so with the set_index
method.
# By reassigning schools, our changes will persist.
schools = schools.set_index('Name')
schools
City | County | Enrollment | Founded | |
---|---|---|---|---|
Name | ||||
Humboldt State University | Arcata | Humboldt | 7,774 | 1913 |
California State University, Bakersfield | Bakersfield | Kern | 10,493 | 1965 |
University of California, Berkeley | Berkeley | Alameda | 42,519 | 1869 |
California State University Channel Islands | Camarillo | Ventura | 7,095 | 2002 |
... | ... | ... | ... | ... |
University of California, Santa Cruz | Santa Cruz | Santa Cruz | 19,700 | 1965 |
California State University, Monterey Bay | Seaside-Marina | Monterey | 7,079 | 1994 |
California State University, Stanislaus | Turlock | Stanislaus | 10,214 | 1957 |
California State University Maritime Academy | Vallejo | Solano | 1,017 | 1929 |
32 rows × 4 columns
# Only 4 columns now!
schools.shape
(32, 4)
babypandas
👶🐼¶babypandas
, you selected columns using the .get
method..get
also works in pandas
, but it is not idiomatic – people don't usually use it.schools
City | County | Enrollment | Founded | |
---|---|---|---|---|
Name | ||||
Humboldt State University | Arcata | Humboldt | 7,774 | 1913 |
California State University, Bakersfield | Bakersfield | Kern | 10,493 | 1965 |
University of California, Berkeley | Berkeley | Alameda | 42,519 | 1869 |
California State University Channel Islands | Camarillo | Ventura | 7,095 | 2002 |
... | ... | ... | ... | ... |
University of California, Santa Cruz | Santa Cruz | Santa Cruz | 19,700 | 1965 |
California State University, Monterey Bay | Seaside-Marina | Monterey | 7,079 | 1994 |
California State University, Stanislaus | Turlock | Stanislaus | 10,214 | 1957 |
California State University Maritime Academy | Vallejo | Solano | 1,017 | 1929 |
32 rows × 4 columns
schools.get('County')
Name Humboldt State University Humboldt California State University, Bakersfield Kern University of California, Berkeley Alameda California State University Channel Islands Ventura ... University of California, Santa Cruz Santa Cruz California State University, Monterey Bay Monterey California State University, Stanislaus Stanislaus California State University Maritime Academy Solano Name: County, Length: 32, dtype: object
# This doesn't error, but sometimes we'd like it to.
schools.get('State')
[]
¶pandas
is by using the []
operator.schools
City | County | Enrollment | Founded | |
---|---|---|---|---|
Name | ||||
Humboldt State University | Arcata | Humboldt | 7,774 | 1913 |
California State University, Bakersfield | Bakersfield | Kern | 10,493 | 1965 |
University of California, Berkeley | Berkeley | Alameda | 42,519 | 1869 |
California State University Channel Islands | Camarillo | Ventura | 7,095 | 2002 |
... | ... | ... | ... | ... |
University of California, Santa Cruz | Santa Cruz | Santa Cruz | 19,700 | 1965 |
California State University, Monterey Bay | Seaside-Marina | Monterey | 7,079 | 1994 |
California State University, Stanislaus | Turlock | Stanislaus | 10,214 | 1957 |
California State University Maritime Academy | Vallejo | Solano | 1,017 | 1929 |
32 rows × 4 columns
# Returns a Series.
schools['City']
Name Humboldt State University Arcata California State University, Bakersfield Bakersfield University of California, Berkeley Berkeley California State University Channel Islands Camarillo ... University of California, Santa Cruz Santa Cruz California State University, Monterey Bay Seaside-Marina California State University, Stanislaus Turlock California State University Maritime Academy Vallejo Name: City, Length: 32, dtype: object
# Returns a DataFrame.
schools[['Founded', 'County']]
Founded | County | |
---|---|---|
Name | ||
Humboldt State University | 1913 | Humboldt |
California State University, Bakersfield | 1965 | Kern |
University of California, Berkeley | 1869 | Alameda |
California State University Channel Islands | 2002 | Ventura |
... | ... | ... |
University of California, Santa Cruz | 1965 | Santa Cruz |
California State University, Monterey Bay | 1994 | Monterey |
California State University, Stanislaus | 1957 | Stanislaus |
California State University Maritime Academy | 1929 | Solano |
32 rows × 2 columns
# 🤔
schools[['Founded']]
Founded | |
---|---|
Name | |
Humboldt State University | 1913 |
California State University, Bakersfield | 1965 |
University of California, Berkeley | 1869 |
California State University Channel Islands | 2002 |
... | ... |
University of California, Santa Cruz | 1965 |
California State University, Monterey Bay | 1994 |
California State University, Stanislaus | 1957 |
California State University Maritime Academy | 1929 |
32 rows × 1 columns
# Names are stored in the index, which is not a column!
schools['Name']
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) /nix/store/87mah7hf4cl4q9m9skkl8h6jd3j32s1n-python3-3.9.6-env/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance) 3360 try: -> 3361 return self._engine.get_loc(casted_key) 3362 except KeyError as err: /nix/store/87mah7hf4cl4q9m9skkl8h6jd3j32s1n-python3-3.9.6-env/lib/python3.9/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc() /nix/store/87mah7hf4cl4q9m9skkl8h6jd3j32s1n-python3-3.9.6-env/lib/python3.9/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc() pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item() pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item() KeyError: 'Name' The above exception was the direct cause of the following exception: KeyError Traceback (most recent call last) /tmp/nix-shell.zQepNE/ipykernel_7072/3195502009.py in <module> 1 # Names are stored in the index, which is not a column! ----> 2 schools['Name'] /nix/store/87mah7hf4cl4q9m9skkl8h6jd3j32s1n-python3-3.9.6-env/lib/python3.9/site-packages/pandas/core/frame.py in __getitem__(self, key) 3456 if self.columns.nlevels > 1: 3457 return self._getitem_multilevel(key) -> 3458 indexer = self.columns.get_loc(key) 3459 if is_integer(indexer): 3460 indexer = [indexer] /nix/store/87mah7hf4cl4q9m9skkl8h6jd3j32s1n-python3-3.9.6-env/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance) 3361 return self._engine.get_loc(casted_key) 3362 except KeyError as err: -> 3363 raise KeyError(key) from err 3364 3365 if is_scalar(key) and isna(key) and not self.hasnans: KeyError: 'Name'
schools.index
Index(['Humboldt State University', 'California State University, Bakersfield', 'University of California, Berkeley', 'California State University Channel Islands', 'California State University, Dominguez Hills', 'California State University, Chico', 'University of California, Davis', 'California State University, Fresno', 'California State University, Fullerton', 'California State University, East Bay', 'University of California, Irvine', 'California State University, Long Beach', 'California State University, Los Angeles', 'University of California, Los Angeles', 'University of California, Merced', 'California State University, Northridge', 'California State Polytechnic University, Pomona', 'University of California, Riverside', 'Sonoma State University', 'California State University, Sacramento', 'California State University, San Bernardino', 'University of California, San Diego', 'San Diego State University', 'San Francisco State University', 'San Jose State University', 'California Polytechnic State University', 'California State University San Marcos', 'University of California, Santa Barbara', 'University of California, Santa Cruz', 'California State University, Monterey Bay', 'California State University, Stanislaus', 'California State University Maritime Academy'], dtype='object', name='Name')
schools
City | County | Enrollment | Founded | |
---|---|---|---|---|
Name | ||||
Humboldt State University | Arcata | Humboldt | 7,774 | 1913 |
California State University, Bakersfield | Bakersfield | Kern | 10,493 | 1965 |
University of California, Berkeley | Berkeley | Alameda | 42,519 | 1869 |
California State University Channel Islands | Camarillo | Ventura | 7,095 | 2002 |
... | ... | ... | ... | ... |
University of California, Santa Cruz | Santa Cruz | Santa Cruz | 19,700 | 1965 |
California State University, Monterey Bay | Seaside-Marina | Monterey | 7,079 | 1994 |
California State University, Stanislaus | Turlock | Stanislaus | 10,214 | 1957 |
California State University Maritime Academy | Vallejo | Solano | 1,017 | 1929 |
32 rows × 4 columns
# What's the most common county?
schools['County'].mode()
0 Los Angeles dtype: object
# How many unique counties are represented?
schools['County'].nunique()
23
# What's the distribution of counties?
schools['County'].value_counts()
Los Angeles 6 San Diego 3 Alameda 2 Orange 2 .. Yolo 1 Butte 1 Ventura 1 Solano 1 Name: County, Length: 23, dtype: int64
# What's the mean of the 'Founded' column?
schools['Founded'].mean()
1938.625
# Tell me more about the 'Founded' column.
schools['Founded'].describe()
count 32.000000 mean 1938.625000 std 38.954688 min 1857.000000 25% 1904.000000 50% 1951.500000 75% 1961.250000 max 2005.000000 Name: Founded, dtype: float64
# Sort the 'Founded' column. Note that here we're using sort_values on a Series, not a DataFrame!
schools['Founded'].sort_values()
Name San Jose State University 1857 University of California, Berkeley 1869 University of California, Los Angeles 1882 California State University, Chico 1887 ... California State University San Marcos 1988 California State University, Monterey Bay 1994 California State University Channel Islands 2002 University of California, Merced 2005 Name: Founded, Length: 32, dtype: int64
loc
to select rows using row labels¶If df
is a DataFrame, then:
df.loc[idx]
returns the Series whose index is idx
.df.loc[idx_list]
returns a DataFrame containing the rows whose indexes are in idx_list
.schools
City | County | Enrollment | Founded | |
---|---|---|---|---|
Name | ||||
Humboldt State University | Arcata | Humboldt | 7,774 | 1913 |
California State University, Bakersfield | Bakersfield | Kern | 10,493 | 1965 |
University of California, Berkeley | Berkeley | Alameda | 42,519 | 1869 |
California State University Channel Islands | Camarillo | Ventura | 7,095 | 2002 |
... | ... | ... | ... | ... |
University of California, Santa Cruz | Santa Cruz | Santa Cruz | 19,700 | 1965 |
California State University, Monterey Bay | Seaside-Marina | Monterey | 7,079 | 1994 |
California State University, Stanislaus | Turlock | Stanislaus | 10,214 | 1957 |
California State University Maritime Academy | Vallejo | Solano | 1,017 | 1929 |
32 rows × 4 columns
schools.loc['University of California, San Diego']
City San Diego County San Diego Enrollment 38,798 Founded 1960 Name: University of California, San Diego, dtype: object
schools.loc[['University of California, San Diego', 'San Diego State University']]
City | County | Enrollment | Founded | |
---|---|---|---|---|
Name | ||||
University of California, San Diego | San Diego | San Diego | 38,798 | 1960 |
San Diego State University | San Diego | San Diego | 34,881 | 1897 |
loc
operator also supports Boolean sequences (lists, arrays, Series) as input. True
.schools
City | County | Enrollment | Founded | |
---|---|---|---|---|
Name | ||||
Humboldt State University | Arcata | Humboldt | 7,774 | 1913 |
California State University, Bakersfield | Bakersfield | Kern | 10,493 | 1965 |
University of California, Berkeley | Berkeley | Alameda | 42,519 | 1869 |
California State University Channel Islands | Camarillo | Ventura | 7,095 | 2002 |
... | ... | ... | ... | ... |
University of California, Santa Cruz | Santa Cruz | Santa Cruz | 19,700 | 1965 |
California State University, Monterey Bay | Seaside-Marina | Monterey | 7,079 | 1994 |
California State University, Stanislaus | Turlock | Stanislaus | 10,214 | 1957 |
California State University Maritime Academy | Vallejo | Solano | 1,017 | 1929 |
32 rows × 4 columns
random_bools = np.random.choice([True, False], 32)
random_bools
array([ True, True, True, True, False, True, True, False, True, True, False, True, False, True, True, False, True, False, True, False, True, False, True, True, False, True, False, True, False, True, True, True])
schools.loc[random_bools]
City | County | Enrollment | Founded | |
---|---|---|---|---|
Name | ||||
Humboldt State University | Arcata | Humboldt | 7,774 | 1913 |
California State University, Bakersfield | Bakersfield | Kern | 10,493 | 1965 |
University of California, Berkeley | Berkeley | Alameda | 42,519 | 1869 |
California State University Channel Islands | Camarillo | Ventura | 7,095 | 2002 |
... | ... | ... | ... | ... |
University of California, Santa Barbara | Santa Barbara | Santa Barbara | 24,346 | 1891 |
California State University, Monterey Bay | Seaside-Marina | Monterey | 7,079 | 1994 |
California State University, Stanislaus | Turlock | Stanislaus | 10,214 | 1957 |
California State University Maritime Academy | Vallejo | Solano | 1,017 | 1929 |
21 rows × 4 columns
loc
operator to query a DataFrame.schools
City | County | Enrollment | Founded | |
---|---|---|---|---|
Name | ||||
Humboldt State University | Arcata | Humboldt | 7,774 | 1913 |
California State University, Bakersfield | Bakersfield | Kern | 10,493 | 1965 |
University of California, Berkeley | Berkeley | Alameda | 42,519 | 1869 |
California State University Channel Islands | Camarillo | Ventura | 7,095 | 2002 |
... | ... | ... | ... | ... |
University of California, Santa Cruz | Santa Cruz | Santa Cruz | 19,700 | 1965 |
California State University, Monterey Bay | Seaside-Marina | Monterey | 7,079 | 1994 |
California State University, Stanislaus | Turlock | Stanislaus | 10,214 | 1957 |
California State University Maritime Academy | Vallejo | Solano | 1,017 | 1929 |
32 rows × 4 columns
schools['Founded'] > 1998
Name Humboldt State University False California State University, Bakersfield False University of California, Berkeley False California State University Channel Islands True ... University of California, Santa Cruz False California State University, Monterey Bay False California State University, Stanislaus False California State University Maritime Academy False Name: Founded, Length: 32, dtype: bool
schools.loc[schools['Founded'] > 1998]
City | County | Enrollment | Founded | |
---|---|---|---|---|
Name | ||||
California State University Channel Islands | Camarillo | Ventura | 7,095 | 2002 |
University of California, Merced | Merced | Merced | 8,544 | 2005 |
schools.loc[schools.index.str.contains('University of California')]
City | County | Enrollment | Founded | |
---|---|---|---|---|
Name | ||||
University of California, Berkeley | Berkeley | Alameda | 42,519 | 1869 |
University of California, Davis | Davis | Yolo | 39,152 | 1905 |
University of California, Irvine | Irvine | Orange | 35,220 | 1965 |
University of California, Los Angeles | Los Angeles | Los Angeles | 45,428 | 1882 |
University of California, Merced | Merced | Merced | 8,544 | 2005 |
University of California, Riverside | Riverside | Riverside | 23,278 | 1954 |
University of California, San Diego | San Diego | San Diego | 38,798 | 1960 |
University of California, Santa Barbara | Santa Barbara | Santa Barbara | 24,346 | 1891 |
University of California, Santa Cruz | Santa Cruz | Santa Cruz | 19,700 | 1965 |
# Using loc is not strictly necessary when indexing with Boolean sequences.
schools[schools.index.str.contains('University of California')]
City | County | Enrollment | Founded | |
---|---|---|---|---|
Name | ||||
University of California, Berkeley | Berkeley | Alameda | 42,519 | 1869 |
University of California, Davis | Davis | Yolo | 39,152 | 1905 |
University of California, Irvine | Irvine | Orange | 35,220 | 1965 |
University of California, Los Angeles | Los Angeles | Los Angeles | 45,428 | 1882 |
University of California, Merced | Merced | Merced | 8,544 | 2005 |
University of California, Riverside | Riverside | Riverside | 23,278 | 1954 |
University of California, San Diego | San Diego | San Diego | 38,798 | 1960 |
University of California, Santa Barbara | Santa Barbara | Santa Barbara | 24,346 | 1891 |
University of California, Santa Cruz | Santa Cruz | Santa Cruz | 19,700 | 1965 |
Note that because we set the index to 'Name'
earlier, we can select rows based on school names without having to query.
schools
City | County | Enrollment | Founded | |
---|---|---|---|---|
Name | ||||
Humboldt State University | Arcata | Humboldt | 7,774 | 1913 |
California State University, Bakersfield | Bakersfield | Kern | 10,493 | 1965 |
University of California, Berkeley | Berkeley | Alameda | 42,519 | 1869 |
California State University Channel Islands | Camarillo | Ventura | 7,095 | 2002 |
... | ... | ... | ... | ... |
University of California, Santa Cruz | Santa Cruz | Santa Cruz | 19,700 | 1965 |
California State University, Monterey Bay | Seaside-Marina | Monterey | 7,079 | 1994 |
California State University, Stanislaus | Turlock | Stanislaus | 10,214 | 1957 |
California State University Maritime Academy | Vallejo | Solano | 1,017 | 1929 |
32 rows × 4 columns
# Series!
schools.loc['University of California, San Diego']
City San Diego County San Diego Enrollment 38,798 Founded 1960 Name: University of California, San Diego, dtype: object
If 'Name'
was instead a column, then we'd need to query to access information about a particular school.
schools_reset = schools.reset_index()
schools_reset
Name | City | County | Enrollment | Founded | |
---|---|---|---|---|---|
0 | Humboldt State University | Arcata | Humboldt | 7,774 | 1913 |
1 | California State University, Bakersfield | Bakersfield | Kern | 10,493 | 1965 |
2 | University of California, Berkeley | Berkeley | Alameda | 42,519 | 1869 |
3 | California State University Channel Islands | Camarillo | Ventura | 7,095 | 2002 |
... | ... | ... | ... | ... | ... |
28 | University of California, Santa Cruz | Santa Cruz | Santa Cruz | 19,700 | 1965 |
29 | California State University, Monterey Bay | Seaside-Marina | Monterey | 7,079 | 1994 |
30 | California State University, Stanislaus | Turlock | Stanislaus | 10,214 | 1957 |
31 | California State University Maritime Academy | Vallejo | Solano | 1,017 | 1929 |
32 rows × 5 columns
# DataFrame!
schools_reset[schools_reset['Name'] == 'University of California, San Diego']
Name | City | County | Enrollment | Founded | |
---|---|---|---|---|---|
21 | University of California, San Diego | San Diego | San Diego | 38,798 | 1960 |
Write an expression that evaluates to the number of UC schools founded after 1950.
schools.loc[(schools.index.str.contains('University of California')) & (schools['Founded']>1950)].shape[0]
5
So far, we used []
to select columns and loc
to select rows.
For instance, to find the cities for all schools in San Diego county:
schools
City | County | Enrollment | Founded | |
---|---|---|---|---|
Name | ||||
Humboldt State University | Arcata | Humboldt | 7,774 | 1913 |
California State University, Bakersfield | Bakersfield | Kern | 10,493 | 1965 |
University of California, Berkeley | Berkeley | Alameda | 42,519 | 1869 |
California State University Channel Islands | Camarillo | Ventura | 7,095 | 2002 |
... | ... | ... | ... | ... |
University of California, Santa Cruz | Santa Cruz | Santa Cruz | 19,700 | 1965 |
California State University, Monterey Bay | Seaside-Marina | Monterey | 7,079 | 1994 |
California State University, Stanislaus | Turlock | Stanislaus | 10,214 | 1957 |
California State University Maritime Academy | Vallejo | Solano | 1,017 | 1929 |
32 rows × 4 columns
schools.loc[schools['County'] == 'San Diego']['City']
Name University of California, San Diego San Diego San Diego State University San Diego California State University San Marcos San Marcos Name: City, dtype: object
loc
can also be used to select both rows and columns. The general pattern is:
df.loc[<row selector>, <column selector>]
Examples:
df.loc[idx_list, col_list]
returns a DataFrame containing the rows in idx_list
and columns in col_list
.df.loc[bool_arr, col_list]
returns a DataFrame contaning the rows for which bool_arr
is True
and columns in col_list
.:
syntax (e.g. 'City': 'Enrollment'
).There are many, many more – see the pandas
documentation for more.
schools
City | County | Enrollment | Founded | |
---|---|---|---|---|
Name | ||||
Humboldt State University | Arcata | Humboldt | 7,774 | 1913 |
California State University, Bakersfield | Bakersfield | Kern | 10,493 | 1965 |
University of California, Berkeley | Berkeley | Alameda | 42,519 | 1869 |
California State University Channel Islands | Camarillo | Ventura | 7,095 | 2002 |
... | ... | ... | ... | ... |
University of California, Santa Cruz | Santa Cruz | Santa Cruz | 19,700 | 1965 |
California State University, Monterey Bay | Seaside-Marina | Monterey | 7,079 | 1994 |
California State University, Stanislaus | Turlock | Stanislaus | 10,214 | 1957 |
California State University Maritime Academy | Vallejo | Solano | 1,017 | 1929 |
32 rows × 4 columns
# Find the city and enrollment for all schools in San Diego county.
schools.loc[schools['County'] == 'San Diego', ['City', 'Enrollment']]
City | Enrollment | |
---|---|---|
Name | ||
University of California, San Diego | San Diego | 38,798 |
San Diego State University | San Diego | 34,881 |
California State University San Marcos | San Marcos | 14,511 |
# Find the county, enrollment, and year founded for all schools founded after 1950.
schools.loc[schools['Founded'] > 1950, 'County':]
County | Enrollment | Founded | |
---|---|---|---|
Name | |||
California State University, Bakersfield | Kern | 10,493 | 1965 |
California State University Channel Islands | Ventura | 7,095 | 2002 |
California State University, Dominguez Hills | Los Angeles | 15,741 | 1960 |
California State University, Fullerton | Orange | 39,774 | 1957 |
... | ... | ... | ... |
California State University San Marcos | San Diego | 14,511 | 1988 |
University of California, Santa Cruz | Santa Cruz | 19,700 | 1965 |
California State University, Monterey Bay | Monterey | 7,079 | 1994 |
California State University, Stanislaus | Stanislaus | 10,214 | 1957 |
16 rows × 3 columns
iloc
!¶iloc
stands for "integer location".iloc
is like loc
, but it selects rows and columns based off of integer positions only.schools
City | County | Enrollment | Founded | |
---|---|---|---|---|
Name | ||||
Humboldt State University | Arcata | Humboldt | 7,774 | 1913 |
California State University, Bakersfield | Bakersfield | Kern | 10,493 | 1965 |
University of California, Berkeley | Berkeley | Alameda | 42,519 | 1869 |
California State University Channel Islands | Camarillo | Ventura | 7,095 | 2002 |
... | ... | ... | ... | ... |
University of California, Santa Cruz | Santa Cruz | Santa Cruz | 19,700 | 1965 |
California State University, Monterey Bay | Seaside-Marina | Monterey | 7,079 | 1994 |
California State University, Stanislaus | Turlock | Stanislaus | 10,214 | 1957 |
California State University Maritime Academy | Vallejo | Solano | 1,017 | 1929 |
32 rows × 4 columns
schools.iloc[3:7, :-1]
City | County | Enrollment | |
---|---|---|---|
Name | |||
California State University Channel Islands | Camarillo | Ventura | 7,095 |
California State University, Dominguez Hills | Carson | Los Angeles | 15,741 |
California State University, Chico | Chico | Butte | 17,488 |
University of California, Davis | Davis | Yolo | 39,152 |
iloc
is often most useful when we sort first. For instance, to find the enrollment of the youngest school in the dataset:
schools.sort_values('Founded', ascending=False)['Enrollment'].iloc[0]
'8,544'
# Finding the name involves sorting, but not iloc.
schools.sort_values('Founded', ascending=False).index[0]
'University of California, Merced'
Consider the DataFrame below.
jack = pd.DataFrame({1: ['fee', 'fi'],
'1': ['fo', 'fum']})
jack
1 | 1 | |
---|---|---|
0 | fee | fo |
1 | fi | fum |
For each of the following pieces of code, predict what the output will be. Then, uncomment the line of code and see for yourself. We may not be able to cover these all in class; if so, make sure to try them on your own.
# jack[1]
# jack[[1]]
# jack['1']
# jack[[1, 1]]
# jack.loc[1]
# jack.loc[jack[1] == 'fo']
# jack[1, ['1', 1]]
# jack.loc[1,1]
pandas
is the library for tabular data manipulation in Python.pandas
: DataFrame, Series, and Index.pandas
documentation for tips.pandas
and numpy
work together (and when they disagree).