# Run this cell if you're following along – it just helps make the lectures appear prettier.
import pandas as pd
import numpy as np
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option("display.max_rows", 7)
pd.set_option("display.max_columns", 8)
pd.set_option("display.precision", 2)
Arrays come from a package called numpy that we need to import to use.
import numpy as np
We create arrays by passing a list to the np.array function.
temperature_array = np.array([68, 73, 70, 74, 76, 72, 74])
temperature_array
array([68, 73, 70, 74, 76, 72, 74])
temperature_array[1]
73
Arrays make it easy to perform the same operation to every element. This behavior is formally known as "broadcasting".
temperature_array
array([68, 73, 70, 74, 76, 72, 74])
# Increase all temperatures by 3 degrees.
temperature_array + 3
array([71, 76, 73, 77, 79, 75, 77])
# Halve all temperatures.
temperature_array / 2
array([34. , 36.5, 35. , 37. , 38. , 36. , 37. ])
# Convert all temperatures to Celsius.
(5 / 9) * (temperature_array - 32)
array([20. , 22.78, 21.11, 23.33, 24.44, 22.22, 23.33])
Note: In none of the above cells did we actually modify temperature_array! Each of those expressions created a new array.
temperature_array
array([68, 73, 70, 74, 76, 72, 74])
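As a small illustration of the note above, the sketch below uses a hypothetical array named temps_f (not part of the lecture) to check that arithmetic on an array leaves the original untouched and produces a separate array.

```python
import numpy as np

# Hypothetical Fahrenheit readings, used only for illustration.
temps_f = np.array([68, 73, 70])

# The arithmetic builds a brand-new array; temps_f is not modified.
temps_c = (5 / 9) * (temps_f - 32)

print(temps_f)             # still the original Fahrenheit values
print(temps_c is temps_f)  # False: these are two separate arrays
```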
To actually change temperature_array, we need to reassign it to a new array.
temperature_array = (5 / 9) * (temperature_array - 32)
# Now in Celsius!
temperature_array
array([20. , 22.78, 21.11, 23.33, 24.44, 22.22, 23.33])
a = np.array([4, 5, -1])
b = np.array([2, 3, 2])
a + b
array([6, 8, 1])
a / b
array([ 2. , 1.67, -0.5 ])
a ** 2 + b ** 2
array([20, 34, 5])
Arrays work with a variety of methods, which are functions designed to operate specifically on arrays.
Call these methods using dot notation, e.g. array_name.method().
temperature_array.max()
24.444444444444446
temperature_array.mean()
22.460317460317462
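To see a few more array methods side by side, here is a minimal sketch on a hypothetical array called scores (an assumption made for illustration, not data from the lecture). Each method is called with dot notation and returns a single number.

```python
import numpy as np

# Hypothetical quiz scores, used only for illustration.
scores = np.array([3, 9, 6])

largest = scores.max()    # largest element
smallest = scores.min()   # smallest element
total = scores.sum()      # sum of all elements
average = scores.mean()   # total divided by the number of elements

print(largest, smallest, total, average)
```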
We decided to make a series of TikToks called "A Day in the Life of a Baby Panda". The number of views we've received on each video is stored in the array views below.
views = np.array([158, 352, 195, 1423916, 46])
Some questions:
views - views.mean()
array([-284775.4, -284581.4, -284738.4, 1138982.6, -284887.4])
(views - views.mean()).max()
1138982.6
views.max() * 0.03 / 1000
42.717479999999995
We often find ourselves needing to make arrays like this:
day_of_month = np.array([
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
23, 24, 25, 26, 27, 28, 29, 30, 31
])
There needs to be an easier way to do this!
That easier way is np.arange.
np.arange(start, end, step) returns an array such that:
- The first element is start. By default, start is 0.
- Each subsequent element is step more than the previous one, until (but excluding) end. By default, step is 1.
# Start at 0, end before 8, step by 1.
# This will be our most common use-case!
np.arange(8)
array([0, 1, 2, 3, 4, 5, 6, 7])
# Start at 5, end before 10, step by 1.
np.arange(5, 10)
array([5, 6, 7, 8, 9])
# Start at 3, end before 32, step by 5.
np.arange(3, 32, 5)
array([ 3, 8, 13, 18, 23, 28])
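Coming back to the day_of_month array from earlier, one way to build it with np.arange is sketched below. Since end is excluded, we pass 32 so that 31 is the last element.

```python
import numpy as np

# Start at 1, end before 32, step by 1 (the default step).
day_of_month = np.arange(1, 32)

# First element, last element, and number of elements.
print(day_of_month[0], day_of_month[-1], len(day_of_month))
```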
The step size in np.arange can be fractional, or even negative. Predict what arrays will be produced by each line of code below. Then copy each line into a code cell and run it to see if you're right.
np.arange(-3, 2, 0.5)
np.arange(1, -10, -3)
...
Ellipsis
...
Ellipsis
🎉 Congrats! 🎉 You won the lottery 💰. Here's how your payout works: on the first day of September, you are paid \$0.01. Every day thereafter, your pay doubles, so on the second day you're paid \$0.02, on the third day you're paid \$0.04, on the fourth day you're paid \$0.08, and so on.
September has 30 days.
Write a one-line expression that uses the numbers 2 and 30, along with the function np.arange and the method .sum(), that computes the total amount in dollars you will be paid in September.
...
Ellipsis
After trying the challenge problem on your own, watch this walkthrough 🎥 video.
pandas
pandas is a Python package that allows us to work with tabular data – that is, data in the form of a table that we might otherwise work with as a spreadsheet (in Excel or Google Sheets).
pandas is the tool for doing data science in Python.
pandas is not so cute...
babypandas!
In this class, we'll use a simplified version of pandas called babypandas.
Code written for babypandas is still valid pandas code. You are learning pandas!
babypandas 🐼
Tables in babypandas (and pandas) are called "DataFrames."
To use DataFrames, we need to import babypandas. (We'll need numpy as well.)
import babypandas as bpd
import numpy as np
We'll usually work with data stored in the CSV format. CSV stands for "comma-separated values."
We can read in a CSV using bpd.read_csv(...). Replace the ... with a path to the CSV file relative to your notebook; if the file is in the same folder as your notebook, this is just the name of the file.
# Our CSV file is stored not in the same folder as our notebook, but within a folder called data.
states = bpd.read_csv('data/states.csv')
states
State | Region | Capital City | Population | Land Area | Party | |
---|---|---|---|---|---|---|
0 | Alabama | South | Montgomery | 5024279 | 50645 | Republican |
1 | Alaska | West | Juneau | 733391 | 570641 | Republican |
2 | Arizona | West | Phoenix | 7151502 | 113594 | Republican |
3 | Arkansas | South | Little Rock | 3011524 | 52035 | Republican |
4 | California | West | Sacramento | 39538223 | 155779 | Democratic |
... | ... | ... | ... | ... | ... | ... |
45 | Virginia | South | Richmond | 8631393 | 39490 | Democratic |
46 | Washington | West | Olympia | 7705281 | 66456 | Democratic |
47 | West Virginia | South | Charleston | 1793716 | 24038 | Republican |
48 | Wisconsin | Midwest | Madison | 5893718 | 54158 | Republican |
49 | Wyoming | West | Cheyenne | 576851 | 97093 | Republican |
50 rows × 6 columns
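To make the CSV format concrete, here is a minimal sketch that reads comma-separated text into a table. It uses pandas's pd.read_csv on an in-memory string instead of a file, and assumes bpd.read_csv treats a real file the same way; csv_text and table are hypothetical names, and only two rows from the table above are included.

```python
import io
import pandas as pd

# The first line holds the column labels; each later line is one row.
csv_text = "State,Population\nAlaska,733391\nWyoming,576851\n"

# io.StringIO lets us treat the string like an open file.
table = pd.read_csv(io.StringIO(csv_text))

print(table.shape)          # (number of rows, number of columns)
print(list(table.columns))  # labels taken from the first line
```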
Most of the data is self-explanatory, but there are a few things to note:
- 'Population' figures come from the 2020 census.
- 'Land Area' is measured in square miles.
- The 'Region' column places each state in one of four regions, as determined by the US Census Bureau.
- The 'Party' column classifies each state as 'Democratic' or 'Republican' based on a political science measurement called the Cook Partisan Voter Index.
Each column has a label, such as 'Capital City' and 'Land Area'.
# This DataFrame has 50 rows and 6 columns.
states
State | Region | Capital City | Population | Land Area | Party | |
---|---|---|---|---|---|---|
0 | Alabama | South | Montgomery | 5024279 | 50645 | Republican |
1 | Alaska | West | Juneau | 733391 | 570641 | Republican |
2 | Arizona | West | Phoenix | 7151502 | 113594 | Republican |
3 | Arkansas | South | Little Rock | 3011524 | 52035 | Republican |
4 | California | West | Sacramento | 39538223 | 155779 | Democratic |
... | ... | ... | ... | ... | ... | ... |
45 | Virginia | South | Richmond | 8631393 | 39490 | Democratic |
46 | Washington | West | Olympia | 7705281 | 66456 | Democratic |
47 | West Virginia | South | Charleston | 1793716 | 24038 | Republican |
48 | Wisconsin | Midwest | Madison | 5893718 | 54158 | Republican |
49 | Wyoming | West | Cheyenne | 576851 | 97093 | Republican |
50 rows × 6 columns
Key concepts: Accessing columns, performing calculations with them, and adding new columns.
Question: What is the population density of each state, in people per square mile?
states
State | Region | Capital City | Population | Land Area | Party | |
---|---|---|---|---|---|---|
0 | Alabama | South | Montgomery | 5024279 | 50645 | Republican |
1 | Alaska | West | Juneau | 733391 | 570641 | Republican |
2 | Arizona | West | Phoenix | 7151502 | 113594 | Republican |
3 | Arkansas | South | Little Rock | 3011524 | 52035 | Republican |
4 | California | West | Sacramento | 39538223 | 155779 | Democratic |
... | ... | ... | ... | ... | ... | ... |
45 | Virginia | South | Richmond | 8631393 | 39490 | Democratic |
46 | Washington | West | Olympia | 7705281 | 66456 | Democratic |
47 | West Virginia | South | Charleston | 1793716 | 24038 | Republican |
48 | Wisconsin | Midwest | Madison | 5893718 | 54158 | Republican |
49 | Wyoming | West | Cheyenne | 576851 | 97093 | Republican |
50 rows × 6 columns
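To answer this, we need to:
- Get the 'Population' column.
- Divide it by the 'Land Area' column.
Getting the 'Population' column
We can get a column from a DataFrame using .get(column_name).
states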
State | Region | Capital City | Population | Land Area | Party | |
---|---|---|---|---|---|---|
0 | Alabama | South | Montgomery | 5024279 | 50645 | Republican |
1 | Alaska | West | Juneau | 733391 | 570641 | Republican |
2 | Arizona | West | Phoenix | 7151502 | 113594 | Republican |
3 | Arkansas | South | Little Rock | 3011524 | 52035 | Republican |
4 | California | West | Sacramento | 39538223 | 155779 | Democratic |
... | ... | ... | ... | ... | ... | ... |
45 | Virginia | South | Richmond | 8631393 | 39490 | Democratic |
46 | Washington | West | Olympia | 7705281 | 66456 | Democratic |
47 | West Virginia | South | Charleston | 1793716 | 24038 | Republican |
48 | Wisconsin | Midwest | Madison | 5893718 | 54158 | Republican |
49 | Wyoming | West | Cheyenne | 576851 | 97093 | Republican |
50 rows × 6 columns
states.get('Population')
0      5024279
1       733391
2      7151502
3      3011524
4     39538223
        ...
45     8631393
46     7705281
47     1793716
48     5893718
49      576851
Name: Population, Length: 50, dtype: int64
type(states.get('Population'))
babypandas.bpd.Series
Getting the 'Land Area' column and dividing element-wise
states.get('Land Area')
0      50645
1     570641
2     113594
3      52035
4     155779
       ...
45     39490
46     66456
47     24038
48     54158
49     97093
Name: Land Area, Length: 50, dtype: int64
states.get('Population') / states.get('Land Area')
0      99.21
1       1.29
2      62.96
3      57.87
4     253.81
       ...
45    218.57
46    115.95
47     74.62
48    108.82
49      5.94
Length: 50, dtype: float64
Use .assign(name_of_column=data_in_series) to assign a Series (or array, or list) to a DataFrame.
Note that there are no quotes around name_of_column.
states.assign(
Density=states.get('Population') / states.get('Land Area')
)
State | Region | Capital City | Population | Land Area | Party | Density | |
---|---|---|---|---|---|---|---|
0 | Alabama | South | Montgomery | 5024279 | 50645 | Republican | 99.21 |
1 | Alaska | West | Juneau | 733391 | 570641 | Republican | 1.29 |
2 | Arizona | West | Phoenix | 7151502 | 113594 | Republican | 62.96 |
3 | Arkansas | South | Little Rock | 3011524 | 52035 | Republican | 57.87 |
4 | California | West | Sacramento | 39538223 | 155779 | Democratic | 253.81 |
... | ... | ... | ... | ... | ... | ... | ... |
45 | Virginia | South | Richmond | 8631393 | 39490 | Democratic | 218.57 |
46 | Washington | West | Olympia | 7705281 | 66456 | Democratic | 115.95 |
47 | West Virginia | South | Charleston | 1793716 | 24038 | Republican | 74.62 |
48 | Wisconsin | Midwest | Madison | 5893718 | 54158 | Republican | 108.82 |
49 | Wyoming | West | Cheyenne | 576851 | 97093 | Republican | 5.94 |
50 rows × 7 columns
states
State | Region | Capital City | Population | Land Area | Party | |
---|---|---|---|---|---|---|
0 | Alabama | South | Montgomery | 5024279 | 50645 | Republican |
1 | Alaska | West | Juneau | 733391 | 570641 | Republican |
2 | Arizona | West | Phoenix | 7151502 | 113594 | Republican |
3 | Arkansas | South | Little Rock | 3011524 | 52035 | Republican |
4 | California | West | Sacramento | 39538223 | 155779 | Democratic |
... | ... | ... | ... | ... | ... | ... |
45 | Virginia | South | Richmond | 8631393 | 39490 | Democratic |
46 | Washington | West | Olympia | 7705281 | 66456 | Democratic |
47 | West Virginia | South | Charleston | 1793716 | 24038 | Republican |
48 | Wisconsin | Midwest | Madison | 5893718 | 54158 | Republican |
49 | Wyoming | West | Cheyenne | 576851 | 97093 | Republican |
50 rows × 6 columns
states = states.assign(
Density=states.get('Population') / states.get('Land Area')
)
states
State | Region | Capital City | Population | Land Area | Party | Density | |
---|---|---|---|---|---|---|---|
0 | Alabama | South | Montgomery | 5024279 | 50645 | Republican | 99.21 |
1 | Alaska | West | Juneau | 733391 | 570641 | Republican | 1.29 |
2 | Arizona | West | Phoenix | 7151502 | 113594 | Republican | 62.96 |
3 | Arkansas | South | Little Rock | 3011524 | 52035 | Republican | 57.87 |
4 | California | West | Sacramento | 39538223 | 155779 | Democratic | 253.81 |
... | ... | ... | ... | ... | ... | ... | ... |
45 | Virginia | South | Richmond | 8631393 | 39490 | Democratic | 218.57 |
46 | Washington | West | Olympia | 7705281 | 66456 | Democratic | 115.95 |
47 | West Virginia | South | Charleston | 1793716 | 24038 | Republican | 74.62 |
48 | Wisconsin | Midwest | Madison | 5893718 | 54158 | Republican | 108.82 |
49 | Wyoming | West | Cheyenne | 576851 | 97093 | Republican | 5.94 |
50 rows × 7 columns
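The sketch below repeats the same pattern on a hypothetical two-row table named small, using pandas directly (pandas DataFrames also support .get and .assign). It checks that .assign hands back a new DataFrame and leaves the one it was called on unchanged, which is why the reassignment above is needed.

```python
import pandas as pd

# Hypothetical two-row table; the numbers are the Alaska and California values shown above.
small = pd.DataFrame({
    'State': ['Alaska', 'California'],
    'Population': [733391, 39538223],
    'Land Area': [570641, 155779],
})

# .assign returns a new DataFrame with the extra column.
with_density = small.assign(
    Density=small.get('Population') / small.get('Land Area')
)

print(list(small.columns))         # no 'Density' column here
print(list(with_density.columns))  # 'Density' appears only in the new DataFrame
```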
Key concept: Computing statistics of columns using Series methods.
Series, like arrays, have helpful methods, including .min(), .max(), and .mean().
states.get('Density').max()
1263.1212945335872
What state does this correspond to? We'll see how to find out shortly!
Other statistics:
states.get('Density').min()
1.2852055845969708
states.get('Density').mean()
206.54513507096465
states.get('Density').median()
108.31649013462203
# Lots of information at once!
states.get('Density').describe()
count      50.00
mean      206.55
std       274.93
min         1.29
25%        47.06
50%       108.32
75%       224.57
max      1263.12
Name: Density, dtype: float64
Key concepts: Sorting. Accessing using integer positions.
Use the .sort_values(by=column_name) method to sort.
The by= can be omitted, but helps with readability.
states.sort_values(by='Density')
State | Region | Capital City | Population | Land Area | Party | Density | |
---|---|---|---|---|---|---|---|
1 | Alaska | West | Juneau | 733391 | 570641 | Republican | 1.29 |
49 | Wyoming | West | Cheyenne | 576851 | 97093 | Republican | 5.94 |
25 | Montana | West | Helena | 1084225 | 145546 | Republican | 7.45 |
33 | North Dakota | Midwest | Bismarck | 779094 | 69001 | Republican | 11.29 |
40 | South Dakota | Midwest | Pierre | 886667 | 75811 | Republican | 11.70 |
... | ... | ... | ... | ... | ... | ... | ... |
19 | Maryland | South | Annapolis | 6177224 | 9707 | Democratic | 636.37 |
6 | Connecticut | Northeast | Hartford | 3605944 | 4842 | Democratic | 744.72 |
20 | Massachusetts | Northeast | Boston | 7029917 | 7800 | Democratic | 901.27 |
38 | Rhode Island | Northeast | Providence | 1097379 | 1034 | Democratic | 1061.29 |
29 | New Jersey | Northeast | Trenton | 9288994 | 7354 | Democratic | 1263.12 |
50 rows × 7 columns
This sorts, but in ascending order (small to large). The opposite would be nice!
Use .sort_values(by=column_name, ascending=False) to sort in descending order.
ascending is an optional argument. If omitted, it will be set to True by default.
To specify the sorting order, we must use the keyword ascending=.
ordered_states = states.sort_values(by='Density', ascending=False)
ordered_states
State | Region | Capital City | Population | Land Area | Party | Density | |
---|---|---|---|---|---|---|---|
29 | New Jersey | Northeast | Trenton | 9288994 | 7354 | Democratic | 1263.12 |
38 | Rhode Island | Northeast | Providence | 1097379 | 1034 | Democratic | 1061.29 |
20 | Massachusetts | Northeast | Boston | 7029917 | 7800 | Democratic | 901.27 |
6 | Connecticut | Northeast | Hartford | 3605944 | 4842 | Democratic | 744.72 |
19 | Maryland | South | Annapolis | 6177224 | 9707 | Democratic | 636.37 |
... | ... | ... | ... | ... | ... | ... | ... |
40 | South Dakota | Midwest | Pierre | 886667 | 75811 | Republican | 11.70 |
33 | North Dakota | Midwest | Bismarck | 779094 | 69001 | Republican | 11.29 |
25 | Montana | West | Helena | 1084225 | 145546 | Republican | 7.45 |
49 | Wyoming | West | Cheyenne | 576851 | 97093 | Republican | 5.94 |
1 | Alaska | West | Juneau | 733391 | 570641 | Republican | 1.29 |
50 rows × 7 columns
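Here is a minimal sketch of both sort orders on a hypothetical three-row table named toy, written with pandas directly. The densities are the rounded values for those states from the table above.

```python
import pandas as pd

# Hypothetical three-row table, used only for illustration.
toy = pd.DataFrame({
    'State': ['Alaska', 'New Jersey', 'Wyoming'],
    'Density': [1.29, 1263.12, 5.94],
})

# ascending=True is the default, so it can be left out.
ascending = toy.sort_values(by='Density')

# The keyword ascending=False puts the largest density first.
descending = toy.sort_values(by='Density', ascending=False)

print(list(ascending.get('State')))
print(list(descending.get('State')))
```

Neither call changes toy itself; each returns a new, sorted DataFrame.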
# We must specify the role of False by using ascending=,
# otherwise Python does not know how to interpret this.
states.sort_values(by='total', False)
File "/var/folders/28/vs8cp38n1r1520g8bhzr4v5h0000gn/T/ipykernel_83524/2213445462.py", line 3
    states.sort_values(by='total', False)
                                        ^
SyntaxError: positional argument follows keyword argument
To access an element of a Series by its position, use .iloc[integer_position].
.iloc stands for "integer location" and is used to count the rows, starting at 0.
ordered_states
State | Region | Capital City | Population | Land Area | Party | Density | |
---|---|---|---|---|---|---|---|
29 | New Jersey | Northeast | Trenton | 9288994 | 7354 | Democratic | 1263.12 |
38 | Rhode Island | Northeast | Providence | 1097379 | 1034 | Democratic | 1061.29 |
20 | Massachusetts | Northeast | Boston | 7029917 | 7800 | Democratic | 901.27 |
6 | Connecticut | Northeast | Hartford | 3605944 | 4842 | Democratic | 744.72 |
19 | Maryland | South | Annapolis | 6177224 | 9707 | Democratic | 636.37 |
... | ... | ... | ... | ... | ... | ... | ... |
40 | South Dakota | Midwest | Pierre | 886667 | 75811 | Republican | 11.70 |
33 | North Dakota | Midwest | Bismarck | 779094 | 69001 | Republican | 11.29 |
25 | Montana | West | Helena | 1084225 | 145546 | Republican | 7.45 |
49 | Wyoming | West | Cheyenne | 576851 | 97093 | Republican | 5.94 |
1 | Alaska | West | Juneau | 733391 | 570641 | Republican | 1.29 |
50 rows × 7 columns
ordered_states.get('State')
29       New Jersey
38     Rhode Island
20    Massachusetts
6       Connecticut
19         Maryland
          ...
40     South Dakota
33     North Dakota
25          Montana
49          Wyoming
1            Alaska
Name: State, Length: 50, dtype: object
# We want the first entry of the Series, which is at "integer location" 0.
ordered_states.get('State').iloc[0]
'New Jersey'
New Jersey's row label is 29, but we don't use row labels with iloc; we use the integer position counting from the top.
If we pass the row label 29 to iloc, we get the state with the 30th highest population density, which is not New Jersey.
ordered_states.get('State').iloc[29]
'Minnesota'
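The difference between a row label and an integer position is easier to see on a tiny table. The sketch below uses pandas directly with a hypothetical three-row DataFrame named toy; after sorting, each row keeps its original label but sits at a new position. The .index attribute used here is a pandas feature assumed for illustration and is not something this lecture has introduced.

```python
import pandas as pd

# Hypothetical three-row table, used only for illustration.
toy = pd.DataFrame({
    'State': ['Alaska', 'New Jersey', 'Wyoming'],
    'Density': [1.29, 1263.12, 5.94],
})

ordered = toy.sort_values(by='Density', ascending=False)
state_names = ordered.get('State')

top_state = state_names.iloc[0]      # the row at the top after sorting
top_label = state_names.index[0]     # the row label that traveled with that row
at_position_1 = state_names.iloc[1]  # the second row from the top

print(top_state, top_label, at_position_1)
```

New Jersey has row label 1 here, yet passing 1 to iloc returns the second row from the top, which is a different state.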
Key concept: Accessing using row labels.
We know how to get the 'Density' of all states. How do we find the one that corresponds to Pennsylvania?
states
State | Region | Capital City | Population | Land Area | Party | Density | |
---|---|---|---|---|---|---|---|
0 | Alabama | South | Montgomery | 5024279 | 50645 | Republican | 99.21 |
1 | Alaska | West | Juneau | 733391 | 570641 | Republican | 1.29 |
2 | Arizona | West | Phoenix | 7151502 | 113594 | Republican | 62.96 |
3 | Arkansas | South | Little Rock | 3011524 | 52035 | Republican | 57.87 |
4 | California | West | Sacramento | 39538223 | 155779 | Democratic | 253.81 |
... | ... | ... | ... | ... | ... | ... | ... |
45 | Virginia | South | Richmond | 8631393 | 39490 | Democratic | 218.57 |
46 | Washington | West | Olympia | 7705281 | 66456 | Democratic | 115.95 |
47 | West Virginia | South | Charleston | 1793716 | 24038 | Republican | 74.62 |
48 | Wisconsin | Midwest | Madison | 5893718 | 54158 | Republican | 108.82 |
49 | Wyoming | West | Cheyenne | 576851 | 97093 | Republican | 5.94 |
50 rows × 7 columns
# Which one is Pennsylvania?
states.get('Density')
0      99.21
1       1.29
2      62.96
3      57.87
4     253.81
       ...
45    218.57
46    115.95
47     74.62
48    108.82
49      5.94
Name: Density, Length: 50, dtype: float64
bpd.read_csv('data/states.csv')
State | Region | Capital City | Population | Land Area | Party | |
---|---|---|---|---|---|---|
0 | Alabama | South | Montgomery | 5024279 | 50645 | Republican |
1 | Alaska | West | Juneau | 733391 | 570641 | Republican |
2 | Arizona | West | Phoenix | 7151502 | 113594 | Republican |
3 | Arkansas | South | Little Rock | 3011524 | 52035 | Republican |
4 | California | West | Sacramento | 39538223 | 155779 | Democratic |
... | ... | ... | ... | ... | ... | ... |
45 | Virginia | South | Richmond | 8631393 | 39490 | Democratic |
46 | Washington | West | Olympia | 7705281 | 66456 | Democratic |
47 | West Virginia | South | Charleston | 1793716 | 24038 | Republican |
48 | Wisconsin | Midwest | Madison | 5893718 | 54158 | Republican |
49 | Wyoming | West | Cheyenne | 576851 | 97093 | Republican |
50 rows × 6 columns
To use a column as the index (the row labels), use .set_index(column_name).
states
State | Region | Capital City | Population | Land Area | Party | Density | |
---|---|---|---|---|---|---|---|
0 | Alabama | South | Montgomery | 5024279 | 50645 | Republican | 99.21 |
1 | Alaska | West | Juneau | 733391 | 570641 | Republican | 1.29 |
2 | Arizona | West | Phoenix | 7151502 | 113594 | Republican | 62.96 |
3 | Arkansas | South | Little Rock | 3011524 | 52035 | Republican | 57.87 |
4 | California | West | Sacramento | 39538223 | 155779 | Democratic | 253.81 |
... | ... | ... | ... | ... | ... | ... | ... |
45 | Virginia | South | Richmond | 8631393 | 39490 | Democratic | 218.57 |
46 | Washington | West | Olympia | 7705281 | 66456 | Democratic | 115.95 |
47 | West Virginia | South | Charleston | 1793716 | 24038 | Republican | 74.62 |
48 | Wisconsin | Midwest | Madison | 5893718 | 54158 | Republican | 108.82 |
49 | Wyoming | West | Cheyenne | 576851 | 97093 | Republican | 5.94 |
50 rows × 7 columns
states.set_index('State')
Region | Capital City | Population | Land Area | Party | Density | |
---|---|---|---|---|---|---|
State | ||||||
Alabama | South | Montgomery | 5024279 | 50645 | Republican | 99.21 |
Alaska | West | Juneau | 733391 | 570641 | Republican | 1.29 |
Arizona | West | Phoenix | 7151502 | 113594 | Republican | 62.96 |
Arkansas | South | Little Rock | 3011524 | 52035 | Republican | 57.87 |
California | West | Sacramento | 39538223 | 155779 | Democratic | 253.81 |
... | ... | ... | ... | ... | ... | ... |
Virginia | South | Richmond | 8631393 | 39490 | Democratic | 218.57 |
Washington | West | Olympia | 7705281 | 66456 | Democratic | 115.95 |
West Virginia | South | Charleston | 1793716 | 24038 | Republican | 74.62 |
Wisconsin | Midwest | Madison | 5893718 | 54158 | Republican | 108.82 |
Wyoming | West | Cheyenne | 576851 | 97093 | Republican | 5.94 |
50 rows × 6 columns
.set_index returns a new DataFrame; it does not modify the original DataFrame.
states
State | Region | Capital City | Population | Land Area | Party | Density | |
---|---|---|---|---|---|---|---|
0 | Alabama | South | Montgomery | 5024279 | 50645 | Republican | 99.21 |
1 | Alaska | West | Juneau | 733391 | 570641 | Republican | 1.29 |
2 | Arizona | West | Phoenix | 7151502 | 113594 | Republican | 62.96 |
3 | Arkansas | South | Little Rock | 3011524 | 52035 | Republican | 57.87 |
4 | California | West | Sacramento | 39538223 | 155779 | Democratic | 253.81 |
... | ... | ... | ... | ... | ... | ... | ... |
45 | Virginia | South | Richmond | 8631393 | 39490 | Democratic | 218.57 |
46 | Washington | West | Olympia | 7705281 | 66456 | Democratic | 115.95 |
47 | West Virginia | South | Charleston | 1793716 | 24038 | Republican | 74.62 |
48 | Wisconsin | Midwest | Madison | 5893718 | 54158 | Republican | 108.82 |
49 | Wyoming | West | Cheyenne | 576851 | 97093 | Republican | 5.94 |
50 rows × 7 columns
states = states.set_index('State')
states
Region | Capital City | Population | Land Area | Party | Density | |
---|---|---|---|---|---|---|
State | ||||||
Alabama | South | Montgomery | 5024279 | 50645 | Republican | 99.21 |
Alaska | West | Juneau | 733391 | 570641 | Republican | 1.29 |
Arizona | West | Phoenix | 7151502 | 113594 | Republican | 62.96 |
Arkansas | South | Little Rock | 3011524 | 52035 | Republican | 57.87 |
California | West | Sacramento | 39538223 | 155779 | Democratic | 253.81 |
... | ... | ... | ... | ... | ... | ... |
Virginia | South | Richmond | 8631393 | 39490 | Democratic | 218.57 |
Washington | West | Olympia | 7705281 | 66456 | Democratic | 115.95 |
West Virginia | South | Charleston | 1793716 | 24038 | Republican | 74.62 |
Wisconsin | Midwest | Madison | 5893718 | 54158 | Republican | 108.82 |
Wyoming | West | Cheyenne | 576851 | 97093 | Republican | 5.94 |
50 rows × 6 columns
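As a quick check of what .set_index does, the sketch below applies it to a hypothetical three-row table named toy, using pandas directly. The chosen column moves out of the columns and becomes the row labels, and the original DataFrame is left as it was.

```python
import pandas as pd

# Hypothetical three-row table, used only for illustration.
toy = pd.DataFrame({
    'State': ['Alaska', 'New Jersey', 'Wyoming'],
    'Density': [1.29, 1263.12, 5.94],
})

by_state = toy.set_index('State')

print(list(toy.columns))       # 'State' is still a column in the original
print(list(by_state.columns))  # 'State' is no longer a column here
print(list(by_state.index))    # the state names are now the row labels
```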
# Which one is Pennsylvania? The one whose row label is "Pennsylvania"!
states.get('Density')
State
Alabama           99.21
Alaska             1.29
Arizona           62.96
Arkansas          57.87
California       253.81
                  ...
Virginia         218.57
Washington       115.95
West Virginia     74.62
Wisconsin        108.82
Wyoming            5.94
Name: Density, Length: 50, dtype: float64
To pull out one particular entry of a DataFrame corresponding to a row and column with certain labels:
1. Use .get(column_name) to extract the entire column as a Series.
2. Use .loc[] to access the element of a Series with a particular row label.
In this class, we'll always first access a column, then a row (but row, then column is also possible).
states.get('Density')
State
Alabama           99.21
Alaska             1.29
Arizona           62.96
Arkansas          57.87
California       253.81
                  ...
Virginia         218.57
Washington       115.95
West Virginia     74.62
Wisconsin        108.82
Wyoming            5.94
Name: Density, Length: 50, dtype: float64
states.get('Density').loc['Pennsylvania']
290.60858681804973
First, .get the appropriate column as a Series. Then, access an element in one of two ways:
- .iloc[] uses the integer position.
- .loc[] uses the row label.
states.get('Density')
State
Alabama           99.21
Alaska             1.29
Arizona           62.96
Arkansas          57.87
California       253.81
                  ...
Virginia         218.57
Washington       115.95
West Virginia     74.62
Wisconsin        108.82
Wyoming            5.94
Name: Density, Length: 50, dtype: float64
states.get('Density').iloc[4]
253.80971119342146
states.get('Density').loc['California']
253.80971119342146
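The same comparison can be sketched on a hypothetical three-row table named toy, using pandas directly. Once the state names are the index, .loc looks an element up by name while .iloc still counts rows from the top, starting at 0.

```python
import pandas as pd

# Hypothetical three-row table, used only for illustration.
toy = pd.DataFrame({
    'State': ['Alaska', 'New Jersey', 'Wyoming'],
    'Density': [1.29, 1263.12, 5.94],
})

density = toy.set_index('State').get('Density')

by_label = density.loc['Wyoming']  # look up by row label
by_position = density.iloc[2]      # look up by integer position (third row)

print(by_label, by_position)
```

Both lines reach the same element because Wyoming happens to be the third row of this table.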
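With the default index produced by bpd.read_csv, the integer position and the row label of each row are the same.
bpd.read_csv('data/states.csv')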
State | Region | Capital City | Population | Land Area | Party | |
---|---|---|---|---|---|---|
0 | Alabama | South | Montgomery | 5024279 | 50645 | Republican |
1 | Alaska | West | Juneau | 733391 | 570641 | Republican |
2 | Arizona | West | Phoenix | 7151502 | 113594 | Republican |
3 | Arkansas | South | Little Rock | 3011524 | 52035 | Republican |
4 | California | West | Sacramento | 39538223 | 155779 | Democratic |
... | ... | ... | ... | ... | ... | ... |
45 | Virginia | South | Richmond | 8631393 | 39490 | Democratic |
46 | Washington | West | Olympia | 7705281 | 66456 | Democratic |
47 | West Virginia | South | Charleston | 1793716 | 24038 | Republican |
48 | Wisconsin | Midwest | Madison | 5893718 | 54158 | Republican |
49 | Wyoming | West | Cheyenne | 576851 | 97093 | Republican |
50 rows × 6 columns
bpd.read_csv('data/states.csv').get('Capital City').loc[35]
'Oklahoma City'
bpd.read_csv('data/states.csv').get('Capital City').iloc[35]
'Oklahoma City'
Refer to the babypandas notes and the babypandas documentation when working on assignments.
We'll frame more questions and learn more DataFrame manipulation techniques to answer them. In particular, we'll learn about querying and grouping.