In [1]:

# Run this cell to set up packages for lecture.
from lec06_imports import *

Lecture 6 – Grouping and Data Visualization 📈¶

DSC 10, Winter 2024¶

Announcements¶

Quiz 1 is today in discussion section.
- It is a 20 minute paper-based quiz administered at the beginning of discussion.
  - Afterwards, the TA can help with Homework 1 questions
- Seats are assigned. Search your email for DSC 10 Quiz 1 Seating Assignment or look up your seat here.
Homework 1 is due Thursday at 11:59PM.
Make sure to try the "challenge problems" at the end of Friday's lecture and watch this walkthrough video for solutions.

Agenda¶

Grouping.
Why visualize?
Terminology.
Scatter plots.
Line plots.
Bar charts.

Aside: Keyboard shortcuts¶

There are several keyboard shortcuts built into Jupyter Notebooks designed to help you save time. To see them, either click the keyboard button in the toolbar above or hit the H key on your keyboard (as long as you're not actively editing a cell).

Particularly useful shortcuts:

Action	Keyboard shortcut
Run cell + jump to next cell	SHIFT + ENTER
Save the notebook	CTRL/CMD + S
Create new cell above/below	A/B
Delete cell	DD

Grouping¶

Question: Which region is the most populated?¶

In [2]:

states = bpd.read_csv('data/states.csv').set_index('State')
states = states.assign(Density=states.get('Population') / states.get('Land Area'))
states

Out[2]:

	Region	Capital City	Population	Land Area	Party	Density
State
Alabama	South	Montgomery	5024279	50645	Republican	99.21
Alaska	West	Juneau	733391	570641	Republican	1.29
Arizona	West	Phoenix	7151502	113594	Republican	62.96
...	...	...	...	...	...	...
West Virginia	South	Charleston	1793716	24038	Republican	74.62
Wisconsin	Midwest	Madison	5893718	54158	Republican	108.82
Wyoming	West	Cheyenne	576851	97093	Republican	5.94

50 rows × 6 columns

Organizing states by region¶

We can find the total population of any one region using the tools we already have.

In [3]:

states[states.get('Region') == 'West'].get('Population').sum()

Out[3]:

78588572

In [4]:

states[states.get('Region') == 'Midwest'].get('Population').sum()

Out[4]:

68985454

But can we find the total population of every region all at the same time, without writing very similar code multiple times? Yes, there is a better way!

A new method: `.groupby`¶

Observe what happens when we use the .groupby method on states with the argument 'Region'.

In [5]:

states.groupby('Region').sum()

Out[5]:

	Population	Land Area	Density
Region
Midwest	68985454	750524	1298.78
Northeast	57609148	161912	4957.49
South	125576562	868356	3189.37
West	78588572	1751054	881.62

These populations (for the 'West' and 'Midwest') match the ones we found on the previous slide, except now we get the populations for all regions at the same time. What just happened? 🤯

An illustrative example: Pets 🐱 🐶🐹¶

Consider the DataFrame pets, shown below.

	Species	Color	Weight	Age
0	dog	black	40	5.0
1	cat	golden	15	8.0
2	cat	black	20	9.0
3	dog	white	80	2.0
4	dog	golden	25	0.5
5	hamster	golden	1	3.0

Let's see what happens under the hood when we run pets.groupby('Species').mean().

In [6]:

show_grouping_animation()

Let's try it out!¶

In [7]:

pets = bpd.DataFrame().assign(
    Species=['dog', 'cat', 'cat', 'dog', 'dog', 'hamster'],
    Color=['black', 'golden', 'black', 'white', 'golden', 'golden'],
    Weight=[40, 15, 20, 80, 25, 1],
    Age=[5, 8, 9, 2, 0.5, 3]
)
pets

Out[7]:

	Species	Color	Weight	Age
0	dog	black	40	5.0
1	cat	golden	15	8.0
2	cat	black	20	9.0
3	dog	white	80	2.0
4	dog	golden	25	0.5
5	hamster	golden	1	3.0

In [8]:

pets.groupby('Species').mean()

Out[8]:

	Weight	Age
Species
cat	17.50	8.5
dog	48.33	2.5
hamster	1.00	3.0

It takes several steps to go from the original pets DataFrame to this grouped DataFrame, but we don't get to see any of Python's inner workings, just the final output.

Back to states: which region is most populated?¶

In [9]:

states.groupby('Region').sum()

Out[9]:

	Population	Land Area	Density
Region
Midwest	68985454	750524	1298.78
Northeast	57609148	161912	4957.49
South	125576562	868356	3189.37
West	78588572	1751054	881.62

In [10]:

# Note the use of .index – remember, the index isn't a column!
(
    states
    .groupby('Region')
    .sum()
    .sort_values(by='Population', ascending=False)
    .index[0]
)

Out[10]:

'South'

Using `.groupby` in general¶

In short, .groupby aggregates (collects) all rows with the same value in a specified column (e.g. 'Region') into a single row in the resulting DataFrame, using an aggregation method (e.g. .sum()) to combine values from different rows with the same value in the specified column.

To use .groupby:

Choose a column to group by.
- .groupby(column_name) will gather rows which have the same value in the specified column (column_name).
- In the resulting DataFrame, there will be one row for every unique value in that column.

Choose an aggregation method.
- The aggregation method will be applied within each group.
- The aggregation method is applied individually to each column.
  - If it doesn't make sense to use the aggregation method on a column, the column is dropped from the output.
- Common aggregation methods include .count(), .sum(), .mean(), .median(), .max(), and .min().

Observations on grouping¶

After grouping, the index changes. The new row labels are the group labels (i.e., the unique values in the column that we grouped on), sorted in ascending order.

In [11]:

states

Out[11]:

	Region	Capital City	Population	Land Area	Party	Density
State
Alabama	South	Montgomery	5024279	50645	Republican	99.21
Alaska	West	Juneau	733391	570641	Republican	1.29
Arizona	West	Phoenix	7151502	113594	Republican	62.96
...	...	...	...	...	...	...
West Virginia	South	Charleston	1793716	24038	Republican	74.62
Wisconsin	Midwest	Madison	5893718	54158	Republican	108.82
Wyoming	West	Cheyenne	576851	97093	Republican	5.94

50 rows × 6 columns

In [12]:

states.groupby('Region').sum()

Out[12]:

	Population	Land Area	Density
Region
Midwest	68985454	750524	1298.78
Northeast	57609148	161912	4957.49
South	125576562	868356	3189.37
West	78588572	1751054	881.62

The aggregation method is applied separately to each column. If it does not make sense to apply the aggregation method to a certain column, the column will disappear. 🐇🎩

Since the aggregation method is applied to each column separately, the rows of the resulting DataFrame need to be interpreted with care.

In [13]:

states.groupby('Region').max()

Out[13]:

	Capital City	Population	Land Area	Party	Density
Region
Midwest	Topeka	12812508	81759	Republican	288.77
Northeast	Trenton	20201249	47126	Republican	1263.12
South	Tallahassee	29145505	261232	Republican	636.37
West	Santa Fe	39538223	570641	Republican	253.81

In [14]:

12812508 / 81759 == 288.77

Out[14]:

False

The column names don't make sense after grouping with the .count() aggregation method.

In [15]:

states.groupby('Region').count()

Out[15]:

	Capital City	Population	Land Area	Party	Density
Region
Midwest	12	12	12	12	12
Northeast	9	9	9	9	9
South	16	16	16	16	16
West	13	13	13	13	13

Dropping, renaming, and reordering columns¶

Consider dropping unneeded columns and renaming columns as follows:

Use .assign to create a new column containing the same values as the old column(s).
Use .drop(columns=list_of_column_labels) to drop the old column(s).
- Alternatively, use .get(list_of_column_labels) to keep only the columns in the given list. The columns will appear in the order you specify, so this is also useful for reordering columns!

In [16]:

states_by_region = states.groupby('Region').count()
states_by_region = states_by_region.assign(
                    States=states_by_region.get('Capital City')
                    ).get(['States'])
states_by_region

Out[16]:

	States
Region
Midwest	12
Northeast	9
South	16
West	13

Why visualize?¶

Little Women¶

In Lecture 1, we were able to answer questions about the plot of Little Women without having to read the novel and without having to understand Python code. Some of those questions included:

Who is the main character?
Which pair of characters gets married in Chapter 35?

We answered these questions from a data visualization alone!

In [17]:

bpd.read_csv('data/lw_counts.csv').plot(x='Chapter');

Napoleon's March¶

"Probably the best statistical graphic ever drawn, this map by Charles Joseph Minard portrays the losses suffered by Napoleon's army in the Russian campaign of 1812." (source)

Why visualize?¶

Computers are better than humans at crunching numbers, but humans are better at identifying visual patterns.

Visualizations allow us to understand lots of data quickly – they make it easier to spot trends and communicate our results with others.

There are many types of visualizations; in this class, we'll look at scatter plots, line plots, bar charts, and histograms, but there are many others.
- The right choice depends on the type of data.

Terminology¶

Individuals and variables¶

Individual (row): Person/place/thing for which data is recorded. Also called an observation.

Variable (column): Something that is recorded for each individual. Also called a feature.

Types of variables¶

There are two main types of variables:

Numerical: It makes sense to do arithmetic with the values.
Categorical: Values fall into categories, that may or may not have some order to them.

Note that here, "variable" does not mean a variable in Python, but rather it means a column in a DataFrame.

Examples of numerical variables¶

Salaries of NBA players 🏀.
- Individual: An NBA player.
- Variable: Their salary.

Company's annual profit 💰.
- Individual: A company.
- Variable: Its annual profit.

Flu shots administered per day 💉.
- Individual: Date.
- Variable: Number of flu shots administered on that date.

Examples of categorical variables¶

Movie genres 🎬.
- Individual: A movie.
- Variable: Its genre.

Zip codes 🏠.
- Individual: US resident.
- Variable: Zip code.
  - Even though they look like numbers, zip codes are categorical (arithmetic doesn't make sense).

Level of prior programming experience for students in DSC 10 🧑‍🎓.
- Individual: Student in DSC 10.
- Variable: Their level of prior programming experience, e.g. none, low, medium, or high.
  - There is an order to these categories!

Concept Check ✅ – Answer at cc.dsc10.com ¶

Which of these is not a numerical variable?

A. Fuel economy in miles per gallon.

B. Number of quarters at UCSD.

C. College at UCSD (Sixth, Seventh, etc).

D. Bank account number.

E. More than one of these are not numerical variables.

Types of visualizations¶

The type of visualization we create depends on the kinds of variables we're visualizing.

Scatter plot: Numerical vs. numerical.
Line plot: Sequential numerical (time) vs. numerical.
Bar chart: Categorical vs. numerical.
Histogram: Numerical.
- Will cover next time.

We may interchange the words "plot", "chart", and "graph"; they all mean the same thing.

Scatter plots¶

The data: exoplanets discovered by NASA 🪐¶

An exoplanet is a planet outside our solar system. NASA has discovered over 5,000 exoplanets so far in its search for signs of life beyond Earth. 👽

Column	Contents

'Distance'| Distance from Earth, in light years. 'Magnitude'| Apparent magnitude, which measures brightness in such a way that brighter objects have lower values. 'Type'| Categorization of planet based on its composition and size. 'Year'| When the planet was discovered. 'Detection'| The method of detection used to discover the planet. 'Mass'| The ratio of the planet's mass to Earth's mass. 'Radius'| The ratio of the planet's radius to Earth's radius.

In [18]:

exo = bpd.read_csv('data/exoplanets.csv').set_index('Name')
exo

Out[18]:

	Distance	Magnitude	Type	Year	Detection	Mass	Radius
Name
11 Comae Berenices b	304.0	4.72	Gas Giant	2007	Radial Velocity	6165.90	11.88
11 Ursae Minoris b	409.0	5.01	Gas Giant	2009	Radial Velocity	4684.81	11.99
14 Andromedae b	246.0	5.23	Gas Giant	2008	Radial Velocity	1525.58	12.65
...	...	...	...	...	...	...	...
YZ Ceti b	12.0	12.07	Terrestrial	2017	Radial Velocity	0.70	0.91
YZ Ceti c	12.0	12.07	Super Earth	2017	Radial Velocity	1.14	1.05
YZ Ceti d	12.0	12.07	Super Earth	2017	Radial Velocity	1.09	1.03

5043 rows × 7 columns

Scatter plots¶

What is the relationship between 'Distance' and 'Magnitude'?

In [19]:

exo.plot(kind='scatter', x='Distance', y='Magnitude');

Further planets have greater 'Magnitude' (meaning they are less bright), which makes sense.
The data appears curved because 'Magnitude' is measured on a logarithmic scale. A decrease of one unit in 'Magnitude' corresponds to a 2.5 times increase in brightness.

Scatter plots¶

Scatter plots visualize the relationship between two numerical variables.

To create one from a DataFrame df, use

df.plot(
  kind='scatter', 
  x=x_column_for_horizontal, 
  y=y_column_for_vertical
)

The resulting scatter plot has one point per row of df.
If you put a semicolon after a call to .plot, it will hide the weird text output that displays.

Zooming in 🔍¶

The majority of exoplanets are less than 10,000 light years away; if we'd like to zoom in on just these exoplanets, we can query before plotting.

In [20]:

exo[exo.get('Distance') < 10000].plot(kind='scatter', x='Distance', y='Magnitude');

Line plots 📉¶

Line plots¶

How has the 'Magnitude' of newly discovered exoplanets changed over time?

In [21]:

# There were multiple exoplanets discovered each year.
# What operation can we apply to this DataFrame so that there is one row per year?
exo

Out[21]:

	Distance	Magnitude	Type	Year	Detection	Mass	Radius
Name
11 Comae Berenices b	304.0	4.72	Gas Giant	2007	Radial Velocity	6165.90	11.88
11 Ursae Minoris b	409.0	5.01	Gas Giant	2009	Radial Velocity	4684.81	11.99
14 Andromedae b	246.0	5.23	Gas Giant	2008	Radial Velocity	1525.58	12.65
...	...	...	...	...	...	...	...
YZ Ceti b	12.0	12.07	Terrestrial	2017	Radial Velocity	0.70	0.91
YZ Ceti c	12.0	12.07	Super Earth	2017	Radial Velocity	1.14	1.05
YZ Ceti d	12.0	12.07	Super Earth	2017	Radial Velocity	1.09	1.03

5043 rows × 7 columns

Let's calculate the average 'Magnitude' of all exoplanets discovered in each 'Year'.

In [22]:

exo.groupby('Year').mean()

Out[22]:

	Distance	Magnitude	Mass	Radius
Year
1995	50.00	5.45	146.20	13.97
1996	51.33	5.12	1020.67	13.09
1997	57.00	5.41	332.10	13.53
...	...	...	...	...
2021	1944.22	13.01	255.42	4.44
2022	508.61	10.62	943.16	6.77
2023	451.89	12.09	162.78	7.12

29 rows × 4 columns

In [23]:

exo.groupby('Year').mean().plot(kind='line', y='Magnitude');

It looks like the brightest planets were discovered first, which makes sense.
NASA's Kepler space telescope began its nine-year mission in 2009, leading to a boom in the discovery of exoplanets.

Line plots¶

Line plots show trends in numerical variables over time.

To create one from a DataFrame df, use

df.plot(
  kind='line', 
  x=x_column_for_horizontal, 
  y=y_column_for_vertical
)

To use the index as the x-axis, omit the x= argument.
- This doesn't work for scatterplots, but it works for most other plot types.

Extra video on line plots¶

If you're curious how line plots work under the hood, watch this video we made a few quarters ago.

In [24]:

YouTubeVideo('glzZ04D1kDg')

Out[24]:

Bar charts 📊¶

Bar charts¶

How big are each of the different 'Type's of exoplanets, on average?

In [25]:

types = exo.groupby('Type').mean()
types

Out[25]:

	Distance	Magnitude	Year	Mass	Radius
Type
Gas Giant	1096.40	10.30	2013.73	1472.39	12.74
Neptune-like	2189.02	13.52	2016.59	15.28	3.11
Super Earth	1916.26	13.85	2016.43	5.81	1.58
Terrestrial	1373.60	13.45	2016.37	1.62	0.85

In [26]:

types.plot(kind='barh', y='Radius');

In [27]:

types.plot(kind='barh', y='Mass');

It looks like the 'Gas Giant's are aptly named!

Bar charts¶

Bar charts visualize the relationship between a categorical variable and a numerical variable.
In a bar chart...
- The thickness and spacing of bars is arbitrary.
- The order of the categorical labels doesn't matter.

To create one from a DataFrame df, use

df.plot(
  kind='barh', 
  x=categorical_column_name, 
  y=numerical_column_name
)

The "h" in 'barh' stands for "horizontal".
- It's easier to read labels this way.
Note that in the previous chart, we set y='Mass' even though mass is measured by x-axis length.

Bar charts and sorting¶

What are the most popular 'Detection' methods for discovering exoplanets?

In [28]:

# Count how many exoplanets are discovered by each detection method.
popular_detection = exo.groupby('Detection').count()
popular_detection

Out[28]:

	Distance	Magnitude	Type	Year	Mass	Radius
Detection
Astrometry	1	1	1	1	1	1
Direct Imaging	50	50	50	50	50	50
Disk Kinematics	1	1	1	1	1	1
...	...	...	...	...	...	...
Radial Velocity	1019	1019	1019	1019	1019	1019
Transit	3914	3914	3914	3914	3914	3914
Transit Timing Variations	23	23	23	23	23	23

11 rows × 6 columns

In [29]:

# Give columns more meaningful names and eliminate redundancy.
popular_detection = (popular_detection.assign(Count=popular_detection.get('Distance'))
                                      .get(['Count'])
                                      .sort_values(by='Count', ascending=False)
                    )
popular_detection

Out[29]:

	Count
Detection
Transit	3914
Radial Velocity	1019
Direct Imaging	50
...	...
Astrometry	1
Disk Kinematics	1
Pulsar Timing	1

11 rows × 1 columns

In [30]:

# Notice that the bars appear in the opposite order relative to the DataFrame.
popular_detection.plot(kind='barh', y='Count');

In [31]:

# Change "barh" to "bar" to get a vertical bar chart. 
# These are harder to read, but the bars do appear in the same order as the DataFrame.
popular_detection.plot(kind='bar', y='Count');

Multiple plots on the same axes¶

Can we look at both the average 'Magnitude' and the average 'Radius' for each 'Type' at the same time?

In [32]:

types.get(['Magnitude', 'Radius']).plot(kind='barh');

How did we do that?

Overlaying plots¶

When calling .plot, if we omit the y=column_name argument, all other columns are plotted.

In [33]:

types

Out[33]:

	Distance	Magnitude	Year	Mass	Radius
Type
Gas Giant	1096.40	10.30	2013.73	1472.39	12.74
Neptune-like	2189.02	13.52	2016.59	15.28	3.11
Super Earth	1916.26	13.85	2016.43	5.81	1.58
Terrestrial	1373.60	13.45	2016.37	1.62	0.85

In [34]:

types.plot(kind='barh');

Selecting multiple columns at once¶

To select multiple columns, use .get([column_1, ..., column_k]).
Passing a list of column labels to .get returns a DataFrame.
- .get([column_name]) will return a DataFrame with just one column!

In [35]:

types

Out[35]:

	Distance	Magnitude	Year	Mass	Radius
Type
Gas Giant	1096.40	10.30	2013.73	1472.39	12.74
Neptune-like	2189.02	13.52	2016.59	15.28	3.11
Super Earth	1916.26	13.85	2016.43	5.81	1.58
Terrestrial	1373.60	13.45	2016.37	1.62	0.85

In [36]:

types.get(['Magnitude', 'Radius'])

Out[36]:

	Magnitude	Radius
Type
Gas Giant	10.30	12.74
Neptune-like	13.52	3.11
Super Earth	13.85	1.58
Terrestrial	13.45	0.85

In [37]:

types.get(['Magnitude', 'Radius']).plot(kind='barh');

Summary¶

Visualizations make it easy to extract patterns from datasets.
There are two main types of variables: categorical and numerical.
The types of the variables we're visualizing inform our choice of which type of visualization to use.
Today, we looked at scatter plots, line plots, and bar charts.
Next time: histograms.

Lecture 6 – Grouping and Data Visualization 📈¶

DSC 10, Winter 2024¶

Announcements¶

Agenda¶

Aside: Keyboard shortcuts¶

Grouping¶

Question: Which region is the most populated?¶

Organizing states by region¶

A new method: .groupby¶

An illustrative example: Pets 🐱 🐶🐹¶

Let's try it out!¶

Back to states: which region is most populated?¶

Using .groupby in general¶

Observations on grouping¶

Dropping, renaming, and reordering columns¶

Why visualize?¶

Little Women¶

Napoleon's March¶

Why visualize?¶

Terminology¶

Individuals and variables¶

Types of variables¶

Examples of numerical variables¶

Examples of categorical variables¶

Concept Check ✅ – Answer at cc.dsc10.com¶

Types of visualizations¶

Scatter plots¶

The data: exoplanets discovered by NASA 🪐¶

Scatter plots¶

Scatter plots¶

Zooming in 🔍¶

Line plots 📉¶

Line plots¶

Line plots¶

Extra video on line plots¶

Bar charts 📊¶

Bar charts¶

Bar charts¶

Bar charts and sorting¶

Multiple plots on the same axes¶

Overlaying plots¶

Selecting multiple columns at once¶

Summary¶

Summary¶

A new method: `.groupby`¶

Using `.groupby` in general¶

Concept Check ✅ – Answer at cc.dsc10.com ¶