# Run this cell to set up packages for lecture.
from lec07_imports import *
Today's material is quite theoretical – you can practice with it in Friday's extra practice session!
The type of visualization we create depends on the kinds of variables we're visualizing.
We may interchange the words "plot", "chart", and "graph"; they all mean the same thing.
How often does a variable take on a certain value?
The distribution of a categorical variable can be displayed as a table or bar chart, among other ways!
For example, let's look at the distribution of exoplanet 'Type's. To do so, we'll need to group.
exo = bpd.read_csv('data/exoplanets.csv').set_index('Name')
exo
Name | Distance | Magnitude | Type | Year | Detection | Mass | Radius |
---|---|---|---|---|---|---|---|
11 Comae Berenices b | 304.0 | 4.72 | Gas Giant | 2007 | Radial Velocity | 6165.90 | 11.88 |
11 Ursae Minoris b | 409.0 | 5.01 | Gas Giant | 2009 | Radial Velocity | 4684.81 | 11.99 |
14 Andromedae b | 246.0 | 5.23 | Gas Giant | 2008 | Radial Velocity | 1525.58 | 12.65 |
... | ... | ... | ... | ... | ... | ... | ... |
YZ Ceti b | 12.0 | 12.07 | Terrestrial | 2017 | Radial Velocity | 0.70 | 0.91 |
YZ Ceti c | 12.0 | 12.07 | Super Earth | 2017 | Radial Velocity | 1.14 | 1.05 |
YZ Ceti d | 12.0 | 12.07 | Super Earth | 2017 | Radial Velocity | 1.09 | 1.03 |
5043 rows × 7 columns
# Remember, when we group and use .count(), the column names aren't meaningful.
type_counts = exo.groupby('Type').count()
type_counts
Type | Distance | Magnitude | Year | Detection | Mass | Radius |
---|---|---|---|---|---|---|
Gas Giant | 1480 | 1480 | 1480 | 1480 | 1480 | 1480 |
Neptune-like | 1793 | 1793 | 1793 | 1793 | 1793 | 1793 |
Super Earth | 1577 | 1577 | 1577 | 1577 | 1577 | 1577 |
Terrestrial | 193 | 193 | 193 | 193 | 193 | 193 |
# As a result, we could have set y='Magnitude', for example, and gotten the same plot.
type_counts.plot(kind='barh', y='Distance',
                 legend=False, xlabel='Count', title='Distribution of Exoplanet Types');
Notice the optional title argument. Some other useful optional arguments are legend, figsize, xlabel, and ylabel. There are many optional arguments.
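Going back to the grouping step, here is a minimal plain-Python sketch of what counting rows per category computes. The short planet_types list is a made-up stand-in for the 'Type' column, not the real data, and the text "bars" only mimic the horizontal bar chart.

```python
from collections import Counter

# Hypothetical stand-in for the 'Type' column (not the real dataset).
planet_types = ['Gas Giant', 'Terrestrial', 'Super Earth',
                'Gas Giant', 'Neptune-like', 'Gas Giant', 'Super Earth']

# Counting rows per category is what grouping by 'Type' and counting reports.
# When no values are missing, every column holds the same counts,
# which is why any column works as y in the bar chart.
type_counts = Counter(planet_types)

# Print a text version of the horizontal bar chart, sorted by category.
for planet_type in sorted(type_counts):
    print(f"{planet_type:<13} {'#' * type_counts[planet_type]}")
```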
It looks like terrestrial exoplanets are the rarest in the dataset. In the last lecture, we also saw that they have the smallest average radius of any 'Type'.
exo.groupby('Type').mean().get('Radius')
Type
Gas Giant       12.74
Neptune-like     3.11
Super Earth      1.58
Terrestrial      0.85
Name: Radius, dtype: float64
Let's look into them further!
terr = exo[exo.get('Type') == 'Terrestrial']
terr
Name | Distance | Magnitude | Type | Year | Detection | Mass | Radius |
---|---|---|---|---|---|---|---|
EPIC 201497682 b | 825.0 | 13.95 | Terrestrial | 2019 | Transit | 0.26 | 0.69 |
EPIC 201757695.02 | 1884.0 | 14.97 | Terrestrial | 2020 | Transit | 0.69 | 0.91 |
EPIC 201833600 c | 840.0 | 14.71 | Terrestrial | 2019 | Transit | 0.97 | 1.00 |
... | ... | ... | ... | ... | ... | ... | ... |
TRAPPIST-1 e | 41.0 | 17.02 | Terrestrial | 2017 | Transit | 0.69 | 0.92 |
TRAPPIST-1 h | 41.0 | 17.02 | Terrestrial | 2017 | Transit | 0.33 | 0.76 |
YZ Ceti b | 12.0 | 12.07 | Terrestrial | 2017 | Radial Velocity | 0.70 | 0.91 |
193 rows × 7 columns
Let's focus on the 'Radius' column of terr. To learn more about it, we can use the .describe() method.
terr.get('Radius').describe()
count    193.00
mean       0.85
std        0.26
          ...
50%        0.86
75%        0.92
max        3.13
Name: Radius, Length: 8, dtype: float64
But how do we visualize its distribution?
The distribution of 'Radius', a numerical variable
Previously, we looked at the distribution of 'Type', which is a categorical variable.
Now let's look at the distribution of 'Radius', which is a numerical variable. To try and see the distribution of 'Radius', we could group by that column and count how many terrestrial planets there are of each radius.
terr_radius = terr.groupby('Radius').count()
terr_radius = (terr_radius
               .assign(Count=terr_radius.get('Distance'))
               .get(['Count'])
              )
terr_radius
Radius | Count |
---|---|
0.37 | 1 |
0.40 | 1 |
0.47 | 1 |
... | ... |
1.80 | 1 |
2.85 | 1 |
3.13 | 1 |
85 rows × 1 columns
terr_radius.plot(kind='bar', y='Count', figsize=(15, 5));
The horizontal axis should be numerical (like a number line), not categorical. There should be more space between certain bars than others.
For instance, the planet with 'Radius' 1.8 is 80% larger than a planet with 'Radius' 1, but they appear to be about the same size here.
Instead of a bar chart, we'll visualize the distribution of a numerical variable with a density histogram. Let's see what a density histogram for 'Radius' looks like. What do you notice about this visualization?
# Ignore the code for right now.
terr.plot(kind='hist', y='Radius', density=True, bins=np.arange(0, 3.5, 0.25), ec='w');
# There are 7 terrestrial exoplanets with a radius of exactly 1.0,
# but the height of the bar starting at 1.0 is not 7!
terr[terr.get('Radius') == 1]
Name | Distance | Magnitude | Type | Year | Detection | Mass | Radius |
---|---|---|---|---|---|---|---|
EPIC 201833600 c | 840.0 | 14.71 | Terrestrial | 2019 | Transit | 0.97 | 1.0 |
EPIC 206215704 b | 358.0 | 17.83 | Terrestrial | 2019 | Transit | 0.97 | 1.0 |
K2-157 b | 973.0 | 12.94 | Terrestrial | 2018 | Transit | 0.97 | 1.0 |
K2-239 c | 101.0 | 14.63 | Terrestrial | 2018 | Transit | 0.97 | 1.0 |
Kepler-1417 b | 3235.0 | 14.04 | Terrestrial | 2016 | Transit | 0.97 | 1.0 |
Kepler-1464 c | 3757.0 | 14.36 | Terrestrial | 2016 | Transit | 0.97 | 1.0 |
Kepler-392 b | 2223.0 | 13.53 | Terrestrial | 2014 | Transit | 0.97 | 1.0 |
binning_animation()
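To create a density histogram from a DataFrame df, use
df.plot(
    kind='hist',
    y=column_name,
    density=True
)
Optionally, set ec='w' to see where bins start and end more clearly.
By default, the data is split into 10 equal-width bins. To use a different number of equal-width bins, set bins equal to some other integer value. To choose custom bins, set bins equal to a list or array of bin endpoints.
# There are 10 bins by default, some of which are empty.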
terr.plot(kind='hist', y='Radius', density=True, ec='w');
terr.plot(kind='hist', y='Radius', density=True, bins=20, ec='w');
terr.plot(kind='hist', y='Radius', density=True, bins=[0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5], ec='w');
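As a rough sketch of how the default binning works, the snippet below splits the range from the smallest to the largest value into 10 equal-width bins, which is what happens when bins is left out (assuming the usual NumPy-style default). The minimum 0.37 and maximum 3.13 come from the terr data above; equal_width_edges is a hypothetical helper name.

```python
def equal_width_edges(smallest, largest, num_bins=10):
    """Return num_bins + 1 evenly spaced bin endpoints from smallest to largest."""
    width = (largest - smallest) / num_bins
    return [smallest + i * width for i in range(num_bins + 1)]

# Smallest and largest terrestrial 'Radius' values from the lecture data.
edges = equal_width_edges(0.37, 3.13)

print(len(edges) - 1)                 # number of bins
print(round(edges[1] - edges[0], 3))  # width of each bin
```

Because the default endpoints depend on the data's minimum and maximum, they are rarely round numbers, which is one reason to pass bins explicitly.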
In the three histograms above, what is different and what is the same?
For equal-width bins, use np.arange. Be careful with the endpoints: bins=np.arange(4) creates the bins [0, 1), [1, 2), [2, 3]. Only the last bin includes its right endpoint.
terr.plot(kind='hist', y='Radius', density=True,
          bins=np.arange(0, 3.5, 0.5),
          ec='w');
terr.sort_values('Radius', ascending=False)
Name | Distance | Magnitude | Type | Year | Detection | Mass | Radius |
---|---|---|---|---|---|---|---|
Kepler-33 c | 3944.0 | 14.10 | Terrestrial | 2011 | Transit | 0.39 | 3.13 |
K2-138 f | 661.0 | 12.25 | Terrestrial | 2017 | Transit | 1.63 | 2.85 |
Kepler-11 b | 2108.0 | 13.82 | Terrestrial | 2010 | Transit | 1.90 | 1.80 |
... | ... | ... | ... | ... | ... | ... | ... |
Kepler-102 b | 352.0 | 12.07 | Terrestrial | 2014 | Transit | 4.30 | 0.47 |
Kepler-444 b | 119.0 | 8.87 | Terrestrial | 2015 | Transit | 0.04 | 0.40 |
Kepler-37 e | 209.0 | 9.77 | Terrestrial | 2014 | Transit Timing Variations | 0.03 | 0.37 |
193 rows × 7 columns
In the above example, the terrestrial exoplanet with the largest radius (Kepler-33 c) is not included because the rightmost bin is [2.5, 3.0] and Kepler-33 c has a 'Radius' of 3.13.
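The endpoint rules can be sketched in plain Python. The find_bin function below is a hypothetical helper, not part of babypandas or NumPy; it only mirrors the convention that every bin is closed on the left and open on the right, except the last bin, which also includes its right endpoint.

```python
def find_bin(value, edges):
    """Return the (left, right) bin containing value, or None if it is in no bin.

    Every bin is [left, right), except the last, which is [left, right].
    """
    last = len(edges) - 2  # index of the last bin
    for i in range(len(edges) - 1):
        left, right = edges[i], edges[i + 1]
        if left <= value < right or (i == last and value == right):
            return (left, right)
    return None

# Same endpoints as np.arange(0, 3.5, 0.5): 0, 0.5, ..., 3.0.
edges = [0.5 * i for i in range(7)]

print(find_bin(1.0, edges))   # 1.0 starts a bin, so it lands in [1.0, 1.5)
print(find_bin(3.0, edges))   # the last bin includes its right endpoint
print(find_bin(3.13, edges))  # Kepler-33 c falls outside every bin
```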
terr.plot(kind='hist', y='Radius', density=True,
          bins=[0, 0.25, 0.5, 0.75, 2, 4], ec='w');
In the above example, the bins have different widths!
terr.plot(kind='hist', y='Radius', density=True,
          bins=[0, 0.25, 0.5, 0.75, 2, 4], ec='w');
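Based on this histogram, what proportion of terrestrial exoplanets have a 'Radius' between 0.5 and 0.75?
The height of the bar for the bin [0.5, 0.75) is about 0.8, and the width of the bin is 0.75 - 0.5 = 0.25.
Therefore, using the formula for the area of a rectangle,
$$\text{Area} = \text{Height} \times \text{Width} \approx 0.8 \times 0.25 = 0.2$$
so about 20% of terrestrial exoplanets have a 'Radius' in this range. We can check this by counting directly.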
in_range = terr[(terr.get('Radius') >= 0.5) & (terr.get('Radius') < 0.75)].shape[0]
in_range
39
in_range / terr.shape[0]
0.20207253886010362
This matches the result we got. (Not exactly, since we made an estimate for the height.)
Since a bar of a histogram is a rectangle, its area is given by
$$\text{Area} = \text{Height} \times \text{Width}$$
That means
$$\text{Height} = \frac{\text{Area}}{\text{Width}} = \frac{\text{Proportion (or Percentage)}}{\text{Width}}$$
This implies that the units for height are "proportion per ($x$-axis unit)". The $y$-axis represents a sort of density, which is why we call it a density histogram.
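As a quick numerical check of the height formula, the sketch below recomputes the height of the [0.5, 0.75) bar from the counts found earlier (39 of the 193 terrestrial exoplanets). The variable names are illustrative.

```python
in_range = 39      # terrestrial exoplanets with 0.5 <= 'Radius' < 0.75
total = 193        # all terrestrial exoplanets
width = 0.75 - 0.5

proportion = in_range / total  # area of the bar
height = proportion / width    # proportion per unit of radius

print(round(proportion, 3), round(height, 3))
```

The height comes out a little above 0.8, which is consistent with reading the bar off the plot by eye.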
terr.plot(kind='hist', y='Radius', density=True,
          bins=[0, 0.25, 0.5, 0.75, 2, 4], ec='w');
The $y$-axis units here are "proportion per radius", since the $x$-axis represents radius.
We could label the $y$-axis with these units using ylabel, but we usually don't.
Suppose we created a density histogram of people's shoe sizes. 👟 Below are the bins we chose along with their heights.
Bin | Height of Bar |
---|---|
[3, 7) | 0.05 |
[7, 10) | 0.1 |
[10, 12) | 0.15 |
[12, 16] | $X$ |
What should the value of $X$ be so that this is a valid histogram?
A. 0.02 B. 0.05 C. 0.2 D. 0.5 E. 0.7
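One way to check an answer to this question is to use the fact that the bar areas of a density histogram must add up to 1. The sketch below uses Python's fractions module to avoid rounding error; the variable names are illustrative.

```python
from fractions import Fraction

# (left endpoint, right endpoint, height) for the three known bars.
known_bars = [
    (3, 7, Fraction(5, 100)),
    (7, 10, Fraction(10, 100)),
    (10, 12, Fraction(15, 100)),
]

# The area of each bar is its height times its width.
known_area = sum((right - left) * height for left, right, height in known_bars)

# The last bar, [12, 16], must supply whatever area is left over.
remaining_area = 1 - known_area
x = remaining_area / (16 - 12)

print(float(x))
```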
Bar chart | Histogram |
---|---|
Shows the distribution of a categorical variable | Shows the distribution of a numerical variable |
Plotted from 2 columns of a DataFrame | Plotted from 1 column of a DataFrame |
1 categorical axis, 1 numerical axis | 2 numerical axes |
Bars have arbitrary, but equal, widths and spacing | Horizontal axis is numerical and to scale |
Lengths of bars are proportional to the numerical quantity of interest | Height measures density; areas are proportional to the proportion (percent) of individuals |
In this class, "histogram" will always mean a "density histogram". We will only use density histograms.
Note: It's possible to create what's called a frequency histogram where the $y$-axis simply represents a count of the number of values in each bin.
While easier to interpret, frequency histograms don't have the important property that the total area is 1, so they can't be connected to probability in the same way that density histograms can. This property will be useful to us later on in the course.
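To see the difference numerically, here is a small sketch that turns the bin counts of a frequency histogram into density heights and checks that the total area is 1. The counts are made up for illustration (they are not the exoplanet data); only the idea of unequal bin widths is taken from the earlier example.

```python
# Bin endpoints with unequal widths, and made-up counts per bin (hypothetical data).
edges = [0, 0.25, 0.5, 0.75, 2, 4]
counts = [2, 8, 40, 140, 10]

total = sum(counts)
widths = [right - left for left, right in zip(edges, edges[1:])]

# Density height = (proportion of values in the bin) / (bin width).
heights = [count / total / width for count, width in zip(counts, widths)]

# Frequency histogram: the total "area" depends on the counts and widths.
frequency_area = sum(count * width for count, width in zip(counts, widths))
# Density histogram: the bar areas are the proportions, so they sum to 1.
density_area = sum(height * width for height, width in zip(heights, widths))

print(frequency_area, density_area)
```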
Let's look back at the exo DataFrame.
exo
Name | Distance | Magnitude | Type | Year | Detection | Mass | Radius |
---|---|---|---|---|---|---|---|
11 Comae Berenices b | 304.0 | 4.72 | Gas Giant | 2007 | Radial Velocity | 6165.90 | 11.88 |
11 Ursae Minoris b | 409.0 | 5.01 | Gas Giant | 2009 | Radial Velocity | 4684.81 | 11.99 |
14 Andromedae b | 246.0 | 5.23 | Gas Giant | 2008 | Radial Velocity | 1525.58 | 12.65 |
... | ... | ... | ... | ... | ... | ... | ... |
YZ Ceti b | 12.0 | 12.07 | Terrestrial | 2017 | Radial Velocity | 0.70 | 0.91 |
YZ Ceti c | 12.0 | 12.07 | Super Earth | 2017 | Radial Velocity | 1.14 | 1.05 |
YZ Ceti d | 12.0 | 12.07 | Super Earth | 2017 | Radial Velocity | 1.09 | 1.03 |
5043 rows × 7 columns
Can we look at both the average 'Magnitude' and the average 'Radius' for each 'Type' at the same time?
types = exo.groupby('Type').mean()
types
Type | Distance | Magnitude | Year | Mass | Radius |
---|---|---|---|---|---|
Gas Giant | 1096.40 | 10.30 | 2013.73 | 1472.39 | 12.74 |
Neptune-like | 2189.02 | 13.52 | 2016.59 | 15.28 | 3.11 |
Super Earth | 1916.26 | 13.85 | 2016.43 | 5.81 | 1.58 |
Terrestrial | 1373.60 | 13.45 | 2016.37 | 1.62 | 0.85 |
types.get(['Magnitude', 'Radius']).plot(kind='barh');
How did we do that?
When calling .plot, if we omit the y=column_name argument, all other columns are plotted.
types
Type | Distance | Magnitude | Year | Mass | Radius |
---|---|---|---|---|---|
Gas Giant | 1096.40 | 10.30 | 2013.73 | 1472.39 | 12.74 |
Neptune-like | 2189.02 | 13.52 | 2016.59 | 15.28 | 3.11 |
Super Earth | 1916.26 | 13.85 | 2016.43 | 5.81 | 1.58 |
Terrestrial | 1373.60 | 13.45 | 2016.37 | 1.62 | 0.85 |
types.plot(kind='barh');
To select multiple columns, use .get([column_1, ..., column_k]).
Passing a list of column labels to .get returns a DataFrame. In particular, .get([column_name]) will return a DataFrame with just one column!
types
Type | Distance | Magnitude | Year | Mass | Radius |
---|---|---|---|---|---|
Gas Giant | 1096.40 | 10.30 | 2013.73 | 1472.39 | 12.74 |
Neptune-like | 2189.02 | 13.52 | 2016.59 | 15.28 | 3.11 |
Super Earth | 1916.26 | 13.85 | 2016.43 | 5.81 | 1.58 |
Terrestrial | 1373.60 | 13.45 | 2016.37 | 1.62 | 0.85 |
types.get(['Magnitude', 'Radius'])
Type | Magnitude | Radius |
---|---|---|
Gas Giant | 10.30 | 12.74 |
Neptune-like | 13.52 | 3.11 |
Super Earth | 13.85 | 1.58 |
Terrestrial | 13.45 | 0.85 |
types.get(['Magnitude', 'Radius']).plot(kind='barh');
Recipe:
First, .get only the columns that contain information relevant to your plot (or, equivalently, .drop all extraneous columns).
Then, if the $x$-axis values are stored in a column rather than the index, specify that column with .plot(x=column_name).
Finally, omit the y argument. Then all other columns will be plotted on a shared $y$-axis.
The same thing works for 'barh', 'bar', and 'hist', but not 'scatter'.
We will only use the 'mother' and 'child' columns.
mother_child = bpd.read_csv('data/galton.csv').get(['mother', 'child'])
mother_child
mother | child | |
---|---|---|
0 | 67.0 | 73.2 |
1 | 67.0 | 69.2 |
2 | 67.0 | 69.0 |
... | ... | ... |
931 | 66.0 | 61.0 |
932 | 63.0 | 66.5 |
933 | 63.0 | 57.0 |
934 rows × 2 columns
alpha controls how transparent the bars are (alpha=1 is opaque, alpha=0 is transparent).
height_bins = np.arange(55, 80, 2.5)
mother_child.plot(kind='hist', density=True, ec='w',
                  alpha=0.65, bins=height_bins);
Why do children seem so much taller than their mothers?
Try to answer these questions based on the overlaid histogram.
What proportion of children were between 70 and 75 inches tall?
What proportion of mothers were between 60 and 63 inches tall?
Question 1
The proportion is the sum of the areas of the bars for $[70, 72.5)$ and $[72.5, 75)$. We can confirm it directly from the data:
mother_child[(mother_child.get('child') >= 70) & (mother_child.get('child') < 75)].shape[0] / mother_child.shape[0]
Question 2
We can't tell. We could try breaking it up into the proportion of mothers in $[60, 62.5)$ and $[62.5, 63)$, but we don't know the latter. In the absence of any additional information, we can't infer anything about the distribution of values within a bin. For example, it could be that everyone in the interval $[62.5, 65)$ actually falls in the interval $[62.5, 63)$, or it could be that no one does!
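To illustrate why the histogram alone can't answer Question 2, the sketch below builds two made-up sets of mother heights that have identical counts in the bins $[60, 62.5)$ and $[62.5, 65)$ but different proportions between 60 and 63 inches. All values and function names are hypothetical.

```python
def bin_counts(values, edges):
    """Count how many values fall in each [left, right) bin."""
    return [sum(left <= v < right for v in values)
            for left, right in zip(edges, edges[1:])]

def proportion_between(values, low, high):
    """Proportion of values v with low <= v < high."""
    return sum(low <= v < high for v in values) / len(values)

edges = [60, 62.5, 65]  # two neighboring bins from the histogram

# Hypothetical heights: same bin counts, different spread within [62.5, 65).
mothers_a = [60.5, 61.0, 62.6, 62.8]   # everyone in [62.5, 65) is below 63
mothers_b = [60.5, 61.0, 64.0, 64.5]   # no one in [62.5, 65) is below 63

print(bin_counts(mothers_a, edges), bin_counts(mothers_b, edges))
print(proportion_between(mothers_a, 60, 63), proportion_between(mothers_b, 60, 63))
```

Both datasets would produce the same two bars, yet the answer to the question differs between them.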