In [1]:

```
# Run this cell to set up packages for lecture.
from lec07_imports import *
```

- Quiz 1 scores are released!
- Remember, quizzes are a place where it's okay to fail and learn from your mistakes. Your lowest two quiz scores are dropped!

- Homework 1 is due **tomorrow at 11:59PM**.
- Lab 2 is due **Tuesday at 11:59PM**.
- The class is really picking up with quizzes, labs, and homeworks - start assignments early to not fall behind! 🏃♀️🏃

- Distributions.
- Density histograms.
- Overlaid plots.

Today's material is quite theoretical – you can practice with it in Friday's extra practice session!

The type of visualization we create depends on the kinds of variables we're visualizing.

**Scatter plot**: Numerical vs. numerical.
- Example: Weight vs. height.

**Line plot**: Sequential numerical (time) vs. numerical.
- Example: Height vs. time.

**Bar chart**: Categorical vs. numerical.
- Example: Heights of different family members.

**Histogram**: Distribution of numerical.

We may interchange the words "plot", "chart", and "graph"; they all mean the same thing.

- The distribution of a variable consists of all values of the variable that occur in the data, along with their frequencies.
- Distributions help you understand: *How often does a variable take on a certain value?*
- Both categorical and numerical variables have distributions.
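To make the definition concrete, here is a minimal sketch in plain Python that tallies a small, made-up list of planet types and converts the counts into proportions. The list `types` is hypothetical and is not the `exo` data.

```python
from collections import Counter

# Hypothetical sample of a categorical variable (not the real exo data).
types = ['Gas Giant', 'Terrestrial', 'Gas Giant', 'Super Earth', 'Gas Giant']

# The distribution: each value that occurs, along with its frequency.
counts = Counter(types)

# Dividing by the number of observations turns counts into proportions.
proportions = {value: count / len(types) for value, count in counts.items()}

print(counts)
print(proportions)
```

The proportions always add up to 1, which is the same idea that density histograms rely on for numerical variables.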

The distribution of a categorical variable can be displayed as a table or bar chart, among other ways!

For example, let's look at the distribution of exoplanet `'Type'`s. To do so, we'll need to group.

In [2]:

```
exo = bpd.read_csv('data/exoplanets.csv').set_index('Name')
exo
```

Out[2]:

Name | Distance | Magnitude | Type | Year | Detection | Mass | Radius |
---|---|---|---|---|---|---|---|
11 Comae Berenices b | 304.0 | 4.72 | Gas Giant | 2007 | Radial Velocity | 6165.90 | 11.88 |
11 Ursae Minoris b | 409.0 | 5.01 | Gas Giant | 2009 | Radial Velocity | 4684.81 | 11.99 |
14 Andromedae b | 246.0 | 5.23 | Gas Giant | 2008 | Radial Velocity | 1525.58 | 12.65 |
... | ... | ... | ... | ... | ... | ... | ... |
YZ Ceti b | 12.0 | 12.07 | Terrestrial | 2017 | Radial Velocity | 0.70 | 0.91 |
YZ Ceti c | 12.0 | 12.07 | Super Earth | 2017 | Radial Velocity | 1.14 | 1.05 |
YZ Ceti d | 12.0 | 12.07 | Super Earth | 2017 | Radial Velocity | 1.09 | 1.03 |

5043 rows × 7 columns

In [3]:

```
# Remember, when we group and use .count(), the column names aren't meaningful.
type_counts = exo.groupby('Type').count()
type_counts
```

Out[3]:

Type | Distance | Magnitude | Year | Detection | Mass | Radius |
---|---|---|---|---|---|---|
Gas Giant | 1480 | 1480 | 1480 | 1480 | 1480 | 1480 |
Neptune-like | 1793 | 1793 | 1793 | 1793 | 1793 | 1793 |
Super Earth | 1577 | 1577 | 1577 | 1577 | 1577 | 1577 |
Terrestrial | 193 | 193 | 193 | 193 | 193 | 193 |

In [4]:

```
# As a result, we could have set y='Magnitude', for example, and gotten the same plot.
type_counts.plot(kind='barh', y='Distance',
                 legend=False, xlabel='Count', title='Distribution of Exoplanet Types');
```

The call above uses the optional `title` argument. Some other useful optional arguments are `legend`, `figsize`, `xlabel`, and `ylabel`. There are many optional arguments.

The cell below computes the average `'Radius'` for each `'Type'`.

In [5]:

```
exo.groupby('Type').mean().get('Radius')
```

Out[5]:

    Type
    Gas Giant       12.74
    Neptune-like     3.11
    Super Earth      1.58
    Terrestrial      0.85
    Name: Radius, dtype: float64

Let's look into them further!

In [6]:

```
terr = exo[exo.get('Type') == 'Terrestrial']
terr
```

Out[6]:

Name | Distance | Magnitude | Type | Year | Detection | Mass | Radius |
---|---|---|---|---|---|---|---|
EPIC 201497682 b | 825.0 | 13.95 | Terrestrial | 2019 | Transit | 0.26 | 0.69 |
EPIC 201757695.02 | 1884.0 | 14.97 | Terrestrial | 2020 | Transit | 0.69 | 0.91 |
EPIC 201833600 c | 840.0 | 14.71 | Terrestrial | 2019 | Transit | 0.97 | 1.00 |
... | ... | ... | ... | ... | ... | ... | ... |
TRAPPIST-1 e | 41.0 | 17.02 | Terrestrial | 2017 | Transit | 0.69 | 0.92 |
TRAPPIST-1 h | 41.0 | 17.02 | Terrestrial | 2017 | Transit | 0.33 | 0.76 |
YZ Ceti b | 12.0 | 12.07 | Terrestrial | 2017 | Radial Velocity | 0.70 | 0.91 |

193 rows × 7 columns

Consider the `'Radius'` column of `terr`. To learn more about it, we can use the `.describe()` method.

In [7]:

```
terr.get('Radius').describe()
```

Out[7]:

    count    193.00
    mean       0.85
    std        0.26
              ...
    50%        0.86
    75%        0.92
    max        3.13
    Name: Radius, Length: 8, dtype: float64

But how do we visualize its distribution?

The distribution of `'Radius'`, a numerical variable

- A few slides ago, we looked at the distribution of `'Type'`, which is a categorical variable.
- Now, we'll look at the distribution of `'Radius'`, which is a numerical variable.
- As we'll see, **a bar chart is not the right choice of visualization for the distribution of a numerical variable**.

To draw a bar chart of the distribution of `'Radius'`, we need to group by that column and count how many terrestrial planets there are of each radius.

In [8]:

```
terr_radius = terr.groupby('Radius').count()
terr_radius = (terr_radius
               .assign(Count=terr_radius.get('Distance'))
               .get(['Count'])
              )
terr_radius
```

Out[8]:

Radius | Count |
---|---|
0.37 | 1 |
0.40 | 1 |
0.47 | 1 |
... | ... |
1.80 | 1 |
2.85 | 1 |
3.13 | 1 |

85 rows × 1 columns

In [9]:

```
terr_radius.plot(kind='bar', y='Count', figsize=(15, 5));
```

The horizontal axis should be numerical (like a number line), not categorical. There should be more space between certain bars than others.

For instance, the planet with `'Radius'` 1.8 is 80% larger than the planet with `'Radius'` 1, but they appear to be about the same size here.

Instead of a bar chart, we'll visualize the distribution of a numerical variable with a **density histogram**. Let's see what a density histogram for `'Radius'` looks like. What do you notice about this visualization?

In [10]:

```
# Ignore the code for right now.
terr.plot(kind='hist', y='Radius', density=True, bins=np.arange(0, 3.5, 0.25), ec='w');
```

In [11]:

```
# There are 7 terrestrial exoplanets with a radius of exactly 1.0,
# but the height of the bar starting at 1.0 is not 7!
terr[terr.get('Radius') == 1]
```

Out[11]:

Name | Distance | Magnitude | Type | Year | Detection | Mass | Radius |
---|---|---|---|---|---|---|---|
EPIC 201833600 c | 840.0 | 14.71 | Terrestrial | 2019 | Transit | 0.97 | 1.0 |
EPIC 206215704 b | 358.0 | 17.83 | Terrestrial | 2019 | Transit | 0.97 | 1.0 |
K2-157 b | 973.0 | 12.94 | Terrestrial | 2018 | Transit | 0.97 | 1.0 |
K2-239 c | 101.0 | 14.63 | Terrestrial | 2018 | Transit | 0.97 | 1.0 |
Kepler-1417 b | 3235.0 | 14.04 | Terrestrial | 2016 | Transit | 0.97 | 1.0 |
Kepler-1464 c | 3757.0 | 14.36 | Terrestrial | 2016 | Transit | 0.97 | 1.0 |
Kepler-392 b | 2223.0 | 13.53 | Terrestrial | 2014 | Transit | 0.97 | 1.0 |

- Binning is the act of counting the number of numerical values that fall within ranges defined by two endpoints. These ranges are called “bins”.
- A value falls in a bin if it is **greater than or equal to the left** endpoint and **less than the right** endpoint.
    - [a, b): a is included, b is not.

- The width of a bin is its right endpoint minus its left endpoint.
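The binning rule can be sketched in a few lines of plain Python. The helper `count_in_bins` and the list `radii` are hypothetical names made up for this illustration. The sketch applies the [a, b) rule to every bin, so the value 1.0, which equals the final right endpoint, is not counted.

```python
def count_in_bins(values, edges):
    """Count how many values fall in each bin [edges[i], edges[i + 1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(edges) - 1):
            # Left endpoint included, right endpoint excluded.
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts

# Hypothetical radii, chosen so that some sit exactly on a bin endpoint.
radii = [0.4, 0.5, 0.74, 0.75, 1.0]
edges = [0.25, 0.5, 0.75, 1.0]

counts = count_in_bins(radii, edges)
# Width of each bin: right endpoint minus left endpoint.
widths = [edges[i + 1] - edges[i] for i in range(len(edges) - 1)]

print(counts)
print(widths)
```

Note how 0.5 and 0.75 are counted in the bins that start at those values, not the bins that end there.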

In [12]:

```
binning_animation()
```

- **Density histograms** (not bar charts!) visualize the distribution of a single numerical variable by placing numbers into bins.
- To create one from a DataFrame `df`, use `df.plot(kind='hist', y=column_name, density=True)`.
- Optional but recommended: Use `ec='w'` to see where bins start and end more clearly.

- By default, Python will bin your data into 10 equally sized bins.
- You can specify another number of equally sized bins by setting the optional argument `bins` equal to some other integer value.
- You can also specify custom bin start and endpoints by setting `bins` equal to a list or array of bin endpoints.
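These options can be explored without drawing anything by using NumPy's `np.histogram`, which accepts the same kinds of `bins` values (an integer or a sequence of endpoints) and also defaults to 10 bins. This is only a sketch on made-up numbers; the array `radii` is hypothetical.

```python
import numpy as np

# Hypothetical radii (not the real terr data).
radii = np.array([0.4, 0.6, 0.7, 0.9, 0.9, 1.0, 1.1, 1.8, 2.9, 3.1])

# Default: 10 equally sized bins spanning the smallest to the largest value.
counts_default, edges_default = np.histogram(radii)

# An integer asks for that many equally sized bins.
counts_20, edges_20 = np.histogram(radii, bins=20)

# A list gives the bin endpoints explicitly.
counts_custom, edges_custom = np.histogram(radii, bins=[0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5])

print(len(counts_default), len(counts_20), len(counts_custom))
print(counts_custom)
```

A list of k endpoints always produces k - 1 bins, which is why the eight endpoints above give seven counts.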

In [13]:

```
# There are 10 bins by default, some of which are empty.
terr.plot(kind='hist', y='Radius', density=True, ec='w');
```

In [14]:

```
terr.plot(kind='hist', y='Radius', density=True, bins=20, ec='w');
```

In [15]:

```
terr.plot(kind='hist', y='Radius', density=True, bins=[0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5], ec='w');
```

In the three histograms above, what is different and what is the same?

- The general shape of all three histograms is the same, regardless of the bins.
- More bins give a finer, more granular picture of the distribution of the variable `'Radius'`.
- The $y$-axis values seem to change a lot when we change the bins. Hang onto that thought; we'll see why shortly.

- In a histogram, only the last bin is inclusive of the right endpoint!
- The bins you specify don't have to include all data values; data values not in any bin won't be shown in the histogram.
- For equally sized bins, use `np.arange`.
    - Be **very careful** with the endpoints.
    - For example, `bins=np.arange(4)` creates the bins [0, 1), [1, 2), [2, 3].
- Bins can have different sizes!

In [16]:

```
terr.plot(kind='hist', y='Radius', density=True,
          bins=np.arange(0, 3.5, 0.5),
          ec='w');
```

In [17]:

```
terr.sort_values('Radius', ascending=False)
```

Out[17]:

Name | Distance | Magnitude | Type | Year | Detection | Mass | Radius |
---|---|---|---|---|---|---|---|
Kepler-33 c | 3944.0 | 14.10 | Terrestrial | 2011 | Transit | 0.39 | 3.13 |
K2-138 f | 661.0 | 12.25 | Terrestrial | 2017 | Transit | 1.63 | 2.85 |
Kepler-11 b | 2108.0 | 13.82 | Terrestrial | 2010 | Transit | 1.90 | 1.80 |
... | ... | ... | ... | ... | ... | ... | ... |
Kepler-102 b | 352.0 | 12.07 | Terrestrial | 2014 | Transit | 4.30 | 0.47 |
Kepler-444 b | 119.0 | 8.87 | Terrestrial | 2015 | Transit | 0.04 | 0.40 |
Kepler-37 e | 209.0 | 9.77 | Terrestrial | 2014 | Transit Timing Variations | 0.03 | 0.37 |

193 rows × 7 columns

The bins in the histogram above stop at 3.0, so it leaves out Kepler-33 c, which has a `'Radius'` of 3.13.
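Because `np.arange` excludes its stop value, `np.arange(0, 3.5, 0.5)` ends at 3.0, so a radius of 3.13 falls outside every bin. A quick check of the endpoints, as a sketch (the variable names here are made up for illustration):

```python
import numpy as np

# np.arange excludes its stop value, so 4 itself is not an endpoint.
edges_small = np.arange(4)
print(edges_small)        # four endpoints, so three bins

# The last endpoint here is 3.0, so a value of 3.13 lands in no bin.
edges_half = np.arange(0, 3.5, 0.5)
print(edges_half[-1])

# Raising the stop value adds one more endpoint, 3.5, which covers 3.13.
edges_fixed = np.arange(0, 4, 0.5)
print(edges_fixed[-1])
```

Checking the last element of the array before plotting is a cheap way to confirm that the largest data value is covered.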

In [18]:

```
terr.plot(kind='hist', y='Radius', density=True,
          bins=[0, 0.25, 0.5, 0.75, 2, 4], ec='w');
```

In the above example, the bins have different widths!

- In a density histogram, the $y$-axis can be hard to interpret, but it's designed to give the histogram a very nice property: **The bars of a density histogram have a combined total area of 1.**
- Important: **The area of a bar is equal to the proportion of all data points that fall into that bin**.

- Recall from the pretest that proportions and percentages represent the same thing.
- A proportion is a decimal between 0 and 1; a percentage is between 0% and 100%.
- The proportion 0.34 means 34%.
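The area property can be checked numerically. The sketch below uses `np.histogram` with `density=True` on a made-up array (`radii` is hypothetical, not the `terr` data) and bins of unequal width, then multiplies each bar's height by its width.

```python
import numpy as np

# Hypothetical radii (not the real terr data).
radii = np.array([0.3, 0.4, 0.6, 0.6, 0.7, 0.9, 1.0, 1.2, 1.8, 3.1])
edges = np.array([0, 0.25, 0.5, 0.75, 2, 4])

# density=True returns bar heights rather than counts.
heights, _ = np.histogram(radii, bins=edges, density=True)
widths = np.diff(edges)

# Area of each bar = height * width = proportion of values in that bin.
areas = heights * widths
print(areas)
print(areas.sum())
```

Each entry of `areas` is the proportion of the ten values in that bin, and the entries sum to 1 even though the bins have different widths.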

In [19]:

```
terr.plot(kind='hist', y='Radius', density=True,
          bins=[0, 0.25, 0.5, 0.75, 2, 4], ec='w');
```

What proportion of terrestrial exoplanets have a `'Radius'` between 0.5 and 0.75?

- The height of the [0.5, 0.75) bar looks to be around 0.8.
- The width of the bin is 0.75 - 0.5 = 0.25.
- Therefore, using the formula for the area of a rectangle,

$$\text{Area} = \text{Height} \times \text{Width} = 0.8 \times 0.25 = 0.2$$

- Since areas represent proportions, this means that the proportion of terrestrial exoplanets with a radius between 0.5 and 0.75 is about 0.2 (or 20%).

In [20]:

```
in_range = terr[(terr.get('Radius') >= 0.5) & (terr.get('Radius') < 0.75)].shape[0]
in_range
```

Out[20]:

39

In [21]:

```
in_range / terr.shape[0]
```

Out[21]:

0.20207253886010362

This matches the result we got. (Not exactly, since we made an estimate for the height.)

Since a bar of a histogram is a rectangle, its area is given by

$$\text{Area} = \text{Height} \times \text{Width}$$

That means

$$\text{Height} = \frac{\text{Area}}{\text{Width}} = \frac{\text{Proportion (or Percentage)}}{\text{Width}}$$

The height of a bar therefore measures *density*, which is why we call it a density histogram.
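Applying the height formula to the [0.5, 0.75) bin, with the count (39) and total (193) computed in the cells above, gives the bar's height directly. This is a small arithmetic sketch; the variable names `total`, `width`, `proportion`, and `height` are made up for illustration.

```python
# Values taken from the cells above: 39 of the 193 terrestrial
# exoplanets have a radius in [0.5, 0.75).
in_range = 39
total = 193
width = 0.75 - 0.5

proportion = in_range / total   # area of the bar
height = proportion / width     # density: proportion per unit of radius

print(round(proportion, 3), round(height, 3))
```

The computed height is close to the 0.8 estimated by eye from the plot.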

In [22]:

```
terr.plot(kind='hist', y='Radius', density=True,
          bins=[0, 0.25, 0.5, 0.75, 2, 4], ec='w');
```

The $y$-axis units here are "proportion per radius", since the $x$-axis represents radius.

- Unfortunately, the $y$-axis label on the histogram always displays as "Frequency". **This is wrong!**
    - We can fix this with the optional argument `ylabel`, but we usually don't.

Suppose we created a density histogram of people's shoe sizes. 👟 Below are the bins we chose along with their heights.

Bin | Height of Bar |
---|---|
[3, 7) | 0.05 |
[7, 10) | 0.1 |
[10, 12) | 0.15 |
[12, 16] | $X$ |

What should the value of $X$ be so that this is a valid histogram?

A. 0.02 B. 0.05 C. 0.2 D. 0.5 E. 0.7

Bar chart | Histogram |
---|---|
Shows the distribution of a categorical variable | Shows the distribution of a numerical variable |
Plotted from 2 columns of a DataFrame | Plotted from 1 column of a DataFrame |
1 categorical axis, 1 numerical axis | 2 numerical axes |
Bars have arbitrary, but equal, widths and spacing | Horizontal axis is numerical and to scale |
Lengths of bars are proportional to the numerical quantity of interest | Height measures density; areas are proportional to the proportion (percent) of individuals |

In this class, **"histogram" will always mean a "density histogram".** We will **only** use density histograms.

*Note:* It's possible to create what's called a *frequency histogram* where the $y$-axis simply represents a count of the number of values in each bin.

While easier to interpret, frequency histograms don't have the important property that the total area is 1, so they can't be connected to probability in the same way that density histograms can. This property will be useful to us later on in the course.
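The difference between the two kinds of histogram can be seen by comparing total areas. This sketch uses `np.histogram` on a made-up array (`values` is hypothetical) with four bins of width 2.

```python
import numpy as np

values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 7])  # hypothetical data
edges = np.array([0, 2, 4, 6, 8])
widths = np.diff(edges)

# Frequency histogram: bar heights are counts.
counts, _ = np.histogram(values, bins=edges)
# Density histogram: bar heights are proportions per unit of width.
heights, _ = np.histogram(values, bins=edges, density=True)

print((counts * widths).sum())    # total area of the frequency histogram
print((heights * widths).sum())   # total area of the density histogram
```

The frequency histogram's total area depends on the number of values and the bin widths, while the density histogram's total area is 1 regardless.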

Let's look back at the `exo` DataFrame.

In [23]:

```
exo
```

Out[23]:

Name | Distance | Magnitude | Type | Year | Detection | Mass | Radius |
---|---|---|---|---|---|---|---|
11 Comae Berenices b | 304.0 | 4.72 | Gas Giant | 2007 | Radial Velocity | 6165.90 | 11.88 |
11 Ursae Minoris b | 409.0 | 5.01 | Gas Giant | 2009 | Radial Velocity | 4684.81 | 11.99 |
14 Andromedae b | 246.0 | 5.23 | Gas Giant | 2008 | Radial Velocity | 1525.58 | 12.65 |
... | ... | ... | ... | ... | ... | ... | ... |
YZ Ceti b | 12.0 | 12.07 | Terrestrial | 2017 | Radial Velocity | 0.70 | 0.91 |
YZ Ceti c | 12.0 | 12.07 | Super Earth | 2017 | Radial Velocity | 1.14 | 1.05 |
YZ Ceti d | 12.0 | 12.07 | Super Earth | 2017 | Radial Velocity | 1.09 | 1.03 |

5043 rows × 7 columns

How can we visualize the average `'Magnitude'` and the average `'Radius'` for each `'Type'` at the same time?

In [24]:

```
types = exo.groupby('Type').mean()
types
```

Out[24]:

Type | Distance | Magnitude | Year | Mass | Radius |
---|---|---|---|---|---|
Gas Giant | 1096.40 | 10.30 | 2013.73 | 1472.39 | 12.74 |
Neptune-like | 2189.02 | 13.52 | 2016.59 | 15.28 | 3.11 |
Super Earth | 1916.26 | 13.85 | 2016.43 | 5.81 | 1.58 |
Terrestrial | 1373.60 | 13.45 | 2016.37 | 1.62 | 0.85 |

In [25]:

```
types.get(['Magnitude', 'Radius']).plot(kind='barh');
```

How did we do that?

When calling `.plot`, if we omit the `y=column_name` argument, **all other columns** are plotted.

In [26]:

```
types
```

Out[26]:

Type | Distance | Magnitude | Year | Mass | Radius |
---|---|---|---|---|---|
Gas Giant | 1096.40 | 10.30 | 2013.73 | 1472.39 | 12.74 |
Neptune-like | 2189.02 | 13.52 | 2016.59 | 15.28 | 3.11 |
Super Earth | 1916.26 | 13.85 | 2016.43 | 5.81 | 1.58 |
Terrestrial | 1373.60 | 13.45 | 2016.37 | 1.62 | 0.85 |

In [27]:

```
types.plot(kind='barh');
```

- To select multiple columns, use `.get([column_1, ..., column_k])`.
- Passing a list of column labels to `.get` returns a DataFrame.
    - `.get([column_name])` will return a DataFrame with just one column!
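The difference between passing a string and passing a list can be checked directly. The sketch below uses plain `pandas` rather than `babypandas`, on the assumption that `.get` behaves the same way for this purpose; `types_small` is a hypothetical two-row stand-in built from values in the `types` output.

```python
import pandas as pd

# A small stand-in for the `types` DataFrame (two rows, two columns).
types_small = pd.DataFrame(
    {'Magnitude': [10.30, 13.52], 'Radius': [12.74, 3.11]},
    index=pd.Index(['Gas Giant', 'Neptune-like'], name='Type'),
)

as_series = types_small.get('Radius')                # string -> Series
as_frame = types_small.get(['Radius'])               # one-label list -> DataFrame
two_cols = types_small.get(['Magnitude', 'Radius'])  # list -> DataFrame

print(type(as_series).__name__, type(as_frame).__name__, two_cols.shape)
```

The square brackets are what make the result a DataFrame, even when the list holds a single label.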

In [28]:

```
types
```

Out[28]:

Type | Distance | Magnitude | Year | Mass | Radius |
---|---|---|---|---|---|
Gas Giant | 1096.40 | 10.30 | 2013.73 | 1472.39 | 12.74 |
Neptune-like | 2189.02 | 13.52 | 2016.59 | 15.28 | 3.11 |
Super Earth | 1916.26 | 13.85 | 2016.43 | 5.81 | 1.58 |
Terrestrial | 1373.60 | 13.45 | 2016.37 | 1.62 | 0.85 |

In [29]:

```
types.get(['Magnitude', 'Radius'])
```

Out[29]:

Type | Magnitude | Radius |
---|---|---|
Gas Giant | 10.30 | 12.74 |
Neptune-like | 13.52 | 3.11 |
Super Earth | 13.85 | 1.58 |
Terrestrial | 13.45 | 0.85 |

In [30]:

```
types.get(['Magnitude', 'Radius']).plot(kind='barh');
```

Recipe:

- `.get` only the columns that contain information relevant to your plot (or, equivalently, `.drop` all extraneous columns).
- Specify the column for the $x$-axis (if not the index) in `.plot(x=column_name)`.
- Omit the `y` argument. Then **all** other columns will be plotted on a shared $y$-axis.

The same thing works for `'barh'`, `'bar'`, and `'hist'`, but not `'scatter'`.

- The data below was collected in the late 1800s by Francis Galton.
- He was a eugenicist and proponent of scientific racism, which is why he collected this data.
- Today, we understand that eugenics is immoral, and that there is no scientific evidence or any other justification for racism.

- We will revisit this dataset later on in the course.
- For now, we'll only need the `'mother'` and `'child'` columns.

In [31]:

```
mother_child = bpd.read_csv('data/galton.csv').get(['mother', 'child'])
mother_child
```

Out[31]:

| | mother | child |
|---|---|---|
| 0 | 67.0 | 73.2 |
| 1 | 67.0 | 69.2 |
| 2 | 67.0 | 69.0 |
| ... | ... | ... |
| 931 | 66.0 | 61.0 |
| 932 | 63.0 | 66.5 |
| 933 | 63.0 | 57.0 |

934 rows × 2 columns

`alpha` controls how transparent the bars are (`alpha=1` is opaque, `alpha=0` is transparent).

In [32]:

```
height_bins = np.arange(55, 80, 2.5)
mother_child.plot(kind='hist', density=True, ec='w',
                  alpha=0.65, bins=height_bins);
```

Why do children seem so much taller than their mothers?

Try to answer these questions based on the overlaid histogram.

What proportion of children were between 70 and 75 inches tall?

What proportion of mothers were between 60 and 63 inches tall?

`mother_child[(mother_child.get('child') >= 70) & (mother_child.get('child') < 75)].shape[0] / mother_child.shape[0]`
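One way to see which of these questions the overlaid histogram can answer exactly is to check whether both ends of the range are bin endpoints. This is a sketch that rebuilds `height_bins` from the plotting cell; the loop and the variable `answerable` are made up for illustration.

```python
import numpy as np

height_bins = np.arange(55, 80, 2.5)
print(height_bins)

# A range can be read exactly off the histogram only if both of its
# ends are bin endpoints.
answerable = {}
for low, high in [(70, 75), (60, 63)]:
    answerable[(low, high)] = bool(low in height_bins and high in height_bins)

print(answerable)
```

For a range whose ends are not both bin endpoints, the histogram alone gives at best an estimate, and a query on the DataFrame is needed for an exact proportion.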

- Histograms (not bar charts!) are used to display the distribution of a numerical variable.
- We will always use density histograms in this course.
- In a density histogram, the area of a bar represents the proportion (percentage) of values within its bin.
- The total area of all bars is 1 (100%).

- We can overlay multiple line plots, bar charts, and histograms on top of one another to look at multiple relationships or distributions.

- Writing our own functions.
- Applying functions to the data in a DataFrame.