In [1]:

```
# Run this cell to set up packages for lecture.
from lec16_imports import *
```

- Extra practice session is **tonight**. Check out the problem set on your own if you can't make it.
- Lab 4 is due **tomorrow at 11:59PM**.
- Homework 4 is due **Thursday at 11:59PM**.
- The Midterm Project was due **yesterday**, but you can still submit it late using slip days.
  - If working with a partner, this will detract from **both** partners' allocations.
- **Monday is a holiday**, so there is no lecture, no office hours, no discussion, no quiz. Enjoy your day off!

- Chebyshev's inequality.
- Standardization.
- The normal distribution.

Recall that the standard deviation (SD) of a numerical distribution is

$$\text{SD} = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - \text{mean})^2},$$

where $n$ is the number of observations.

It turns out, in **any** numerical distribution, the bulk of the data are in the range “mean ± a few SDs”.

Let's make this more precise.

**Fact**: In **any** numerical distribution, the proportion of values in the range “mean ± $z$ SDs” is at least $$1 - \frac{1}{z^2}$$ This result is known as **Chebyshev's inequality**.

| Range | Proportion |
| --- | --- |
| mean ± 2 SDs | at least $1 - \frac{1}{4}$ (75%) |
| mean ± 3 SDs | at least $1 - \frac{1}{9}$ (88.88..%) |
| mean ± 4 SDs | at least $1 - \frac{1}{16}$ (93.75%) |
| mean ± 5 SDs | at least $1 - \frac{1}{25}$ (96%) |
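The proportions in the table come directly from the expression $1 - \frac{1}{z^2}$; a quick check (an addition, not from the lecture):

```python
import numpy as np

# Chebyshev's bound: at least 1 - 1/z^2 of values lie within z SDs of the mean.
for z in [2, 3, 4, 5]:
    print(f'mean ± {z} SDs: at least {1 - 1 / z**2:.2%}')
```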

In [2]:

```
delays = bpd.read_csv('data/united_summer2015.csv')
delays.plot(kind='hist', y='Delay', bins=np.arange(-20.5, 210, 5), density=True, ec='w', figsize=(10, 5), title='Flight Delays')
plt.xlabel('Delay (minutes)');
```

In [3]:

```
delay_mean = delays.get('Delay').mean()
delay_mean
```

Out[3]:

16.658155515370705

In [4]:

```
delay_std = np.std(delays.get('Delay')) # There is no .std() method in babypandas!
delay_std
```

Out[4]:

39.480199851609314

Chebyshev's inequality tells us that

**At least** 75% of delays are in the following interval:

In [5]:

```
delay_mean - 2 * delay_std, delay_mean + 2 * delay_std
```

Out[5]:

(-62.30224418784792, 95.61855521858934)

**At least** 88.88% of delays are in the following interval:

In [6]:

```
delay_mean - 3 * delay_std, delay_mean + 3 * delay_std
```

Out[6]:

(-101.78244403945723, 135.09875507019865)

Let's visualize these intervals!

In [7]:

```
delays.plot(kind='hist', y='Delay', bins=np.arange(-20.5, 210, 5), density=True, alpha=0.65, ec='w', figsize=(10, 5), title='Flight Delays')
plt.axvline(delay_mean - 2 * delay_std, color='maroon', label='± 2 SD')
plt.axvline(delay_mean + 2 * delay_std, color='maroon')
plt.axvline(delay_mean + 3 * delay_std, color='blue', label='± 3 SD')
plt.axvline(delay_mean - 3 * delay_std, color='blue')
plt.axvline(delay_mean, color='green', label='Mean')
plt.scatter([delay_mean], [-0.0017], color='green', marker='^', s=250)
plt.ylim(-0.0038, 0.06)
plt.legend();
```

Remember, Chebyshev's inequality states that **at least** $1 - \frac{1}{z^2}$ of values are within $z$ SDs from the mean, for any numerical distribution.

For instance, it tells us that **at least** 75% of delays are in the following interval:

In [8]:

```
delay_mean - 2 * delay_std, delay_mean + 2 * delay_std
```

Out[8]:

(-62.30224418784792, 95.61855521858934)

However, in this case, a much larger fraction of delays are in that interval.

In [9]:

```
within_2_sds = delays[(delays.get('Delay') >= delay_mean - 2 * delay_std) &
                      (delays.get('Delay') <= delay_mean + 2 * delay_std)]
within_2_sds.shape[0] / delays.shape[0]
```

Out[9]:

0.9560940325497288
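Chebyshev's guarantee is a worst-case bound, so real data often beats it by a wide margin, as the 95.6% figure above shows. The same comparison can be sketched for several values of $z$ using plain NumPy, with a synthetic right-skewed array standing in for the `'Delay'` column (the exponential data here is an illustrative assumption, not the flight data):

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.exponential(scale=40, size=10_000)  # right-skewed, like flight delays

mean, sd = data.mean(), np.std(data)
for z in [2, 3, 4]:
    # Actual proportion of values within z SDs of the mean.
    within = np.mean((data >= mean - z * sd) & (data <= mean + z * sd))
    bound = 1 - 1 / z**2
    print(f'z={z}: Chebyshev guarantees at least {bound:.2%}; actual {within:.2%}')
```

The actual proportions always meet or exceed the bound, but by how much depends on the shape of the distribution.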

For a particular set of data points, Chebyshev's inequality states that at least $\frac{8}{9}$ of the data points are between $-20$ and $40$. What is the standard deviation of the data?
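One way to reason through this (a sketch of the solution, not part of the original notebook): $1 - \frac{1}{z^2} = \frac{8}{9}$ gives $z = 3$, and since the interval "mean ± 3 SDs" is centered at the mean, the mean must be the midpoint of $[-20, 40]$:

```python
# 1 - 1/z**2 = 8/9  =>  z**2 = 9  =>  z = 3.
low, high = -20, 40
mean = (low + high) / 2   # The interval is centered at the mean, so mean = 10.
sd = (high - mean) / 3    # 3 SDs span from the mean to either endpoint.
sd                        # 10.0
```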

We'll work with a data set containing the heights and weights of 5000 adult males.

In [10]:

```
height_and_weight = bpd.read_csv('data/height_and_weight.csv')
height_and_weight
```

Out[10]:

|  | Height | Weight |
| --- | --- | --- |
| 0 | 73.85 | 241.89 |
| 1 | 68.78 | 162.31 |
| 2 | 74.11 | 212.74 |
| ... | ... | ... |
| 4997 | 67.01 | 199.20 |
| 4998 | 71.56 | 185.91 |
| 4999 | 70.35 | 198.90 |

5000 rows × 2 columns

Let's look at the distributions of both numerical variables.

In [11]:

```
height_and_weight.plot(kind='hist', y='Height', density=True, ec='w', bins=30, alpha=0.8, figsize=(10, 5));
```

In [12]:

```
height_and_weight.plot(kind='hist', y='Weight', density=True, ec='w', bins=30, alpha=0.8, color='C1', figsize=(10, 5));
```

In [13]:

```
height_and_weight.plot(kind='hist', density=True, ec='w', bins=60, alpha=0.8, figsize=(10, 5));
```

**Observation**: The two distributions look like shifted and stretched versions of the same basic shape, called a **bell curve** 🔔. Distributions shaped like this are called **normal distributions**.

- There are many normal distributions, with different means and different standard deviations.

- All normal distributions are shaped like bell curves, but they vary in center and spread.

In [14]:

```
show_many_normal_distributions()
```

Suppose $x$ is a numerical variable, and $x_i$ is one value of that variable. Then, $$x_{i \: \text{(su)}} = \frac{x_i - \text{mean of $x$}}{\text{SD of $x$}}$$

represents $x_i$ in **standard units** – the number of standard deviations $x_i$ is above the mean.

**Example**: Suppose someone weighs 225 pounds. What is their weight in standard units?

In [15]:

```
weights = height_and_weight.get('Weight')
(225 - weights.mean()) / np.std(weights)
```

Out[15]:

1.9201699181580782

- Interpretation: 225 is 1.92 standard deviations above the mean weight.
- 225 becomes 1.92 in **standard units**.

The process of converting all values of a variable (i.e. a column) to standard units is known as standardization, and the resulting values are considered to be **standardized**.

In [16]:

```
def standard_units(col):
    # Convert each value in col to standard units: SDs above the mean.
    return (col - col.mean()) / np.std(col)
```

In [17]:

```
standardized_height = standard_units(height_and_weight.get('Height'))
standardized_height
```

Out[17]:

```
0       1.68
1      -0.09
2       1.78
        ...
4997   -0.70
4998    0.88
4999    0.46
Name: Height, Length: 5000, dtype: float64
```

In [18]:

```
standardized_weight = standard_units(height_and_weight.get('Weight'))
standardized_weight
```

Out[18]:

```
0       2.77
1      -1.25
2       1.30
        ...
4997    0.62
4998   -0.06
4999    0.60
Name: Weight, Length: 5000, dtype: float64
```

Standardized variables have:

- A mean of 0.
- An SD of 1.

We often standardize variables to bring them to the same scale.

In [19]:

```
# e-15 means 10^(-15), which is a very small number, effectively zero.
standardized_height.describe()
```

Out[19]:

```
count    5.00e+03
mean     1.49e-15
std      1.00e+00
           ...
50%      4.76e-04
75%      6.85e-01
max      3.48e+00
Name: Height, Length: 8, dtype: float64
```

In [20]:

```
standardized_weight.describe()
```

Out[20]:

```
count    5.00e+03
mean     5.98e-16
std      1.00e+00
           ...
50%      6.53e-04
75%      6.74e-01
max      4.19e+00
Name: Weight, Length: 8, dtype: float64
```

Let's look at how the process of standardization works visually.

In [21]:

```
HTML('data/height_anim.html')
```

Out[21]:

In [22]:

```
HTML('data/weight_anim.html')
```

Out[22]:

Now that we've standardized the distributions of height and weight, let's see how they look on the same set of axes.

In [23]:

```
standardized_height_and_weight = bpd.DataFrame().assign(
    Height=standardized_height,
    Weight=standardized_weight
)
standardized_height_and_weight.plot(kind='hist', density=True, ec='w', bins=30, alpha=0.8, figsize=(10, 5));
```

These both look pretty similar!

- The distributions we've seen look essentially the same once standardized.
- This distribution is called the **standard normal distribution**. It is defined by its mean of 0 and its standard deviation of 1. The shape of such a distribution is called the **standard normal curve**.

- You don't need to know the formula – just the shape!
- We'll just use the formula today to make plots.

In [24]:

```
def normal_curve(z):
    return 1 / np.sqrt(2 * np.pi) * np.exp((-z**2) / 2)
x = np.linspace(-4, 4, 1000)
y = normal_curve(x)
plt.figure(figsize=(10, 5))
plt.plot(x, y, color='black');
plt.xlabel('$z$');
plt.title(r'$\phi(z) = \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2}z^2}$');
```