Lecture 8 – Histograms and Overlaid Plots

DSC 10, Fall 2022

Announcements

Agenda

Review: types of visualizations

The type of visualization we create depends on the kinds of variables we're visualizing.

Note: We may interchange the words "plot", "chart", and "graph"; they all mean the same thing.

Some bad visualizations

[Figures: three examples of bad visualizations]

Distributions

What is the distribution of a variable?

Categorical variables

The distribution of a categorical variable can be displayed as a table or bar chart, among other ways! For example, let's look at the colleges of students enrolled in DSC 10 this quarter.

Numerical variables

The distribution of a numerical variable cannot always be accurately depicted with a bar chart. For example, let's look at the number of streams for each of the top 200 songs on Spotify. 🎵

To see the distribution of the number of streams, we need to group by the 'million_streams' column.
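To see why grouping struggles with a numerical column, here is a minimal sketch using made-up stream counts (the values below are hypothetical, not the actual Spotify data). Counting how often each exact value occurs plays the role of grouping, and it produces almost as many groups as there are songs.

```python
from collections import Counter

# Hypothetical stream counts in millions; NOT the real Spotify data.
million_streams = [1.52, 0.98, 1.13, 2.47, 0.98, 1.71, 3.05, 1.26, 0.87, 1.94]

# Counting exact values is the plain-Python analogue of grouping by the column.
value_counts = Counter(million_streams)

num_songs = len(million_streams)
num_groups = len(value_counts)
print(num_songs, num_groups)
```

With nearly one group per song, a bar chart of these counts would say very little about the overall shape of the distribution.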

Density Histograms

Density histograms show the distribution of numerical variables

Instead of a bar chart, we'll visualize the distribution of a numerical variable with a density histogram. Let's see what a density histogram for 'million_streams' looks like. What do you notice about this visualization?

First key idea behind histograms: binning 🗑️

Plotting a density histogram

Customizing the bins

In the three histograms above, what is different and what is the same?

Observations

Bin details

The outlier ("Unholy") is not included in the histogram because its value lies beyond the rightmost bin, [6, 7].
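The binning idea can be sketched in plain Python. This is an illustration, not the plotting library's actual implementation: the helper `count_in_bins` and the data are hypothetical. It follows the bin notation used in these notes, where each bin includes its left endpoint and excludes its right endpoint, except the last bin, which includes both.

```python
def count_in_bins(values, bin_edges):
    """Count values per bin: [left, right) for every bin except the last, which is [left, right]."""
    counts = [0] * (len(bin_edges) - 1)
    last = len(counts) - 1
    for v in values:
        for i in range(len(counts)):
            left, right = bin_edges[i], bin_edges[i + 1]
            in_bin = left <= v < right or (i == last and v == right)
            if in_bin:
                counts[i] += 1
                break
        # A value that fits no bin is simply left out of the counts.
    return counts

# Hypothetical stream counts in millions; 9.5 plays the role of an outlier.
streams = [0.5, 1.0, 1.2, 2.9, 3.0, 6.4, 7.0, 9.5]
edges = [0, 1, 2, 3, 4, 5, 6, 7]
counts = count_in_bins(streams, edges)
print(counts)  # [1, 2, 1, 1, 0, 0, 2]
```

Only 7 of the 8 values are counted: 9.5 lies to the right of the last bin, so it does not appear in any bar.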

Second key idea behind histograms: total area is 1

How to calculate heights in a density histogram

Since a bar of a histogram is a rectangle, its area is given by

$$\text{Area} = \text{Height} \times \text{Width}$$

That means

$$\text{Height} = \frac{\text{Area}}{\text{Width}} = \frac{\text{Proportion (or Percentage)}}{\text{Width}}$$

This implies that the units for height are "proportion per ($x$-axis unit)". The $y$-axis represents a sort of density, which is why we call it a density histogram.
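The height formula can be checked numerically. The sketch below uses hypothetical bin counts and a hypothetical helper `density_heights`; it assumes proportions are taken among the values that land in some bin.

```python
def density_heights(counts, bin_edges):
    """Return height = proportion / width for each bin."""
    total = sum(counts)
    heights = []
    for i, count in enumerate(counts):
        width = bin_edges[i + 1] - bin_edges[i]
        proportion = count / total
        heights.append(proportion / width)
    return heights

# Hypothetical counts for the bins [0, 1), [1, 2), [2, 4].
counts = [10, 6, 4]
edges = [0, 1, 2, 4]
heights = density_heights(counts, edges)

# Area of each bar is height * width; the areas should sum to 1.
total_area = sum(h * (edges[i + 1] - edges[i]) for i, h in enumerate(heights))
print(heights)
print(total_area)
```

Note that the widest bin holds 20% of the values but has the shortest bar, because its proportion is spread over a width of 2.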

Example calculation

$$\begin{aligned}\text{Area} &= \text{Height} \times \text{Width} \\ &= 0.25 \text{ per million streams} \times 0.5 \text{ million streams} \\ &= 0.125 \end{aligned}$$

Check the math

This roughly matches the result we got. It isn't exact, since we made a rough guess for the height.

Concept Check ✅ – Answer at cc.dsc10.com

Suppose we created a density histogram of people's shoe sizes. 👟 Below are the bins we chose along with their heights.

| Bin | Height of Bar |
| --- | --- |
| [3, 7) | 0.05 |
| [7, 10) | 0.1 |
| [10, 12) | 0.15 |
| [12, 16] | $X$ |

What should the value of $X$ be so that this is a valid histogram?

A. 0.02    B. 0.05    C. 0.2    D. 0.5    E. 0.7

Bar charts vs. histograms

| Bar Chart | Histogram |
| --- | --- |
| Shows the distribution of a categorical variable | Shows the distribution of a numerical variable |
| 1 categorical axis, 1 numerical axis | 2 numerical axes |
| Bars have arbitrary, but equal, widths and spacing | Horizontal axis is numerical and to scale |
| Lengths of bars are proportional to the numerical quantity of interest | Height measures density; areas are proportional to the proportion (percent) of individuals |

🌟 Important 🌟

In this class, "histogram" will always mean a "density histogram". We will only use density histograms.

Note: It's possible to create what's called a frequency histogram where the $y$-axis simply represents a count of the number of values in each bin. While easier to interpret, frequency histograms don't have the important property that the total area is 1, so they can't be connected to probability in the same way that density histograms can. That makes them far less useful for data scientists.
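As a rough illustration of the difference, the sketch below compares total bar area under the two conventions for the same hypothetical counts and bins (all values are made up for illustration).

```python
# Hypothetical counts for the bins [0, 1), [1, 2), [2, 4].
counts = [10, 6, 4]
edges = [0, 1, 2, 4]
widths = [edges[i + 1] - edges[i] for i in range(len(counts))]

# Frequency histogram: bar height is the raw count.
frequency_area = sum(c * w for c, w in zip(counts, widths))

# Density histogram: bar height is proportion / width.
total = sum(counts)
density_area = sum((c / total / w) * w for c, w in zip(counts, widths))

print(frequency_area)  # 24
print(density_area)
```

The frequency histogram's total area depends on the counts and bin widths, while the density histogram's total area comes out to 1 (up to floating-point rounding).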

Overlaid plots

New dataset: populations of San Diego and San Jose over time

The data for both cities comes from macrotrends.net.

Recall: line plots

Notice the optional title and legend arguments. Other useful optional arguments include figsize, xlabel, and ylabel, among many others.

Overlaying plots

If y=column_name is omitted, all columns are plotted!

Why are there only three lines shown, but four in the legend? 🤔

Selecting multiple columns at once

To plot multiple graphs at once:

The same thing works for 'barh', 'bar', and 'hist', but not 'scatter'.

New dataset: heights of children and their parents 👪 📏

Plotting overlaid histograms

alpha controls how transparent the bars are (alpha=1 is opaque, alpha=0 is transparent).

Why do children seem so much taller than their mothers?

Extra Practice

Try to answer these questions based on the overlaid histogram.

  1. What proportion of children were between 70 and 75 inches tall?

  2. What proportion of mothers were between 60 and 63 inches tall?

Answers

Question 1

The height of the $[70, 72.5)$ bar is around $0.08$, meaning that about $0.08 \cdot 2.5 = 0.2$ of children had heights in that interval. The height of the $[72.5, 75)$ bar is around $0.02$, meaning that about $0.02 \cdot 2.5 = 0.05$ of children had heights in that interval. Thus, the overall proportion of children who were between $70$ and $75$ inches tall was around $0.20 + 0.05 = 0.25$, or $25\%$.

To verify our answer, we can run

heights[(heights.get('childHeight') >= 70) & (heights.get('childHeight') < 75)].shape[0] / heights.shape[0]

Question 2

We can't tell. We could try breaking it up into the proportion of mothers in $[60, 62.5)$ and $[62.5, 63)$, but we don't know the latter. In the absence of any additional information, we can't infer anything about the distribution of values within a bin. For example, it could be that everyone in the interval $[62.5, 65)$ actually falls in the interval $[62.5, 63)$, or it could be that no one does!
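The arithmetic for Question 1 can be written as a short sketch. The bar heights are the approximate values read off the histogram above, so the result is an estimate; the variable names are only illustrative.

```python
# Approximate bar heights read off the overlaid histogram, keyed by (left, right) bin endpoints.
bar_heights = {(70, 72.5): 0.08, (72.5, 75): 0.02}

# Proportion in each bin = height * width; add the bins that make up [70, 75).
proportion = sum(h * (right - left) for (left, right), h in bar_heights.items())
print(round(proportion, 2))  # 0.25
```

This only works because 70 and 75 both fall on bin boundaries. For Question 2, 63 falls inside a bin, so no such sum is available.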

Summary, next time

Summary

Next time