# Run this cell to set up packages for lecture.
from lec07_imports import *

# There were multiple exoplanets discovered each year.
# What operation can we apply to this DataFrame so that there is one row per year?
exo = bpd.read_csv('data/exoplanets.csv').set_index('Name')
exo

exo.groupby('Year').mean()

exo.groupby('Year').mean().plot(kind='line', y='Magnitude');
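
As a minimal sketch of the same grouping idea in plain pandas (which babypandas wraps), the toy table below is made up for illustration and is not taken from exoplanets.csv. Grouping by 'Year' collapses the rows to one per year. `numeric_only=True` is passed because plain pandas may raise an error when asked to average a text column, whereas the grouped results in this lecture simply omit such columns.

```python
import pandas as pd

# Hypothetical toy data, not taken from exoplanets.csv.
toy = pd.DataFrame({
    'Name': ['a', 'b', 'c', 'd'],
    'Year': [2020, 2020, 2021, 2021],
    'Type': ['Gas Giant', 'Terrestrial', 'Gas Giant', 'Gas Giant'],
    'Magnitude': [10.0, 12.0, 8.0, 9.0],
}).set_index('Name')

# One row per year; numeric_only=True skips the text column 'Type'.
yearly = toy.groupby('Year').mean(numeric_only=True)
print(yearly)
```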

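# Template for a line plot. df and both column names are placeholders,
# so this cell is not runnable as written.
df.plot(
    kind='line',
    x=x_column_for_horizontal,
    y=y_column_for_vertical
)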

types = exo.groupby('Type').mean()
types

types.plot(kind='barh', y='Radius');

types.plot(kind='barh', y='Mass');

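# Template for a horizontal bar chart. df and both column names are placeholders,
# so this cell is not runnable as written.
df.plot(
    kind='barh',
    x=categorical_column_name,
    y=numerical_column_name
)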

# Count how many exoplanets are discovered by each detection method.
popular_detection = exo.groupby('Detection').count()
popular_detection

# Give columns more meaningful names and eliminate redundancy.
popular_detection = (popular_detection.assign(Count=popular_detection.get('Distance'))
                                      .get(['Count'])
                                      .sort_values(by='Count', ascending=False)
                    )
popular_detection

# Notice that the bars appear in the opposite order relative to the DataFrame.
popular_detection.plot(kind='barh', y='Count');

# Change "barh" to "bar" to get a vertical bar chart. 
# These are harder to read, but the bars do appear in the same order as the DataFrame.
popular_detection.plot(kind='bar', y='Count');
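
One possible way to keep a horizontal bar chart while showing the largest bar on top is to sort in ascending order before plotting, since `barh` draws the first row at the bottom. The sketch below uses plain pandas and three counts copied from the detection-method table; the name `for_barh` is made up for this example.

```python
import pandas as pd

# Three counts copied from the detection-method table in this lecture.
counts = pd.DataFrame(
    {'Count': [50, 3914, 1019]},
    index=pd.Index(['Direct Imaging', 'Transit', 'Radial Velocity'], name='Detection'),
)

# Descending order reads naturally as a table...
top_down = counts.sort_values(by='Count', ascending=False)

# ...but ascending order places the largest count in the last row,
# which a horizontal bar chart draws at the top.
for_barh = counts.sort_values(by='Count', ascending=True)
print(for_barh)
```

Calling `.plot(kind='barh', y='Count')` on `for_barh` should then draw the Transit bar at the top.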

exo

# Remember, when we group and use .count(), the column names aren't meaningful.
type_counts = exo.groupby('Type').count()
type_counts

# As a result, we could have set y='Magnitude', for example, and gotten the same plot.
type_counts.plot(kind='barh', y='Distance', 
                 legend=False, xlabel='Count', title='Distribution of Exoplanet Types');
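
To see why any column works after `.count()`, here is a small sketch in plain pandas using three rows copied from the exoplanets table. Each column of the grouped result holds the number of non-missing entries per group, so the columns agree whenever no values are missing.

```python
import pandas as pd

# Three rows copied from the exoplanets table (a tiny sample, for illustration only).
toy = pd.DataFrame({
    'Type': ['Gas Giant', 'Terrestrial', 'Gas Giant'],
    'Distance': [304.0, 12.0, 409.0],
    'Magnitude': [4.72, 12.07, 5.01],
})

type_counts = toy.groupby('Type').count()

# Every column holds the same per-group counts when nothing is missing.
same = (type_counts.get('Distance') == type_counts.get('Magnitude')).all()
print(type_counts)
print(same)
```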

exo.groupby('Type').mean().get('Radius')

Type
Gas Giant       12.74
Neptune-like     3.11
Super Earth      1.58
Terrestrial      0.85
Name: Radius, dtype: float64

terr = exo[exo.get('Type') == 'Terrestrial']
terr

terr.get('Radius').describe()

count    193.00
mean       0.85
std        0.26
          ...  
50%        0.86
75%        0.92
max        3.13
Name: Radius, Length: 8, dtype: float64

terr_radius = terr.groupby('Radius').count()
terr_radius = (terr_radius
                 .assign(Count=terr_radius.get('Distance'))
                 .get(['Count'])
              )
terr_radius

terr_radius.plot(kind='bar', y='Count', figsize=(15, 5));

# Ignore the code for right now.
terr.plot(kind='hist', y='Radius', density=True, bins=np.arange(0, 3.5, 0.25), ec='w');

# There are 7 terrestrial exoplanets with a radius of exactly 1.0,
# but the height of the bar starting at 1.0 is not 7!
terr[terr.get('Radius') == 1]

binning_animation()
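
As a rough sketch of what binning does, the example below counts made-up radii into bins with `np.histogram`. The values and edges are hypothetical. Each bin includes its left edge and excludes its right edge, except the last bin, which includes both.

```python
import numpy as np

# Hypothetical radii, chosen so that some values sit exactly on bin edges.
radii = np.array([0.4, 0.5, 0.9, 1.0, 1.0, 1.2, 2.0])
edges = np.array([0.0, 0.5, 1.0, 1.5, 2.0])

# Bins are [0, 0.5), [0.5, 1.0), [1.0, 1.5), and [1.5, 2.0].
# The last bin is closed on both ends, so 2.0 is counted in it.
counts, _ = np.histogram(radii, bins=edges)
print(counts)
```

Here the two values equal to 1.0 land in the bin that starts at 1.0, together with 1.2, which is one reason a bar's size need not match the count of any single value.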

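# Template for a density histogram. df and column_name are placeholders,
# so this cell is not runnable as written.
df.plot(
    kind='hist',
    y=column_name,
    density=True
)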

# There are 10 bins by default, some of which are empty.
terr.plot(kind='hist', y='Radius', density=True, ec='w');

terr.plot(kind='hist', y='Radius', density=True, bins=20, ec='w');

terr.plot(kind='hist', y='Radius', density=True, bins=[0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5], ec='w');

# Same bin edges as the list above: 0, 0.5, ..., 3.5.
terr.plot(kind='hist', y='Radius', density=True,
          bins=np.arange(0, 4, 0.5),
          ec='w');
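
When bin edges are generated with `np.arange`, keep in mind that the stop value is excluded. The short sketch below compares two calls; the variable names are made up for this example.

```python
import numpy as np

# Stop value 3.5 is excluded, so the last edge is 3.0.
edges_short = np.arange(0, 3.5, 0.5)

# Stopping at 4 yields edges up to and including 3.5.
edges_full = np.arange(0, 4, 0.5)

print(edges_short)
print(edges_full)
```

Values beyond the last edge are left out of the histogram, so it is worth checking that the largest value (3.13 for the terrestrial radii) falls inside the final bin.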

terr.sort_values('Radius', ascending=False)

terr.plot(kind='hist', y='Radius', density=True,
          bins=[0, 0.25, 0.5, 0.75, 2, 4], ec='w');

in_range = terr[(terr.get('Radius') >= 0.5) & (terr.get('Radius') < 0.75)].shape[0]
in_range

39

in_range / terr.shape[0]

0.20207253886010362
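
The proportion above is the area of the bar for the bin [0.5, 0.75). A minimal sketch of turning that area into a bar height, using the count (39) and total (193) from the cells above, divides the area by the bin width:

```python
# Numbers taken from the cells above: 39 of the 193 terrestrial planets
# have a radius in the bin [0.5, 0.75).
count_in_bin = 39
total = 193
left, right = 0.5, 0.75

proportion = count_in_bin / total       # area of the bar
height = proportion / (right - left)    # area divided by width
print(height)
```

The result is roughly 0.81, which can be compared against the bar over [0.5, 0.75) in the histogram.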

terr.plot(kind='hist', y='Radius', density=True,
          bins=[0, 0.25, 0.5, 0.75, 2, 4], ec='w');
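
To check that the total area is 1 even when the bins have unequal widths, the sketch below uses the same bin edges with made-up values (not the real radii) and `np.histogram(..., density=True)`.

```python
import numpy as np

# Hypothetical values; only the bin edges match the histogram above.
values = np.array([0.1, 0.3, 0.6, 0.7, 0.9, 1.5, 3.0, 3.9])
edges = np.array([0, 0.25, 0.5, 0.75, 2, 4])

# density=True returns bar heights, that is, proportion divided by width.
heights, _ = np.histogram(values, bins=edges, density=True)
widths = np.diff(edges)

# Area of each bar is height times width; the areas should sum to 1.
total_area = (heights * widths).sum()
print(heights)
print(total_area)
```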

exo

types = exo.groupby('Type').mean()
types

types.get(['Magnitude', 'Radius']).plot(kind='barh');

types

types.plot(kind='barh');

types

types.get(['Magnitude', 'Radius'])

types.get(['Magnitude', 'Radius']).plot(kind='barh');

mother_child = bpd.read_csv('data/galton.csv').get(['mother', 'child'])
mother_child

height_bins = np.arange(55, 80, 2.5)
mother_child.plot(kind='hist', density=True, ec='w',
                  alpha=0.65, bins=height_bins);

	Distance	Magnitude	Type	Year	Detection	Mass	Radius
Name
11 Comae Berenices b	304.0	4.72	Gas Giant	2007	Radial Velocity	6165.90	11.88
11 Ursae Minoris b	409.0	5.01	Gas Giant	2009	Radial Velocity	4684.81	11.99
14 Andromedae b	246.0	5.23	Gas Giant	2008	Radial Velocity	1525.58	12.65
...	...	...	...	...	...	...	...
YZ Ceti b	12.0	12.07	Terrestrial	2017	Radial Velocity	0.70	0.91
YZ Ceti c	12.0	12.07	Super Earth	2017	Radial Velocity	1.14	1.05
YZ Ceti d	12.0	12.07	Super Earth	2017	Radial Velocity	1.09	1.03

	Distance	Magnitude	Mass	Radius
Year
1995	50.00	5.45	146.20	13.97
1996	51.33	5.12	1020.67	13.09
1997	57.00	5.41	332.10	13.53
...	...	...	...	...
2021	1944.22	13.01	255.42	4.44
2022	508.61	10.62	943.16	6.77
2023	451.89	12.09	162.78	7.12

	Distance	Magnitude	Year	Mass	Radius
Type
Gas Giant	1096.40	10.30	2013.73	1472.39	12.74
Neptune-like	2189.02	13.52	2016.59	15.28	3.11
Super Earth	1916.26	13.85	2016.43	5.81	1.58
Terrestrial	1373.60	13.45	2016.37	1.62	0.85

	Distance	Magnitude	Type	Year	Mass	Radius
Detection
Astrometry	1	1	1	1	1	1
Direct Imaging	50	50	50	50	50	50
Disk Kinematics	1	1	1	1	1	1
...	...	...	...	...	...	...
Radial Velocity	1019	1019	1019	1019	1019	1019
Transit	3914	3914	3914	3914	3914	3914
Transit Timing Variations	23	23	23	23	23	23

	Count
Detection
Transit	3914
Radial Velocity	1019
Direct Imaging	50
...	...
Astrometry	1
Disk Kinematics	1
Pulsar Timing	1

	Distance	Magnitude	Year	Detection	Mass	Radius
Type
Gas Giant	1480	1480	1480	1480	1480	1480
Neptune-like	1793	1793	1793	1793	1793	1793
Super Earth	1577	1577	1577	1577	1577	1577
Terrestrial	193	193	193	193	193	193

	Distance	Magnitude	Type	Year	Detection	Mass	Radius
Name
EPIC 201497682 b	825.0	13.95	Terrestrial	2019	Transit	0.26	0.69
EPIC 201757695.02	1884.0	14.97	Terrestrial	2020	Transit	0.69	0.91
EPIC 201833600 c	840.0	14.71	Terrestrial	2019	Transit	0.97	1.00
...	...	...	...	...	...	...	...
TRAPPIST-1 e	41.0	17.02	Terrestrial	2017	Transit	0.69	0.92
TRAPPIST-1 h	41.0	17.02	Terrestrial	2017	Transit	0.33	0.76
YZ Ceti b	12.0	12.07	Terrestrial	2017	Radial Velocity	0.70	0.91

	Distance	Magnitude	Type	Year	Detection	Mass	Radius
Name
Kepler-33 c	3944.0	14.10	Terrestrial	2011	Transit	0.39	3.13
K2-138 f	661.0	12.25	Terrestrial	2017	Transit	1.63	2.85
Kepler-11 b	2108.0	13.82	Terrestrial	2010	Transit	1.90	1.80
...	...	...	...	...	...	...	...
Kepler-102 b	352.0	12.07	Terrestrial	2014	Transit	4.30	0.47
Kepler-444 b	119.0	8.87	Terrestrial	2015	Transit	0.04	0.40
Kepler-37 e	209.0	9.77	Terrestrial	2014	Transit Timing Variations	0.03	0.37

Bar chart	Histogram
Shows the distribution of a categorical variable	Shows the distribution of a numerical variable
Plotted from 2 columns of a DataFrame	Plotted from 1 column of a DataFrame
1 categorical axis, 1 numerical axis	2 numerical axes
Bars have arbitrary, but equal, widths and spacing	Horizontal axis is numerical and to scale
Lengths of bars are proportional to the numerical quantity of interest	Height measures density; the area of each bar is the proportion (percent) of individuals in its bin

	mother	child
0	67.0	73.2
1	67.0	69.2
2	67.0	69.0
...	...	...
931	66.0	61.0
932	63.0	66.5
933	63.0	57.0

Bin	Height of Bar
[3, 7)	0.05
[7, 10)	0.1
[10, 12)	0.15
[12, 16]	$X$
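
Assuming the question behind this table asks for the height $X$ that makes it a valid density histogram, one way to reason is that the areas of all bars must sum to 1. The sketch below works through that arithmetic; the variable names are made up for this example.

```python
# Bins and known heights copied from the table above.
bins = [(3, 7), (7, 10), (10, 12), (12, 16)]
known_heights = [0.05, 0.1, 0.15]

# Area of each known bar is width times height (zip stops after three bars).
known_area = sum((right - left) * h for (left, right), h in zip(bins, known_heights))

# The last bar must supply the remaining area; divide by its width to get X.
last_left, last_right = bins[-1]
x = (1 - known_area) / (last_right - last_left)
print(round(x, 4))
```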

Lecture 7 – Distributions and Histograms

DSC 10, Spring 2024

Announcements

Agenda

Line plots 📉

Line plots

Line plots

Bar charts 📊

Bar charts

Bar charts

Bar charts and sorting

Distributions

What is the distribution of a variable?

Categorical variables

Terrestrial exoplanets 🌑

Visualizing the distribution of 'Radius', a numerical variable

Density histograms

Density histograms show the distribution of numerical variables

First key idea behind histograms: Binning 🗑️

Plotting a density histogram

Customizing the bins

Observations

Bin details

Second key idea behind histograms: Total area is 1

Example calculation

Example calculation

Check the math 🧮

Calculating heights in a density histogram

Concept Check ✅ – Answer at cc.dsc10.com

Review: Types of visualizations

Bar charts vs. histograms

🌟 Important 🌟

Overlaid plots

Multiple plots on the same axes

Overlaying plots

Selecting multiple columns at once

Plotting multiple graphs at once

Another example: Heights of children and their parents 👪 📏

Plotting overlaid histograms

Extra Practice

Summary, next time

Summary

Next time
