In [1]:
# Set up packages for lecture. Don't worry about understanding this code, but
# make sure to run it if you're following along.
import numpy as np
import babypandas as bpd
import pandas as pd

import matplotlib.pyplot as plt
from matplotlib_inline.backend_inline import set_matplotlib_formats
set_matplotlib_formats("svg")
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (10, 5)

np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option("display.max_rows", 7)
pd.set_option("display.max_columns", 8)
pd.set_option("display.precision", 2)

from IPython.display import HTML, display, IFrame

Lecture 6 – Data Visualization 📈¶

DSC 10, Winter 2023¶

Announcements¶

  • Homework 1 is due tomorrow at 11:59PM.
  • Lab 2 is due Saturday at 11:59PM.
  • Come to office hours for help! Mine are today from 12:30 to 2, not on Friday this week. See the calendar for directions.

Aside: keyboard shortcuts¶

There are several keyboard shortcuts built into Jupyter Notebooks designed to help you save time. To see them, either click the keyboard button in the toolbar above or hit the H key on your keyboard (as long as you're not actively editing a cell).

Particularly useful shortcuts:

Action | Keyboard shortcut
Run cell + jump to next cell | SHIFT + ENTER
Save the notebook | CTRL/CMD + S
Create new cell above/below | A/B
Delete cell | DD

Agenda¶

  • Why visualize?
  • Terminology.
  • Scatter plots.
  • Line plots.
  • Bar charts.

Don't forget about the DSC 10 Reference Sheet and the Resources tab of the course website!

Why visualize?¶

Run these cells to load the Little Women data from Lecture 1.

In [2]:
chapters = open('data/lw.txt').read().split('CHAPTER ')[1:]
In [3]:
# Counts of names in the chapters of Little Women

counts = bpd.DataFrame().assign(
    Amy=np.char.count(chapters, 'Amy'),
    Beth=np.char.count(chapters, 'Beth'),
    Jo=np.char.count(chapters, 'Jo'),
    Meg=np.char.count(chapters, 'Meg'),
    Laurie=np.char.count(chapters, 'Laurie'),
)

# cumulative number of times each name appears

lw_counts = bpd.DataFrame().assign(
    Amy=np.cumsum(counts.get('Amy')),
    Beth=np.cumsum(counts.get('Beth')),
    Jo=np.cumsum(counts.get('Jo')),
    Meg=np.cumsum(counts.get('Meg')),
    Laurie=np.cumsum(counts.get('Laurie')),
    Chapter=np.arange(1, 48, 1)
)

lw_counts
Out[3]:
Amy Beth Jo Meg Laurie Chapter
0 23 26 44 26 0 1
1 36 38 65 46 0 2
2 38 40 127 82 16 3
... ... ... ... ... ... ...
44 633 461 1450 675 581 45
45 635 462 1506 679 583 46
46 645 465 1543 685 596 47

47 rows × 6 columns
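
The two NumPy steps used above (counting a name in each chapter, then accumulating the counts) can be sketched on a tiny example. The chapter strings below are made up for illustration and are not taken from the novel.

```python
import numpy as np

# Hypothetical stand-ins for chapter texts (not from Little Women).
toy_chapters = ['Jo met Amy. Jo laughed.', 'Amy wrote to John.', 'Beth played for Jo.']

# np.char.count counts occurrences of a substring within each string.
# It matches substrings, so 'John' also counts as one occurrence of 'Jo'.
jo_per_chapter = np.char.count(toy_chapters, 'Jo')

# np.cumsum turns per-chapter counts into running totals.
jo_running_total = np.cumsum(jo_per_chapter)

print(jo_per_chapter)
print(jo_running_total)
```

Because the match is on substrings, counts for a short name like 'Jo' may be slightly inflated by longer words that contain it.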

Little Women¶

In Lecture 1, we were able to answer questions about the plot of Little Women without having to read the novel and without having to understand Python code. Some of those questions included:

  • Who is the main character?
  • Which pair of characters gets married in Chapter 35?

We answered these questions from a data visualization alone!

In [4]:
lw_counts.plot(x='Chapter');

Napoleon's March¶

Why visualize?¶

  • Computers are better than humans at crunching numbers, but humans are better at identifying visual patterns.
  • Visualizations allow us to understand lots of data quickly – they make it easier to spot trends and communicate our results with others.
  • There are many types of visualizations; in this class, we'll look at scatter plots, line plots, bar charts, and histograms, but there are many others.
    • The right choice depends on the type of data.

Terminology¶

Individuals and variables¶

  • Individual (row): Person/place/thing for which data is recorded. Also called an observation.
  • Variable (column): Something that is recorded for each individual. Also called a feature.

Types of variables¶

There are two main types of variables:

  • Numerical: It makes sense to do arithmetic with the values.
  • Categorical: Values fall into categories, which may or may not have an order.

Examples of numerical variables¶

  • Salaries of NBA players 🏀.
    • Individual: an NBA player.
    • Variable: their salary.
  • Movie gross earnings 💰.
    • Individual: a movie.
    • Variable: its gross earnings.
  • Booster doses administered per day 💉.
    • Individual: date.
    • Variable: number of booster doses administered on that date.

Examples of categorical variables¶

  • Movie genres 🎬.
    • Individual: a movie.
    • Variable: its genre.
  • Zip codes 🏠.
    • Individual: US resident.
    • Variable: zip code.
      • Even though they look like numbers, zip codes are categorical (arithmetic doesn't make sense).
  • Level of prior programming experience for students in DSC 10 🧑‍🎓.
    • Individual: student in DSC 10.
    • Variable: their level of prior programming experience, e.g. none, low, medium, or high.
      • There is an order to these categories!
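
One way to see the difference between the two types is to ask which operations give a meaningful answer. The sketch below uses only the Python standard library and made-up values; the variable names are illustrative assumptions.

```python
from collections import Counter
from statistics import mean

# Hypothetical values, made up for illustration.
salaries = [1_200_000, 8_500_000, 30_000_000]  # numerical
zip_codes = ['92093', '92037', '92093']        # categorical, stored as strings

# Arithmetic is meaningful for a numerical variable.
average_salary = mean(salaries)

# For a categorical variable, counting values per category is what makes sense.
zip_code_counts = Counter(zip_codes)

print(average_salary)
print(zip_code_counts.most_common(1))
```

Storing zip codes as strings is a design choice that makes accidental arithmetic on them harder.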

Concept Check ✅ – Answer at cc.dsc10.com¶

Which of these is not a numerical variable?

A. Fuel economy in miles per gallon.

B. Number of quarters at UCSD.

C. College at UCSD (Sixth, Seventh, etc).

D. Bank account number.

E. More than one of these are not numerical variables.

Types of visualizations¶

The type of visualization we create depends on the kinds of variables we're visualizing.

  • Scatter plot: numerical vs. numerical.
  • Line plot: sequential numerical (time) vs. numerical.
  • Bar chart: categorical vs. numerical.
  • Histogram: numerical.
    • Will cover next time.

Note: We may interchange the words "plot", "chart", and "graph"; they all mean the same thing.

Scatter plots¶

Dataset of 50 top-grossing actors¶

Column | Contents
'Actor' | Name of actor
'Total Gross' | Total gross domestic box office receipt, in millions of dollars, of all of the actor’s movies
'Number of Movies' | The number of movies the actor has been in
'Average per Movie' | Total gross divided by number of movies
'#1 Movie' | The highest grossing movie the actor has been in
'Gross' | Gross domestic box office receipt, in millions of dollars, of the actor’s #1 Movie

In [5]:
actors = bpd.read_csv('data/actors.csv').set_index('Actor')
actors
Out[5]:
Total Gross Number of Movies Average per Movie #1 Movie Gross
Actor
Harrison Ford 4871.7 41 118.8 Star Wars: The Force Awakens 936.7
Samuel L. Jackson 4772.8 69 69.2 The Avengers 623.4
Morgan Freeman 4468.3 61 73.3 The Dark Knight 534.9
... ... ... ... ... ...
Sandra Bullock 2462.6 35 70.4 Minions 336.0
Chris Evans 2457.8 23 106.9 The Avengers 623.4
Anne Hathaway 2416.5 25 96.7 The Dark Knight Rises 448.1

50 rows × 5 columns

Scatter plots¶

What is the relationship between 'Number of Movies' and 'Total Gross'?

In [6]:
actors.plot(kind='scatter', x='Number of Movies', y='Total Gross');

Scatter plots¶

  • Scatter plots visualize the relationship between two numerical variables.
  • To create one from a DataFrame df, use
    df.plot(
      kind='scatter', 
      x=x_column_for_horizontal, 
      y=y_column_for_vertical
    )
  • The resulting scatter plot has one point per row of df.
  • If you put a semicolon after a call to .plot, it hides the text description of the plot object that Jupyter would otherwise print above the plot.
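
The "one point per row" claim can be checked on a small example. This is a minimal sketch that assumes plain pandas rather than babypandas (the .plot call has the same form) and uses a made-up DataFrame named toy, not the actors data.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so the sketch also runs outside a notebook
import pandas as pd

# Hypothetical data, not the actors dataset.
toy = pd.DataFrame({
    'Number of Movies': [10, 25, 40, 60],
    'Total Gross': [900.0, 2100.0, 3300.0, 4100.0],
})

ax = toy.plot(kind='scatter', x='Number of Movies', y='Total Gross')

# The plotted points are stored as (x, y) pairs, one pair per row of toy.
points = ax.collections[0].get_offsets()
print(len(points), 'points for', toy.shape[0], 'rows')
```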

Scatter plots¶

What is the relationship between 'Number of Movies' and 'Average per Movie'?

In [7]:
actors.plot(kind='scatter', x='Number of Movies', y='Average per Movie');

Note that in the above plot, there's a negative association and an outlier.

Who was in 60 or more movies?¶

In [8]:
actors[actors.get('Number of Movies') >= 60]
Out[8]:
Total Gross Number of Movies Average per Movie #1 Movie Gross
Actor
Samuel L. Jackson 4772.8 69 69.2 The Avengers 623.4
Morgan Freeman 4468.3 61 73.3 The Dark Knight 534.9
Bruce Willis 3189.4 60 53.2 Sixth Sense 293.5
Robert DeNiro 3081.3 79 39.0 Meet the Fockers 279.3
Liam Neeson 2942.7 63 46.7 The Phantom Menace 474.5

Who is the outlier?¶

Whoever they are, they were in very few movies, and those movies grossed a lot on average.

In [9]:
actors[actors.get('Number of Movies') < 10]
Out[9]:
Total Gross Number of Movies Average per Movie #1 Movie Gross
Actor
Anthony Daniels 3162.9 7 451.8 Star Wars: The Force Awakens 936.7

Anthony Daniels¶

Line plots 📉¶

Dataset aggregating movies by year¶

Column | Content
'Year' | Year
'Total Gross in Billions' | Total domestic box office gross, in billions of dollars, of all movies released
'Number of Movies' | Number of movies released
'#1 Movie' | Highest grossing movie

In [10]:
movies_by_year = bpd.read_csv('data/movies_by_year.csv').set_index('Year')
movies_by_year
Out[10]:
Total Gross in Billions Number of Movies #1 Movie
Year
2022 5.64 380 Top Gun: Maverick
2021 4.48 439 Spider-Man: No Way Home
2020 2.11 456 Bad Boys for Life
... ... ... ...
1979 1.23 40 Superman
1978 0.83 13 Grease
1977 0.44 9 Star Wars: Episode IV - A New Hope

46 rows × 3 columns

Line plots¶

How has the number of movies changed over time? 🤔

In [11]:
movies_by_year.plot(kind='line', y='Number of Movies');

Line plots¶

  • Line plots show trends in numerical variables over time.
  • To create one from a DataFrame df, use
    df.plot(
      kind='line', 
      x=x_column_for_horizontal, 
      y=y_column_for_vertical
    )
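
A minimal sketch of the template above, assuming plain pandas in place of babypandas and a made-up DataFrame named toy_years in which 'Year' is an ordinary column:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so the sketch also runs outside a notebook
import pandas as pd

# Hypothetical data, not the movies_by_year dataset.
toy_years = pd.DataFrame({
    'Year': [2019, 2020, 2021, 2022],
    'Number of Movies': [50, 20, 35, 45],
})

ax = toy_years.plot(kind='line', x='Year', y='Number of Movies')

# The line connects the (x, y) pairs in row order.
line = ax.get_lines()[0]
x_values = list(line.get_xdata())
y_values = list(line.get_ydata())
print(x_values)
print(y_values)
```

Because the points are connected in row order, a line plot is easiest to read when the rows are already sorted by the x column.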

Plotting tip¶

  • Tip: if you want the x-axis to be the index, omit the x= argument!
  • This doesn't work for scatter plots, but it works for most other plot types.
In [12]:
movies_by_year.plot(kind='line', y='Number of Movies');

Since the year 2000¶

We can create a line plot of just 2000 onwards by querying movies_by_year before calling .plot.

In [13]:
movies_by_year[movies_by_year.index >= 2000].plot(kind='line', y='Number of Movies');

What do you think explains the declines around 2008 and 2020?

How did this affect total gross?¶

In [14]:
movies_by_year[movies_by_year.index >= 2000].plot(kind='line', y='Total Gross in Billions');

What was the top grossing movie of 2016? 🐟¶

In [15]:
...
Out[15]:
Ellipsis

Bar charts 📊¶

Dataset of the top 200 songs in the US on Spotify as of Saturday (1/21/23)¶

Downloaded from here – check it out!

In [16]:
charts = (bpd.read_csv('data/regional-us-daily-2023-01-21.csv')
          .set_index('rank')
          .get(['track_name', 'artist_names', 'streams', 'uri'])
         )
charts
Out[16]:
track_name artist_names streams uri
rank
1 Flowers Miley Cyrus 3356361 spotify:track:0yLdNVWF3Srea0uzk55zFn
2 Kill Bill SZA 2479445 spotify:track:1Qrg8KqiBpW07V7PNxwwwL
3 Creepin' (with The Weeknd & 21 Savage) Metro Boomin, The Weeknd, 21 Savage 1337320 spotify:track:2dHHgzDwk4BJdRwy9uXhTO
... ... ... ... ...
198 Major Distribution Drake, 21 Savage 266986 spotify:track:46s57QULU02Voy0Kup6UEb
199 Sun to Me Zach Bryan 266968 spotify:track:1SjsVdSXpwm1kTdYEHoPIT
200 The Real Slim Shady Eminem 266698 spotify:track:3yfqSUWxFvZELEM4PmlwIR

200 rows × 4 columns

Bar charts¶

How many streams do the top 10 songs have?

In [17]:
charts
Out[17]:
track_name artist_names streams uri
rank
1 Flowers Miley Cyrus 3356361 spotify:track:0yLdNVWF3Srea0uzk55zFn
2 Kill Bill SZA 2479445 spotify:track:1Qrg8KqiBpW07V7PNxwwwL
3 Creepin' (with The Weeknd & 21 Savage) Metro Boomin, The Weeknd, 21 Savage 1337320 spotify:track:2dHHgzDwk4BJdRwy9uXhTO
... ... ... ... ...
198 Major Distribution Drake, 21 Savage 266986 spotify:track:46s57QULU02Voy0Kup6UEb
199 Sun to Me Zach Bryan 266968 spotify:track:1SjsVdSXpwm1kTdYEHoPIT
200 The Real Slim Shady Eminem 266698 spotify:track:3yfqSUWxFvZELEM4PmlwIR

200 rows × 4 columns

In [18]:
charts.take(np.arange(10))
Out[18]:
track_name artist_names streams uri
rank
1 Flowers Miley Cyrus 3356361 spotify:track:0yLdNVWF3Srea0uzk55zFn
2 Kill Bill SZA 2479445 spotify:track:1Qrg8KqiBpW07V7PNxwwwL
3 Creepin' (with The Weeknd & 21 Savage) Metro Boomin, The Weeknd, 21 Savage 1337320 spotify:track:2dHHgzDwk4BJdRwy9uXhTO
... ... ... ... ...
8 Anti-Hero Taylor Swift 936166 spotify:track:0V3wPSX9ygBnCm8psDIegu
9 golden hour JVKE 870031 spotify:track:5odlY52u43F5BjByhxg7wg
10 Unholy (feat. Kim Petras) Sam Smith, Kim Petras 859271 spotify:track:3nqQXoyQOWXiESFLlDF1hG

10 rows × 4 columns

In [19]:
charts.take(np.arange(10)).plot(kind='barh', x='track_name', y='streams');

Bar charts¶

  • Bar charts visualize the relationship between a categorical variable and a numerical variable.
  • In a bar chart...
    • The thickness and spacing of bars are arbitrary.
    • The order of the categorical labels doesn't matter.
  • To create one from a DataFrame df, use
    df.plot(
      kind='barh', 
      x=categorical_column_name, 
      y=numerical_column_name
    )
  • The "h" in 'barh' stands for "horizontal".
    • It's easier to read labels this way.
  • In the previous chart, we set y='streams' even though the number of streams is shown along the horizontal axis.
In [20]:
# The bars appear in the opposite order relative to the DataFrame
(charts
 .take(np.arange(10))
 .sort_values(by='streams')
 .plot(kind='barh', x='track_name', y='streams')
);
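
The bar order in a horizontal bar chart can be inspected directly. The sketch below assumes plain pandas rather than babypandas and a made-up DataFrame named toy_charts; it reads the labels off the vertical axis from bottom to top.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so the sketch also runs outside a notebook
import pandas as pd

# Hypothetical data, not the Spotify charts.
toy_charts = pd.DataFrame({
    'track_name': ['Song A', 'Song B', 'Song C'],
    'streams': [300, 200, 100],
})

ax = toy_charts.plot(kind='barh', x='track_name', y='streams')

# Bars are placed at positions 0, 1, 2 on the vertical axis in row order.
# That axis runs from bottom to top, so the first row becomes the bottom bar.
bottom_to_top = [label.get_text() for label in ax.get_yticklabels()]
bar_lengths = [bar.get_width() for bar in ax.patches]

# Sorting in increasing order first puts the largest value in the top bar.
ax_sorted = (toy_charts
             .sort_values(by='streams')
             .plot(kind='barh', x='track_name', y='streams'))
sorted_bottom_to_top = [label.get_text() for label in ax_sorted.get_yticklabels()]

print(bottom_to_top)
print(sorted_bottom_to_top)
```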

How many songs do the top 15 artists have in the top 200?¶

First, let's create a DataFrame with a single column that describes the number of songs in the top 200 per artist. This involves using .groupby with .count(). Since we want one row per artist, we will group by 'artist_names'.

In [21]:
charts
Out[21]:
track_name artist_names streams uri
rank
1 Flowers Miley Cyrus 3356361 spotify:track:0yLdNVWF3Srea0uzk55zFn
2 Kill Bill SZA 2479445 spotify:track:1Qrg8KqiBpW07V7PNxwwwL
3 Creepin' (with The Weeknd & 21 Savage) Metro Boomin, The Weeknd, 21 Savage 1337320 spotify:track:2dHHgzDwk4BJdRwy9uXhTO
... ... ... ... ...
198 Major Distribution Drake, 21 Savage 266986 spotify:track:46s57QULU02Voy0Kup6UEb
199 Sun to Me Zach Bryan 266968 spotify:track:1SjsVdSXpwm1kTdYEHoPIT
200 The Real Slim Shady Eminem 266698 spotify:track:3yfqSUWxFvZELEM4PmlwIR

200 rows × 4 columns

In [22]:
songs_per_artist = charts.groupby('artist_names').count()
songs_per_artist
Out[22]:
track_name streams uri
artist_names
21 Savage, Metro Boomin 1 1 1
80purppp 1 1 1
A Boogie Wit da Hoodie 1 1 1
... ... ... ...
Zach Bryan 4 4 4
d4vd 2 2 2
Ñengo Flow, Bad Bunny 1 1 1

145 rows × 3 columns

Using .sort_values and .take, we'll keep just the top 15 artists. Note that all columns in songs_per_artist contain the same information (this is a consequence of using .count()).

In [23]:
top_15_artists = (songs_per_artist
                  .sort_values('streams', ascending=False)
                  .take(np.arange(15)))
top_15_artists
Out[23]:
track_name streams uri
artist_names
SZA 11 11 11
Taylor Swift 8 8 8
Morgan Wallen 6 6 6
... ... ... ...
Kanye West 2 2 2
Childish Gambino 2 2 2
NewJeans 2 2 2

15 rows × 3 columns

Using .assign and .get, we'll create a column named 'count' that contains the same information as the other 3 columns, and then keep only that column (equivalently, we could drop the other 3 columns).

In [24]:
# If we give .get a list, it will return a DataFrame instead of a Series!
top_15_artists = (top_15_artists
                  .assign(count=top_15_artists.get('streams'))
                  .get(['count']))
top_15_artists
Out[24]:
count
artist_names
SZA 11
Taylor Swift 8
Morgan Wallen 6
... ...
Kanye West 2
Childish Gambino 2
NewJeans 2

15 rows × 1 columns
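
The whole pipeline (group, count, sort, take, rename) can be traced on a small example. This sketch assumes plain pandas, which happens to support the same .groupby, .sort_values, .take, .assign, and .get calls used above; the data and the name top_2 are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical chart with six songs by three artists.
toy_charts = pd.DataFrame({
    'track_name': ['s1', 's2', 's3', 's4', 's5', 's6'],
    'artist_names': ['X', 'Y', 'X', 'Z', 'X', 'Y'],
    'streams': [90, 80, 70, 60, 50, 40],
})

# One row per artist; every remaining column holds the number of songs.
songs_per_artist = toy_charts.groupby('artist_names').count()

# Keep the two artists with the most songs.
top_2 = (songs_per_artist
         .sort_values('streams', ascending=False)
         .take(np.arange(2)))

# Copy the counts into a clearly named column and keep only that column.
top_2 = top_2.assign(count=top_2.get('streams')).get(['count'])
print(top_2)
```

After .count(), the 'streams' column no longer holds stream totals; it holds the number of rows per artist, which is why it can be copied into 'count'.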

Before calling .plot(kind='barh', y='count'), we'll sort top_15_artists by 'count' in increasing order. This is because a horizontal bar chart draws the first row at the bottom, so the bars appear from top to bottom in the reverse of the row order.

In [25]:
top_15_artists.sort_values(by='count').plot(kind='barh', y='count');

Vertical bar charts¶

To create a vertical bar chart, use kind='bar' instead of kind='barh'. These are typically harder to read, though.

In [26]:
top_15_artists.plot(kind='bar', y='count');

Aside: How many streams did The Weeknd's songs on the chart receive?¶

In [27]:
(charts
 [charts.get('artist_names') == 'The Weeknd']
 .sort_values('streams')
 .plot(kind='barh', x='track_name', y='streams')
);

It seems like we're missing some popular songs...

How do we include featured songs, as well?¶

Answer: Using .str.contains.

In [28]:
weeknd = charts[charts.get('artist_names').str.contains('The Weeknd')]
weeknd
Out[28]:
track_name artist_names streams uri
rank
3 Creepin' (with The Weeknd & 21 Savage) Metro Boomin, The Weeknd, 21 Savage 1337320 spotify:track:2dHHgzDwk4BJdRwy9uXhTO
13 Die For You The Weeknd 794924 spotify:track:2LBqCSwhJGcFQeTHMVGwy3
76 Stargirl Interlude The Weeknd, Lana Del Rey 372624 spotify:track:5gDWsRxpJ2lZAffh5p7K0w
... ... ... ... ...
110 I Was Never There The Weeknd, Gesaffelstein 328724 spotify:track:1cKHdTo9u0ZymJdPGSh6nq
128 Blinding Lights The Weeknd 311176 spotify:track:0VjIjW4GlUZAMYd2vXMi3b
168 Call Out My Name The Weeknd 281141 spotify:track:09mEdoA6zrmBPgTEN5qXmN

8 rows × 4 columns

In [29]:
weeknd.sort_values('streams').plot(kind='barh', x='track_name', y='streams');
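
The difference between the == query and the .str.contains query can be seen on a short Series. This is a sketch on made-up rows, assuming plain pandas; the regex=False argument is an addition here that asks for a literal substring match instead of a regular expression.

```python
import pandas as pd

# Hypothetical 'artist_names' values.
artists = pd.Series([
    'The Weeknd',
    'Metro Boomin, The Weeknd, 21 Savage',
    'SZA',
    'The Weeknd, Lana Del Rey',
])

# == is True only when the entire string matches exactly.
exact = artists == 'The Weeknd'

# .str.contains is True whenever the text appears anywhere in the string.
contains = artists.str.contains('The Weeknd', regex=False)

print(exact.sum(), 'exact match')
print(contains.sum(), 'rows mentioning The Weeknd')
```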

Fun demo 🎵¶

In [30]:
# Run this cell, don't worry about what it does.
def show_spotify(uri):
    code = uri[uri.rfind(':')+1:]
    src = f"https://open.spotify.com/embed/track/{code}"
    width = 400
    height = 75
    display(IFrame(src, width, height))
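
For the curious, the string handling inside show_spotify can be followed step by step. The sketch below reuses a URI from this lecture and introduces a helper variable, last_colon, purely for illustration.

```python
uri = 'spotify:track:3qoftcUZaUOncvIYjFSPdE'

# rfind returns the position of the last ':' in the string.
last_colon = uri.rfind(':')

# Slicing from one past that position keeps only the track code.
code = uri[last_colon + 1:]

# The code is then placed into the embed URL used by show_spotify.
src = f"https://open.spotify.com/embed/track/{code}"
print(code)
print(src)
```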

Let's find the URI of a song we care about.¶

In [31]:
charts
Out[31]:
track_name artist_names streams uri
rank
1 Flowers Miley Cyrus 3356361 spotify:track:0yLdNVWF3Srea0uzk55zFn
2 Kill Bill SZA 2479445 spotify:track:1Qrg8KqiBpW07V7PNxwwwL
3 Creepin' (with The Weeknd & 21 Savage) Metro Boomin, The Weeknd, 21 Savage 1337320 spotify:track:2dHHgzDwk4BJdRwy9uXhTO
... ... ... ... ...
198 Major Distribution Drake, 21 Savage 266986 spotify:track:46s57QULU02Voy0Kup6UEb
199 Sun to Me Zach Bryan 266968 spotify:track:1SjsVdSXpwm1kTdYEHoPIT
200 The Real Slim Shady Eminem 266698 spotify:track:3yfqSUWxFvZELEM4PmlwIR

200 rows × 4 columns

In [32]:
favorite_song = 'Bejeweled'
In [33]:
song_uri = (charts
            [charts.get('track_name') == favorite_song]
            .get('uri')
            .iloc[0])
song_uri
Out[33]:
'spotify:track:3qoftcUZaUOncvIYjFSPdE'

Watch what happens! 🎶

In [34]:
show_spotify(song_uri)

Try it out yourself!

Bad visualizations¶

  • As mentioned earlier, visualizations allow us to easily spot trends and communicate our results with others.
  • Some visualizations make it more difficult to see the trend in data, by:
    • Adding "chart junk."
    • Using misleading axes and sizes.

Summary¶

  • Visualizations make it easy to extract patterns from datasets.
  • There are two main types of variables: categorical and numerical.
  • The types of the variables we're visualizing inform our choice of which type of visualization to use.
  • Today, we looked at scatter plots, line plots, and bar charts.
  • Next time: Histograms and overlaid plots.