Lecture 6 – Data Visualization 📈

DSC 10, Spring 2023

Announcements

Don't forget about these resources!

Agenda

Aside: Keyboard shortcuts

There are several keyboard shortcuts built into Jupyter Notebooks designed to help you save time. To see them, either click the keyboard button in the toolbar above or hit the H key on your keyboard (as long as you're not actively editing a cell).

Particularly useful shortcuts:

Action | Keyboard shortcut
Run cell + jump to next cell | SHIFT + ENTER
Save the notebook | CTRL/CMD + S
Create new cell above/below | A/B
Delete cell | DD

Recap: GroupBy

Run the cell below to load in the requests DataFrame from last class.

Which neighborhood had the most requests?

Example: Number of different services

How do we find the number of different services requested in each neighborhood?

As always when using groupby, there are two steps:

  1. Choose a column to group by.
    • Here, 'neighborhood' seems like a good choice.
  2. Choose an aggregation method.
    • Common aggregation methods include .count(), .sum(), .mean(), .median(), .max(), and .min().
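
The two steps above can be sketched on a small made-up table. This assumes `requests` has one row per combination of neighborhood and service; the column names `'service'` and `'total'` and every value below are hypothetical, and plain pandas stands in for the course's library. Under that assumption, counting rows per group gives the number of different services in each neighborhood, while summing `'total'` gives the number of requests.

```python
import pandas as pd

# Hypothetical data: one row per (neighborhood, service) pair.
requests = pd.DataFrame({
    'neighborhood': ['Downtown', 'Downtown', 'Downtown', 'La Jolla', 'La Jolla'],
    'service': ['Pothole', 'Graffiti', 'Street Light', 'Pothole', 'Graffiti'],
    'total': [120, 45, 30, 60, 10],
})

# Step 1: group by 'neighborhood'. Step 2: aggregate with .count().
# Every remaining column now holds the number of rows in each group.
services_per_neighborhood = requests.groupby('neighborhood').count()

# Same grouping, different aggregation: total requests per neighborhood.
# Only the numerical column is kept so that .sum() adds numbers, not strings.
totals = requests.get(['neighborhood', 'total']).groupby('neighborhood').sum()
```

Note that in `services_per_neighborhood`, the `'service'` and `'total'` columns contain identical counts, so their labels no longer describe their contents.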

Observation #4

After aggregating with .count(), the column names in the output of .groupby are misleading: every column holds a count of the entries in each group, not the kind of value its name suggests.

Consider dropping unneeded columns and renaming columns as follows:

  1. Use .assign to create a new column containing the same values as the old column(s).
  2. Use .drop(columns=list_of_column_labels) to drop the old column(s). Alternatively, use .get(list_of_column_labels) to keep only the columns in the given list. The columns will appear in the order you specify, so this is also useful for reordering columns!
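
A minimal sketch of this renaming pattern, assuming a small made-up result of a `.count()` aggregation (the column names `'service'`, `'total'`, and `'num_services'` are hypothetical, and plain pandas stands in for the course's library):

```python
import pandas as pd

# Hypothetical output of a groupby followed by .count().
counts = pd.DataFrame(
    {'service': [3, 2], 'total': [3, 2]},
    index=pd.Index(['Downtown', 'La Jolla'], name='neighborhood'),
)

# Step 1: copy the counts into a column with a meaningful name.
with_new = counts.assign(num_services=counts.get('service'))

# Step 2: drop the old, misleadingly named columns.
cleaned = with_new.drop(columns=['service', 'total'])

# Alternative to step 2: keep only the listed columns, in the listed order.
also_cleaned = with_new.get(['num_services'])
```

Both routes produce the same one-column table here; `.get` with a list tends to be shorter when only a few columns should survive.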

Why visualize?

Run these cells to load the Little Women data from Lecture 1.

Little Women

In Lecture 1, we were able to answer questions about the plot of Little Women without having to read the novel and without having to understand Python code. Some of those questions included:

We answered these questions from a data visualization alone!

Napoleon's March

"Probably the best statistical graphic ever drawn, this map by Charles Joseph Minard portrays the losses suffered by Napoleon's army in the Russian campaign of 1812." (source)

Terminology

Individuals and variables

Types of variables

There are two main types of variables: numerical and categorical.

Note that here, "variable" does not mean a variable in Python, but rather it means a column in a DataFrame.

Examples of numerical variables

Examples of categorical variables

Concept Check ✅ – Answer at cc.dsc10.com

Which of these is not a numerical variable?

A. Fuel economy in miles per gallon.

B. Number of quarters at UCSD.

C. College at UCSD (Sixth, Seventh, etc.).

D. Bank account number.

E. More than one of these are not numerical variables.

Types of visualizations

The type of visualization we create depends on the kinds of variables we're visualizing.

We may interchange the words "plot", "chart", and "graph"; they all mean the same thing.

Scatter plots

Dataset of 50 top-grossing actors

Column | Contents
'Actor' | Name of actor
'Total Gross' | Total gross domestic box office receipt, in millions of dollars, of all of the actor’s movies
'Number of Movies' | The number of movies the actor has been in
'Average per Movie' | Total gross divided by number of movies
'#1 Movie' | The highest grossing movie the actor has been in
'Gross' | Gross domestic box office receipt, in millions of dollars, of the actor’s #1 Movie

Scatter plots

What is the relationship between 'Number of Movies' and 'Total Gross'?
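
One way to explore a question like this is a scatter plot of one numerical column against another. The sketch below assumes a tiny made-up `actors` table (the actor names and all values are hypothetical, not taken from the real dataset), uses plain pandas in place of the course's library, and requires matplotlib to be installed.

```python
import matplotlib
matplotlib.use('Agg')  # draw off-screen so the sketch runs without a display
import pandas as pd

# Hypothetical actors and values, for illustration only.
actors = pd.DataFrame({
    'Actor': ['Actor A', 'Actor B', 'Actor C', 'Actor D', 'Actor E'],
    'Number of Movies': [10, 25, 40, 55, 70],
    'Total Gross': [900.0, 2100.0, 3000.0, 3900.0, 4600.0],
})

# Each row becomes one point: x from one numerical column, y from another.
ax = actors.plot(kind='scatter', x='Number of Movies', y='Total Gross')
```

Inside a Jupyter notebook, the two `matplotlib` lines can be left out, since the plot is displayed inline.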

Scatter plots

What is the relationship between 'Number of Movies' and 'Average per Movie'?

Note that in the above plot, there's a negative association and an outlier.

Who was in 60 or more movies?

Who is the outlier?

Whoever they are, they made very few, high-grossing movies.

Line plots 📉

Dataset aggregating movies by year

Column | Content
'Year' | Year
'Total Gross in Billions' | Total domestic box office gross, in billions of dollars, of all movies released
'Number of Movies' | Number of movies released
'#1 Movie' | Highest grossing movie

Line plots

How has the number of movies changed over time? 🤔

Line plots

Plotting tip

Zooming in

We can create a line plot of just 2000 onwards by querying movies_by_year before calling .plot.
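
A minimal sketch of the query step, assuming a tiny made-up stand-in for `movies_by_year` (the counts below are hypothetical, and plain pandas stands in for the course's library):

```python
import pandas as pd

# Hypothetical stand-in for movies_by_year; the counts are made up.
movies_by_year = pd.DataFrame({
    'Year': [1998, 1999, 2000, 2001, 2002],
    'Number of Movies': [500, 510, 520, 515, 530],
})

# Boolean query: keep only the rows where 'Year' is 2000 or later.
recent = movies_by_year[movies_by_year.get('Year') >= 2000]
```

Calling `recent.plot(kind='line', x='Year', y='Number of Movies')` would then draw only the zoomed-in range, since the rows before 2000 are no longer in the DataFrame being plotted.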

What do you think explains the declines around 2008 and 2020?

How did this affect total gross?

What was the top grossing movie of 2018?

Extra video on line plots

If you're curious how line plots work under the hood, watch this video we made a few quarters ago.

Bar charts 📊

Dataset of the top 200 songs in the US on Spotify as of Thursday (4/13/2023)

Downloaded from here – check it out!

Bar charts

How many streams do the top 10 songs have?

Bar charts

Aside: How many streams did The Weeknd's songs on the chart receive?

It seems like we're missing some popular songs...

How do we include songs with other artists, as well?

Answer: Using .str.contains.
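
To see why an exact match misses collaborations, here is a small sketch. The `charts` table, its column names (`'track_name'`, `'artist_names'`, `'streams'`), the other artist names, and the stream counts are all hypothetical, and plain pandas stands in for the course's library.

```python
import pandas as pd

# Hypothetical chart rows; column names and stream counts are made up.
charts = pd.DataFrame({
    'track_name': ['Song 1', 'Song 2', 'Song 3', 'Song 4'],
    'artist_names': ['The Weeknd', 'The Weeknd, Artist X',
                     'Artist Y', 'Artist Z, The Weeknd'],
    'streams': [100, 200, 300, 400],
})

# Exact equality misses songs where The Weeknd is one of several artists.
solo_only = charts[charts.get('artist_names') == 'The Weeknd']

# .str.contains keeps any row whose artist string includes the name.
all_weeknd = charts[charts.get('artist_names').str.contains('The Weeknd')]

solo_streams = solo_only.get('streams').sum()
total_streams = all_weeknd.get('streams').sum()
```

In this made-up table the exact match finds one song, while `.str.contains` finds three, so the two stream totals differ.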

Fun demo 🎵

Let's find the URI of a song we care about.

Watch what happens! 🎶

Try it out yourself!

Summary
