Week 4 – Least Squares, Regression, and Correlation

Lecture (January 31st) 👨‍🏫

Readings 📖

Rose, How the Idea of a ‘Normal’ Person Got Invented
Clayton, How Eugenics Shaped Statistics
Sonabend, Statistics, Eugenics, and Me
Rosenfeld, The Invention of Correlation

Optional:

YouTube: Least Squares as a Maximum Likelihood estimator (highly recommended)
MathIsFun, Quincunx Explained
Galton, Kinship and Correlation
Rehmeyer, Darwin: The reluctant mathematician
PSU, Maximum Likelihood Estimation
The Normal Share of Paupers

Homework 4 (due Sunday, February 6th at 11:59PM) (solutions) 📝

Submit your answers as a PDF to Gradescope by the due date for full credit. We encourage you to discuss the readings and questions with others in the course, but all work must be your own. Remember to use Campuswire if you need guidance!

Homework 4 will be finalized by Tuesday.

Question 0

The Data Science Student Representatives created a survey for you to voice your opinion about what you love in DSC 90, and how we can improve the class and the department!

Feel free to talk about the topics covered in this course, the quality of the lectures, homeworks, and readings, and anything else you feel is relevant and constructive. As you know, this is a brand-new class, and we’d really appreciate any constructive feedback.

Please click here to complete the survey. If you aren’t able to access it, make sure you’re logged into your UCSD Google account.

Question 1

Karl Pearson, one of Galton’s disciples and collaborators, created a journal that is today known as the Annals of Human Genetics.

What was the journal originally known as?
What was the subtitle of the journal? What is the significance of that subtitle?
Why was the name of the journal eventually changed?

Question 2

This question is contained within a Jupyter Notebook, which is linked here. All of your answers (including screenshots of your code) should end up in your submitted PDF; you will not be submitting this notebook anywhere.

Question 3

This question introduces a bit of background that will be helpful in the coming week.

Recall that in lecture we considered an example where we flipped a coin 10 times and saw the sequence HTTHTTTTTH. For an arbitrary bias $p$ (the probability of heads on each flip), the probability of that exact sequence is $p^3 (1-p)^7$. However, $p^3 (1-p)^7$ is not the probability of seeing 3 heads and 7 tails in any order. To compute that, we’d need to consider all of the different orders in which we could see 3 heads and 7 tails – for example, HHHTTTTTTT, HTHTHTTTTT, etc. Each of these orderings has the same probability, $p^3 (1-p)^7$.

So, $P(\text{3 heads, 7 tails}) = (\text{number of ways of flipping 3 heads and 7 tails}) \cdot p^3 (1-p)^7$. As you will learn in DSC 40A (if you haven’t already), the number of ways of flipping 3 heads and 7 tails is ${10 \choose 3}$, pronounced “10 choose 3”. If you’re not familiar with this notation, watch this video and this video.
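
As a sanity check on that count, here is a minimal Python sketch (not part of the assignment) that enumerates all $2^{10}$ possible sequences of 10 flips by brute force and counts those with exactly 3 heads. The variable names are illustrative, and the comparison uses math.comb from the Python standard library (Python 3.8+), which is assumed here as a stand-in for a function that computes ${n \choose k}$.

```python
from itertools import product
from math import comb

# Enumerate every possible sequence of 10 flips (2**10 = 1024 in total).
all_sequences = product("HT", repeat=10)

# Keep only the sequences with exactly 3 heads (and therefore 7 tails).
num_orderings = sum(1 for seq in all_sequences if seq.count("H") == 3)

print(num_orderings)  # 120
print(comb(10, 3))    # 120
```

Brute force is only practical for small numbers of flips, since the number of sequences doubles with every flip; the “choose” formula gives the same count directly.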

The general probability distribution we’ve discussed here is called the binomial distribution, which determines the probability of seeing $k$ successes in $n$ trials of an experiment in which each trial succeeds with probability $p$, independent of all other trials. It says that the probability of $k$ successes is

\[P(\text{$k$ successes}) = {n \choose k} p^k (1-p)^{n-k}\]

(Previously, $k = 3$, $n = 10$, and $p$ was unknown.)
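
The formula can be sketched in Python as follows. This is only an illustration: binomial_pmf is a hypothetical helper name, math.comb from the standard library is used to compute ${n \choose k}$, and the value $p = 0.5$ (a fair coin) is an assumption made purely for the example, since $p$ was unknown in the lecture setting.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials,
    where each trial succeeds with probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Assumed for illustration only: a fair coin, so p = 0.5.
# Probability of exactly 3 heads in 10 flips.
print(binomial_pmf(3, 10, 0.5))  # 0.1171875
```

Summing binomial_pmf(k, n, p) over every $k$ from 0 to $n$ should give 1 (up to floating-point rounding), which is a quick way to check that the function is behaving like a probability distribution.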

To make sure you’re comfortable with the idea, answer the following question:

Each time I call my grandma, she answers the phone with probability 0.6, independently of all other calls. I call her 5 times. Which is more likely – her answering exactly twice, or her answering exactly four times? Find the probability of each event. Write out both answers symbolically, and then use a calculator to evaluate them as decimals. If you’d like to use Python to evaluate the result as a decimal, the function comb(n, k) in the Python package scipy.special calculates ${n \choose k}$.