Standard Units, Correlation, Regression
Concept
We use regression to make predictions about the data (based on the correlation between two variables in standard units).
Association: Any relationship or link between two variables in a scatter plot.
- Positive association: as one variable increases, the other tends to increase.
- Negative association: as one variable increases, the other tends to decrease.
Correlation coefficient $r$: The correlation coefficient, $r$, of two variables $x$ and $y$ measures the strength of the linear association between them (how tightly the points cluster around a straight line).
- $r$ is always between -1 and 1.
Formulas
Standard Units
Standardize your units to compare two variables with different units (e.g. height and weight).
$$x_{i \, (su)} = \frac{x_i - \text{mean of } x}{\text{SD of } x}$$
- $x_i$ = value (in original units) from column $x$.
- $x_{i \, (su)}$ = value of $x_i$ converted to standard units.
import numpy as np

def standard_units(col):
    """
    Standardizes the units of a column.
    """
    return (col - col.mean()) / np.std(col)
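A quick way to see what standardizing does is to check that a converted column has mean 0 and SD 1. The sketch below assumes plain NumPy arrays in place of DataFrame columns, and the height and weight values are made up for illustration.

```python
import numpy as np

def standard_units(col):
    """Convert an array of numbers to standard units."""
    return (col - col.mean()) / np.std(col)

# Hypothetical data: heights in cm and weights in kg (not from the notes).
heights = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
weights = np.array([50.0, 58.0, 66.0, 74.0, 82.0])

heights_su = standard_units(heights)
weights_su = standard_units(weights)

# After standardizing, each column has mean 0 and SD 1, whatever its original units.
print("mean:", heights_su.mean(), "SD:", np.std(heights_su))

# These two made-up columns are evenly spaced, so they match exactly in standard units.
print(np.allclose(heights_su, weights_su))
```

Because both columns end up on the same unitless scale, they can be compared or multiplied directly.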
Regression Line
A line used to make predictions about the value of $y$ based on the correlation coefficient and the value of $x$.
- Both variables are measured in standard units.
- Predicts that $y$ will be closer to its average than $x$ is to its own (in standard units, whenever $|r| < 1$): the regression to the mean effect.
$$\text{predicted } y_{(su)} = r \cdot x_{(su)}$$
- $x_{(su)}$ = value of $x$ converted to standard units.
- $r$ = correlation coefficient, the strength of the linear association between $x$ and $y$.
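The regression to the mean effect follows from multiplying by $r$: since $r$ is between -1 and 1, a prediction in standard units is never farther from 0 (the average) than the input. A minimal sketch, where the value of r and the inputs are hypothetical:

```python
import numpy as np

r = 0.5  # hypothetical correlation coefficient
x_su = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])  # hypothetical x values, already in standard units

# Regression line in standard units: predicted y = r * x
predicted_y_su = r * x_su

# Each prediction is at most as far from the average (0) as its input.
print(predicted_y_su)
print(np.all(np.abs(predicted_y_su) <= np.abs(x_su)))
```

With r = 0.5, an input 2 SDs above average gives a prediction only 1 SD above average.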
def calculate_r(df, x, y):
    """
    Returns the average value of the product of x and y,
    when both are measured in standard units.
    """
    x_su = standard_units(df.get(x))
    y_su = standard_units(df.get(y))
    return (x_su * y_su).mean()
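As a sanity check, the "mean of the product in standard units" definition can be compared with NumPy's built-in `np.corrcoef`. This sketch works on plain arrays rather than a DataFrame; the helper name `correlation` and the data are assumptions made for illustration.

```python
import numpy as np

def standard_units(values):
    """Convert an array of numbers to standard units."""
    return (values - values.mean()) / np.std(values)

def correlation(x, y):
    """Mean of the product of x and y, both in standard units (array-based variant)."""
    return np.mean(standard_units(x) * standard_units(y))

# Hypothetical ages (years) and weights (lbs).
ages = np.array([1.0, 2.0, 3.0, 5.0, 8.0, 13.0])
weights = np.array([4.0, 9.5, 7.0, 12.0, 30.0, 22.0])

r = correlation(ages, weights)
r_numpy = np.corrcoef(ages, weights)[0, 1]  # off-diagonal entry is r between the two arrays
print(r, r_numpy)

# Points exactly on a line give r = 1 (upward slope) or r = -1 (downward slope).
r_perfect_up = correlation(ages, 2 * ages + 3)
r_perfect_down = correlation(ages, -ages)
print(r_perfect_up, r_perfect_down)
```

The two computations should agree up to floating-point error, and the last two values show the extremes of the range of r.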
Converting to Original Units
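Finding the slope and intercept of the regression line in original units.
$$\frac{\text{predicted } y - \text{mean of } y}{\text{SD of } y} = r \cdot \frac{x - \text{mean of } x}{\text{SD of } x}$$
Re-arranged to the form $\text{predicted } y = m x + b$:
$$m = r \cdot \frac{\text{SD of } y}{\text{SD of } x}, \qquad b = \text{mean of } y - m \cdot \text{mean of } x$$
- $r$, mean of $x$, mean of $y$, SD of $x$, and SD of $y$ are constants.
- If you have a DataFrame with two columns, you can determine all 5 values.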
def slope(df, x, y):
    """
    Returns the slope of the regression line between columns x and y in df (in original units).
    """
    r = calculate_r(df, x, y)
    return r * np.std(df.get(y)) / np.std(df.get(x))

def intercept(df, x, y):
    """
    Returns the intercept of the regression line between columns x and y in df (in original units).
    """
    return df.get(y).mean() - slope(df, x, y) * df.get(x).mean()
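The slope and intercept built from r, the means, and the SDs are the same as those of the least-squares line, so they can be cross-checked against `np.polyfit`. The sketch below uses plain arrays; the helper `slope_intercept` and the data are hypothetical.

```python
import numpy as np

def standard_units(values):
    """Convert an array of numbers to standard units."""
    return (values - values.mean()) / np.std(values)

def slope_intercept(x, y):
    """Slope and intercept of the regression line in original units (array-based variant)."""
    r = np.mean(standard_units(x) * standard_units(y))
    m = r * np.std(y) / np.std(x)      # slope = r * SD of y / SD of x
    b = np.mean(y) - m * np.mean(x)    # intercept = mean of y - slope * mean of x
    return m, b

# Hypothetical ages (years) and weights (lbs).
ages = np.array([1.0, 2.0, 3.0, 5.0, 8.0, 13.0])
weights = np.array([4.0, 9.5, 7.0, 12.0, 30.0, 22.0])

m, b = slope_intercept(ages, weights)

# np.polyfit with deg=1 returns the least-squares [slope, intercept].
m_ls, b_ls = np.polyfit(ages, weights, deg=1)
print(m, b)
print(m_ls, b_ls)
```

Both pairs should match up to floating-point error.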
Code Example
Predicting pet weight using the regression line of the Age and Weight columns.
Method 1: Using SD and Mean
Convert Age values into standard units, find SD and mean of Weight.
x_su = standard_units(full_pets.get('Age')) # series of floats ('Age' values in standard units)
y_sd = np.std(full_pets.get('Weight'))
y_mean = full_pets.get('Weight').mean()
print("SD of y:", y_sd)
print("Mean of y:", y_mean)
Plug into $\text{predicted } y_{(su)} = r \cdot x_{(su)}$ and convert to original units.
def predict_weight():
    # Predicts the weight of every pet in full_pets from its 'Age' value.
    r = calculate_r(full_pets, 'Age', 'Weight')
    predicted_y_su = r * x_su
    return predicted_y_su * y_sd + y_mean
This function returns a Series of predicted Weight values, one for each row of full_pets.
Method 2: Slope-intercept Form
Calculate the correlation coefficient, slope, and intercept of the regression line.
r = calculate_r(full_pets, 'Age', 'Weight')
m = slope(full_pets, 'Age', 'Weight')
b = intercept(full_pets, 'Age', 'Weight')
print("Correlation coefficient (r):", np.round(r, 3))
print("Slope of regression line:", np.round(m, 3))
print("Intercept of regression line:", np.round(b, 3))
Correlation coefficient (r): 0.134
Slope of regression line: 1.162
Intercept of regression line: 18.704
def predict_weight2(age):
    # Predicts the weight of a pet that is 'age' years old.
    return m * age + b
Apply function to Age values for an array of predicted Weight values:
all_predictions = np.array([])
for age in full_pets.get('Age').values:
    all_predictions = np.append(all_predictions, predict_weight2(age))
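Since `predict_weight2` only multiplies and adds, it also works on a whole NumPy array at once, which avoids the loop. A small sketch comparing the two approaches: m and b are the rounded values from the printed output above, and the ages are hypothetical.

```python
import numpy as np

m, b = 1.162, 18.704  # rounded slope and intercept from the printed output above

def predict_weight2(age):
    # Predicts the weight of a pet that is 'age' years old.
    return m * age + b

ages = np.array([0.5, 2.0, 7.0, 12.0])  # hypothetical ages, in years

# Loop version: one prediction appended at a time.
loop_predictions = np.array([])
for age in ages:
    loop_predictions = np.append(loop_predictions, predict_weight2(age))

# Vectorized version: the arithmetic is applied to every element of the array.
vectorized_predictions = predict_weight2(ages)

print(np.allclose(loop_predictions, vectorized_predictions))
```

Both versions give the same predictions; the vectorized call is shorter and avoids repeatedly copying the array with `np.append`.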
Plot the regression line:
import matplotlib.pyplot as plt

plt.scatter(x=full_pets.get('Age'), y=full_pets.get('Weight'))
plt.plot(full_pets.get('Age'), all_predictions, color='red')
# or plt.plot(full_pets.get('Age'), predict_weight(), color='red')
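Method 1 and Method 2 describe the same line, so they give identical predictions. The sketch below checks this on a small made-up dataset, using plain arrays instead of the full_pets DataFrame.

```python
import numpy as np

def standard_units(values):
    """Convert an array of numbers to standard units."""
    return (values - values.mean()) / np.std(values)

# Hypothetical ages (years) and weights (lbs), for illustration only.
ages = np.array([1.0, 3.0, 4.0, 8.0, 10.0])
weights = np.array([6.0, 20.0, 11.0, 35.0, 28.0])

r = np.mean(standard_units(ages) * standard_units(weights))

# Method 1: predict in standard units, then convert back to original units.
method1 = r * standard_units(ages) * np.std(weights) + weights.mean()

# Method 2: slope-intercept form in original units.
m = r * np.std(weights) / np.std(ages)
b = weights.mean() - m * ages.mean()
method2 = m * ages + b

print(np.allclose(method1, method2))
```

This is why either set of predictions can be passed to `plt.plot` to draw the regression line.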
Residuals
If there is no pattern in a residual plot (patternless "cloud"), the regression line is a good linear fit.
Errors: Difference between the actual and predicted values.
- Any set of predictions has errors.
Residuals: Errors when using a regression line.
- There is one residual corresponding to each data point in the dataset.
Residual plots: A scatter plot with the $x$ variable on the $x$-axis and residuals on the $y$-axis.
- Residual plots describe how the error in the regression line's predictions varies.
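- The correlation coefficient alone does not tell the full story: a high $r$ can still hide a nonlinear pattern that only the residual plot reveals.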
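To make this concrete, the sketch below fits a regression line to made-up data that follows a curve (y = x squared). The correlation is high, yet the residuals are positive at both ends and negative in the middle, which is the kind of pattern a residual plot exposes. The helper names and data are assumptions for illustration.

```python
import numpy as np

def fit_line(x, y):
    """Slope and intercept of the regression line in original units."""
    r = np.corrcoef(x, y)[0, 1]
    m = r * np.std(y) / np.std(x)
    b = np.mean(y) - m * np.mean(x)
    return m, b

def residuals(x, y):
    """Actual y minus the regression line's prediction, one value per data point."""
    m, b = fit_line(x, y)
    return y - (m * x + b)

# Hypothetical nonlinear data: y is exactly x squared.
x = np.arange(1, 11, dtype=float)
y_curved = x ** 2

r = np.corrcoef(x, y_curved)[0, 1]
res = residuals(x, y_curved)

print("r:", np.round(r, 3))       # close to 1, despite the curve
print("residuals:", np.round(res, 3))  # U-shaped: positive, then negative, then positive
print("mean residual:", np.round(res.mean(), 10))
```

Passing `x` and `res` to `plt.scatter` would draw the residual plot; here it would show a clear U shape rather than a patternless cloud, so a line is not a good fit even though r is large.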