Regression to the mean¶
Let’s take a short break from regression modeling to look at a related concept called “regression to the mean”. Regression to the mean is a property of the data, not a type of model, but linear regression can be used to quantify its effect.
The concept¶
We already saw that each response value in our dataset is equal to the sum of a fitted value, that is, the prediction made by the model, and a residual, which is how much the model missed by. Loosely speaking, the fitted value is the part of the response our model can explain, and the residual is the part it cannot. There are two possible reasons for a residual. First, it could simply be that our model isn’t great; particularly in the case of simple linear regression, where we only have one explanatory variable, there is often room for improvement. However, it usually isn’t possible, or even desirable, to have a perfect model, because the world contains a lot of randomness, and our model shouldn’t capture that. In particular, extreme responses are often due to randomness or luck. That means extremes don’t persist over time, because eventually the luck runs out. This is the concept of regression to the mean: eventually, extreme cases will look more like average cases.
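To make that decomposition concrete, here is a minimal sketch on made-up synthetic data (the variables x and y and the simulated noise are purely illustrative, not part of the dataset below): adding a fitted statsmodels model’s fittedvalues and resid recovers the response exactly.
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
# Illustrative synthetic data: y depends linearly on x, plus random noise
rng = np.random.default_rng(42)
toy = pd.DataFrame({'x': np.linspace(0, 10, 50)})
toy['y'] = 2 + 3 * toy['x'] + rng.normal(scale=2, size=50)
toy_model = ols('y ~ x', data=toy).fit()
# Each response value equals its fitted value plus its residual
print(np.allclose(toy_model.fittedvalues + toy_model.resid, toy['y']))  # True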
Pearson’s father-son dataset¶
Here’s a classic dataset on the heights of fathers and their sons, collected by Karl Pearson, the statistician after whom the Pearson correlation coefficient is named. The dataset consists of over a thousand pairs of heights, and was collected as part of a nineteenth century scientific work on biological inheritance. It lets us answer the questions “do tall fathers have tall sons?” and “do short fathers have short sons?”.
Adapted from:
https://www.kaggle.com/datasets/abhilash04/fathersandsonheight/
https://rdrr.io/cran/UsingR/man/father.son.html
https://www.rdocumentation.org/packages/UsingR/topics/father.son
The table below gives the heights of fathers and their sons, based on a famous experiment by Karl Pearson around 1903. The number of cases is 1078. Random noise was added to the original data, to produce heights to the nearest 0.1 inch.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['figure.figsize'] = (12,8)
sns.set_style('whitegrid')
# Locate data
file = 'Pearson.txt'
# read the data
df = pd.read_csv(file, sep='\t')
print(df.head())
   Father   Son
0    65.0  59.8
1    63.3  63.2
2    65.0  63.3
3    65.8  62.8
4    61.1  64.3
Scatter plot¶
Here’s a scatter plot of the sons’ heights versus the fathers’ heights. I’ve added a line where the son’s and father’s heights are equal, using plt.axline. The xy1 argument gives a point the line passes through and slope sets its slope, while the linewidth and color arguments help it stand out. I also called plt.axis with the 'equal' argument so that one inch on the x-axis appears the same as one inch on the y-axis. If sons always had the same height as their fathers, all the points would lie on this red line.
fig = plt.figure()
sns.scatterplot(y = 'Son'
,x = 'Father'
,data = df)
plt.axline(xy1=(78,78) # added a line where the son's and father's heights are equal
,slope=1
,linewidth=2
,color='red')
plt.ylabel('Son\'s Heights (in)')
plt.xlabel('Father\'s Heights (in)')
plt.title('Son\'s Heights v. Father\'s Heights')
plt.axis('equal') # so that one inch on the x-axis appears the same as one inch on the y-axis
plt.show()

# jointplot creates its own figure, so plt.figure() is not needed here
sns.jointplot(y = 'Son'
,x = 'Father'
,data = df
,kind = 'reg'
,ci = None
,joint_kws={'line_kws':{'color':'cyan'}} # Only the regression line is cyan
)
plt.suptitle('Son\'s Heights v. Father\'s Heights')
plt.show()

Adding a regression line¶
Here the heights are converted from inches to centimeters, and a black linear regression line is added to the plot using regplot. You can see that the regression line isn’t as steep as the red line of equality. On the left of the plot, the black line is above the red line, suggesting that, on average, very short fathers have sons taller than themselves. On the far right of the plot, the black line is below the red line, suggesting that, on average, very tall fathers have sons shorter than themselves.
# Convert the heights from inches to centimeters
pearson_cm = pd.DataFrame()
pearson_cm['son_hght_cm'] = df['Son']*2.54
pearson_cm['father_hght_cm'] = df['Father']*2.54
fig = plt.figure()
sns.regplot(y = 'son_hght_cm'
,x = 'father_hght_cm'
,data = pearson_cm
,ci = None
,line_kws = {'color':'black'} # Only regression black
)
plt.axline(xy1=(150,150) # added a line where the son's and father's heights are equal
,slope=1
,linewidth=2
,color='red')
plt.ylabel('Son\'s Heights (cm)')
plt.xlabel('Father\'s Heights (cm)')
plt.title('Son\'s Heights v. Father\'s Heights')
plt.axis('equal')
plt.show()

Running a regression¶
Running a regression model quantifies how much taller or shorter we predict the sons to be. Here, the sons’ heights are the response variable, and the fathers’ heights are the explanatory variable.
from statsmodels.formula.api import ols
mdl_son_vs_father = ols('son_hght_cm ~ father_hght_cm', data = pearson_cm).fit()
print(mdl_son_vs_father.params)
Intercept         86.087713
father_hght_cm     0.514006
dtype: float64
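To see the pull toward the average in these numbers: the slope is roughly 0.51, so each extra centimeter of father’s height predicts only about half a centimeter of extra son’s height. As a quick illustrative aside (not part of the original analysis), the height at which the predicted son’s height equals the father’s height solves h = intercept + slope * h.
# Height at which the prediction equals the father's height: h = intercept / (1 - slope)
intercept = mdl_son_vs_father.params['Intercept']
slope = mdl_son_vs_father.params['father_hght_cm']
print(intercept / (1 - slope))  # roughly 177 cm
Fathers taller than this are predicted to have sons shorter than themselves, and fathers shorter than this are predicted to have sons taller than themselves, which is the regression to the mean visible in the plot above.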
Making predictions¶
Now we can make predictions. Consider the case of a really tall father, at one hundred and ninety centimeters. At least, that was really tall in the late nineteenth century. The predicted height of the son is about one hundred and eighty-four centimeters. Tall, but not quite as tall as his dad. Similarly, the prediction for a one hundred and fifty-centimeter father is about one hundred and sixty-three centimeters. Short, but not quite as short as his dad. In both cases, the extreme value became less extreme in the next generation: a perfect example of regression to the mean.
# Explanatory values to predict from: a short, a tall, and a very tall father
father_data = pd.DataFrame({'father_hght_cm': [150, 190, 200]})
print(father_data)
   father_hght_cm
0             150
1             190
2             200
print(mdl_son_vs_father.predict(father_data))
0    163.188600
1    183.748837
2    188.888896
dtype: float64
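As a sanity check, the same predictions can be reproduced by hand from the fitted parameters, since the model is just the straight line predicted son’s height = intercept + slope × father’s height. A minimal sketch:
# Reproduce the predictions manually from the fitted coefficients
params = mdl_son_vs_father.params
print(params['Intercept'] + params['father_hght_cm'] * father_data['father_hght_cm'])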