Regression to the Mean w/ S&P500 Yearly Returns

Predictions and model objects

This exercise builds your regression skills: you’ll get hands-on with model objects, see the concept of “regression to the mean” in action, and learn how to transform variables in a dataset.

Plotting consecutive portfolio returns

Regression to the mean is also an important concept in investing. Here you’ll look at the annual returns in 2018 and 2019 from investing in companies in the Standard & Poor's 500 index (S&P 500).

The sp500_yearly_returns dataset contains three columns:

variable     meaning
symbol       Stock ticker symbol uniquely identifying the company.
return_2018  A measure of investment performance in 2018.
return_2019  A measure of investment performance in 2019.

A positive number for the return means the investment increased in value; a negative number means it lost value.

Just as with baseball home runs, a naive prediction might be that investment performance stays the same from year to year, so that every point lies on the line y = x.

In [2]:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from statsmodels.formula.api import ols
import seaborn as sns

filename = 'sp500_yearly_returns.csv'
sp500_yearly_returns = pd.read_csv(filename, index_col=0)

print(sp500_yearly_returns.head())

  symbol  return_2018  return_2019
0   AAPL    -0.053902     0.889578
1   MSFT     0.207953     0.575581
2   AMZN     0.284317     0.230278
3     FB    -0.257112     0.565718
4  GOOGL    -0.008012     0.281762
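
To make the naive y = x prediction concrete, the sketch below applies it to just the five rows printed above. It uses plain Python so it runs without the CSV file; the names head_rows, naive_errors, mean_abs_error, and n_above_line are introduced here for illustration only. Five rows are far too few to draw conclusions from, so treat this as an illustration of the idea, not a result.

```python
# The five rows printed by sp500_yearly_returns.head() above:
# (symbol, return_2018, return_2019)
head_rows = [
    ("AAPL", -0.053902, 0.889578),
    ("MSFT", 0.207953, 0.575581),
    ("AMZN", 0.284317, 0.230278),
    ("FB", -0.257112, 0.565718),
    ("GOOGL", -0.008012, 0.281762),
]

# Naive prediction: return_2019 equals return_2018 (the y = x line).
# The error is the actual 2019 return minus that naive prediction.
naive_errors = {symbol: r2019 - r2018 for symbol, r2018, r2019 in head_rows}

# Average size of the miss, and how many points sit above the y = x line
mean_abs_error = sum(abs(e) for e in naive_errors.values()) / len(naive_errors)
n_above_line = sum(e > 0 for e in naive_errors.values())

for symbol, err in naive_errors.items():
    print(f"{symbol:<6} actual - naive = {err:+.6f}")
print(f"mean absolute error of the naive prediction: {mean_abs_error:.4f}")
print(f"points above the y = x line: {n_above_line} of {len(naive_errors)}")
```

A point above the y = x line is a stock that did better in 2019 than the naive prediction suggested; a point below did worse.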

In [8]:

# Increase the default figure size in Jupyter
# (in this setup the cell had to be run twice before the new size took effect)
plt.rcParams["figure.figsize"] = (18,12)

In [10]:

# Create a new figure, fig
fig = plt.figure()

# Plot the first layer: y = x
plt.axline(xy1=(0,0), slope=1, linewidth=2, color="green")

# Add scatter plot with linear regression trend line
sns.regplot(data=sp500_yearly_returns,
            x='return_2018',
            y='return_2019',
            ci=None)

# Set the axes so that the distances along the x and y axes look the same
plt.axis('equal')

# Show the plot
plt.show()

Modeling consecutive returns

Let’s quantify the relationship between returns in 2019 and 2018 by running a linear regression and making predictions. By looking at companies with extremely high or extremely low returns in 2018, we can see if their performance was similar in 2019.

In [11]:

# Run a linear regression on return_2019 vs. return_2018 using sp500_yearly_returns
mdl_returns = ols('return_2019 ~ return_2018', data = sp500_yearly_returns).fit()

# Print the parameters
print(mdl_returns.params)

Intercept      0.321321
return_2018    0.020069
dtype: float64

In [14]:

# Re-fit the same model as in the previous cell so this cell can run on its own
mdl_returns = ols("return_2019 ~ return_2018", data=sp500_yearly_returns).fit()

# Create a DataFrame with return_2018 at -1, 0, and 1 
explanatory_data = pd.DataFrame({'return_2018':[-1,0,1]})

# Use mdl_returns to predict with explanatory_data
print(explanatory_data.assign(return_2019 = mdl_returns.predict(explanatory_data)))

   return_2018  return_2019
0           -1     0.301251
1            0     0.321321
2            1     0.341390
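
The predictions above are nothing more than the fitted straight line evaluated at each return_2018 value. As a minimal sketch, the snippet below recomputes them by hand from the rounded coefficients printed by mdl_returns.params; the function predict_return_2019 and the dictionary manual_predictions are hypothetical names used only here. Because the coefficients are rounded to six decimals, the results can differ from the predict() output in the last digit.

```python
# Coefficients copied (rounded to six decimals) from the mdl_returns.params output above
intercept = 0.321321
slope = 0.020069


def predict_return_2019(return_2018):
    """Evaluate the fitted line: intercept + slope * return_2018."""
    return intercept + slope * return_2018


# Same explanatory values as in explanatory_data: -1, 0, and 1
manual_predictions = {x: predict_return_2019(x) for x in (-1, 0, 1)}

for x, y in manual_predictions.items():
    print(f"return_2018 = {x:>2}  ->  predicted return_2019 = {y:.6f}")
```

With a slope this close to zero, moving return_2018 across its whole range from -1 to 1 shifts the prediction by only twice the slope.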

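Incredible investment predictions! The slope is only about 0.02, so a stock's 2018 return says almost nothing about its 2019 return: the predicted 2019 return stays between roughly 0.30 and 0.34 across the whole range of 2018 returns from -1 to 1. Investments that gained a lot in value in 2018 are not predicted to repeat that gain in 2019, and investments that lost a lot of value in 2018 are not predicted to repeat the loss. On average, both groups end up close to the overall average return, which is regression to the mean.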
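
Why would extreme years fail to repeat? One common way to picture it is to treat each year's return as a persistent component plus year-specific luck. The sketch below simulates that idea with synthetic data in plain Python; it does not use the S&P 500 dataset, and every number in it (sample size, means, standard deviations) is an assumption chosen for illustration, as are the names skill, year1, year2, and ols_slope_intercept.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

n = 2000  # number of synthetic "stocks" (assumed, not from the dataset)

# Assumed model: a persistent component per stock plus independent yearly luck
skill = [random.gauss(0.10, 0.15) for _ in range(n)]
year1 = [s + random.gauss(0.0, 0.30) for s in skill]
year2 = [s + random.gauss(0.0, 0.30) for s in skill]


def mean(values):
    return sum(values) / len(values)


def ols_slope_intercept(x, y):
    """Least-squares fit of y = intercept + slope * x."""
    mean_x, mean_y = mean(x), mean(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = ols_slope_intercept(year1, year2)

# Compare the best and worst 10% of year-1 performers across the two years
order = sorted(range(n), key=lambda i: year1[i])
k = n // 10
bottom, top = order[:k], order[-k:]

top_year1 = mean([year1[i] for i in top])
top_year2 = mean([year2[i] for i in top])
bottom_year1 = mean([year1[i] for i in bottom])
bottom_year2 = mean([year2[i] for i in bottom])

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
print(f"top 10% in year 1:    year 1 mean {top_year1:.3f}, year 2 mean {top_year2:.3f}")
print(f"bottom 10% in year 1: year 1 mean {bottom_year1:.3f}, year 2 mean {bottom_year2:.3f}")
```

Under these assumptions the fitted slope falls well below 1, and both extreme groups move back toward the overall average in the second year. The larger the luck component relative to the persistent one, the closer the slope gets to zero.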