import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from statsmodels.formula.api import ols
import seaborn as sns
# Increase size of plot in jupyter
# (you will need to run the cell twice for the size change to take effect, not sure why)
plt.rcParams["figure.figsize"] = (18,12)
sns.set_style('whitegrid')
sns.set_palette("Set2")
filename = 'ad_conversion.csv'
ad_conversion = pd.read_csv(filename, index_col=0)
print(ad_conversion.head())
   spent_usd  n_impressions  n_clicks
0       1.43           7350         1
1       1.82          17861         2
2       1.25           4259         1
3       1.29           4133         1
4       4.77          15615         3
Plot is cramped¶
Let’s look at impressions versus spend. If we draw the standard plot, the majority of the points are crammed into the bottom-left of the plot, making it difficult to assess whether there is a good fit or not.
fig = plt.figure()
sns.regplot(data = ad_conversion
,x = 'spent_usd'
,y = 'n_impressions'
,ci = None
,scatter_kws=({'alpha':0.5}))
plt.show()

Square root vs square root¶
By transforming both variables with square roots, the data are spread more evenly across the plot, and the points follow the line fairly closely. Square roots are a common transformation when your data have a right-skewed distribution.
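As a quick aside (not part of the original workflow), we can quantify the right skew directly with pandas' skew() method: a value well above zero indicates right skew, and it should shrink after taking square roots.
# Optional check: skewness of spend before and after the square-root transform
print("skew of spent_usd:      ", ad_conversion['spent_usd'].skew())
print("skew of sqrt(spent_usd):", (ad_conversion['spent_usd'] ** 0.5).skew())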
ad_conversion['sqrt_spent_usd'] = ad_conversion['spent_usd'] ** 0.5
ad_conversion['sqrt_n_impressions'] = ad_conversion['n_impressions'] ** 0.5
fig = plt.figure()
sns.regplot(data = ad_conversion
,x = 'sqrt_spent_usd'
,y = 'sqrt_n_impressions'
,ci = None
,scatter_kws=({'alpha':0.5}))
plt.show()

Modeling and predicting¶
Running the model and creating the explanatory dataset are the same as usual, but notice the use of the transformed variables in the formula and DataFrame. I also included the untransformed spent_usd variable for reference. Prediction requires an extra step. Because we took the square root of the response variable (not just the explanatory variable), the predict function will predict the square root of the number of impressions. That means that we have to undo the square root by squaring the predicted responses. Undoing the transformation of the response is called back transformation.
mdl_sqrt_impress_v_sqrt_spent = ols('sqrt_n_impressions ~ sqrt_spent_usd', data = ad_conversion).fit()
print(mdl_sqrt_impress_v_sqrt_spent.params)
print()
explanatory_data = pd.DataFrame({'sqrt_spent_usd' : np.arange(0,26,5)})
prediction_data = explanatory_data.assign(sqrt_n_impressions =
mdl_sqrt_impress_v_sqrt_spent.predict(explanatory_data))
prediction_data['spent_usd'] = prediction_data['sqrt_spent_usd'] ** 2
prediction_data['n_impressions'] = prediction_data['sqrt_n_impressions'] ** 2
print(prediction_data.head())
Intercept         15.319713
sqrt_spent_usd    58.241687
dtype: float64

   sqrt_spent_usd  sqrt_n_impressions  spent_usd  n_impressions
0               0           15.319713          0   2.346936e+02
1               5          306.528147         25   9.395951e+04
2              10          597.736582        100   3.572890e+05
3              15          888.945016        225   7.902232e+05
4              20         1180.153450        400   1.392762e+06
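As a sanity check on the back transformation, here is a minimal sketch that predicts impressions for a single, arbitrary spend of 9 USD: the spend is square-rooted on the way into the model (because that is the scale the model was fit on), and the prediction is squared on the way out.
# Hypothetical single-point prediction: 9 USD of spend (illustrative value only)
single_spend = pd.DataFrame({'sqrt_spent_usd': [9 ** 0.5]})
sqrt_pred = mdl_sqrt_impress_v_sqrt_spent.predict(single_spend)
print(sqrt_pred ** 2)  # square the prediction to back-transform to impressions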
fig = plt.figure()
sns.regplot(data = ad_conversion
,x = 'sqrt_spent_usd'
,y = 'sqrt_n_impressions'
,ci = None
,scatter_kws=({'alpha':0.5}))
sns.scatterplot(data = prediction_data
,x = 'sqrt_spent_usd'
,y = 'sqrt_n_impressions'
,color = 'coral'
,s=250
,marker='s')
plt.show()
fig = plt.figure()
sns.regplot(data = ad_conversion
,x = 'spent_usd'
,y = 'n_impressions'
,ci = None
,scatter_kws=({'alpha':0.5}))
sns.scatterplot(data = prediction_data
,x = 'spent_usd'
,y = 'n_impressions'
,color = 'coral'
,s=250
,marker='s')
plt.show()


Transforming the response variable too¶
The response variable can be transformed too, but this means you need an extra step at the end to undo that transformation. That is, you “back transform” the predictions.
The first step of the digital advertising workflow is spending money to buy ads and counting how many people see them (the “impressions”). The next step is determining how many people click on the advert after seeing it. In this exercise, we’ll do exactly that.
sns.set_palette("flare")
# Plot using the transformed variables
sns.regplot(data=ad_conversion,y='n_clicks', x='n_impressions', scatter_kws=({'alpha':0.5}), ci=None)
plt.title('n_clicks v. n_impressions')
plt.show()
# Create qdrt_n_impressions and qdrt_n_clicks
ad_conversion["qdrt_n_impressions"] = ad_conversion["n_impressions"] ** 0.25
ad_conversion["qdrt_n_clicks"] = ad_conversion["n_clicks"] ** 0.25
plt.figure()
# Plot using the transformed variables
sns.regplot(data=ad_conversion,
y='qdrt_n_clicks',
x='qdrt_n_impressions',
scatter_kws=({'alpha':0.5}),
ci=None)
plt.title('qdrt_n_clicks v. qdrt_n_impressions')
plt.show()


# Run a linear regression of your transformed variables
mdl_click_vs_impression_trans = ols('qdrt_n_clicks ~ qdrt_n_impressions', data = ad_conversion).fit()
explanatory_data = pd.DataFrame({"qdrt_n_impressions": np.arange(0, 3e6+1, 5e5) ** .25,
"n_impressions": np.arange(0, 3e6+1, 5e5)})
# Complete prediction_data
prediction_data = explanatory_data.assign(
qdrt_n_clicks = mdl_click_vs_impression_trans.predict(explanatory_data)
)
# Print the result
print(prediction_data)
   qdrt_n_impressions  n_impressions  qdrt_n_clicks
0            0.000000            0.0       0.071748
1           26.591479       500000.0       3.037576
2           31.622777      1000000.0       3.598732
3           34.996355      1500000.0       3.974998
4           37.606031      2000000.0       4.266063
5           39.763536      2500000.0       4.506696
6           41.617915      3000000.0       4.713520
Back transformation¶
In the previous exercise, we transformed the response variable, ran a regression, and made predictions. But we’re not done yet! In order to correctly interpret and visualize the predictions, we need to back-transform them.
# Back transform qdrt_n_clicks
prediction_data["n_clicks"] = prediction_data["qdrt_n_clicks"] ** 4
# Print the result
print(prediction_data)
   qdrt_n_impressions  n_impressions  qdrt_n_clicks    n_clicks
0            0.000000            0.0       0.071748    0.000026
1           26.591479       500000.0       3.037576   85.135121
2           31.622777      1000000.0       3.598732  167.725102
3           34.996355      1500000.0       3.974998  249.659131
4           37.606031      2000000.0       4.266063  331.214159
5           39.763536      2500000.0       4.506696  412.508546
6           41.617915      3000000.0       4.713520  493.607180
# Plot the transformed variables
fig = plt.figure()
sns.regplot(data=ad_conversion,
y="qdrt_n_clicks",
x="qdrt_n_impressions",
ci=None,
scatter_kws=({'alpha':0.5}))
# Add a layer of your prediction points
sns.scatterplot(data=prediction_data,
y="qdrt_n_clicks",
x="qdrt_n_impressions",
color='gold',
marker='s',
s=250)
plt.title('qdrt_n_clicks v. qdrt_n_impressions')
plt.show()

# Plot the original (untransformed) variables
fig = plt.figure()
sns.regplot(data=ad_conversion,
y="n_clicks",
x="n_impressions",
ci=None,
scatter_kws=({'alpha':0.5}))
# Add a layer of your prediction points
sns.scatterplot(data=prediction_data,
y="n_clicks",
x="n_impressions",
color='gold',
marker='s',
s=200)
plt.title('n_clicks v. n_impressions')
plt.show()

Coefficient of determination¶
The coefficient of determination is a measure of how well the linear regression line fits the observed values. For simple linear regression, it is equal to the square of the correlation between the explanatory and response variables.
Here, we’ll take another look at the second stage of the advertising pipeline: modeling the click response to impressions. Two models are created: mdl_click_vs_impression_orig models n_clicks versus n_impressions. mdl_click_vs_impression_trans models n_clicks to the power of 0.25 versus n_impressions to the power of 0.25.
# Run a linear regression of the original (untransformed) variables
mdl_click_vs_impression_orig = ols('n_clicks ~ n_impressions', data = ad_conversion).fit()
# Print a summary of mdl_click_vs_impression_orig
print(mdl_click_vs_impression_orig.summary())
print()
# Print a summary of mdl_click_vs_impression_trans
print(mdl_click_vs_impression_trans.summary())
OLS Regression Results
==============================================================================
Dep. Variable: n_clicks R-squared: 0.892
Model: OLS Adj. R-squared: 0.891
Method: Least Squares F-statistic: 7683.
Date: Sat, 29 Jul 2023 Prob (F-statistic): 0.00
Time: 22:48:03 Log-Likelihood: -4126.7
No. Observations: 936 AIC: 8257.
Df Residuals: 934 BIC: 8267.
Df Model: 1
Covariance Type: nonrobust
=================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 1.6829 0.789 2.133 0.033 0.135 3.231
n_impressions 0.0002 1.96e-06 87.654 0.000 0.000 0.000
==============================================================================
Omnibus: 247.038 Durbin-Watson: 0.870
Prob(Omnibus): 0.000 Jarque-Bera (JB): 13215.277
Skew: -0.258 Prob(JB): 0.00
Kurtosis: 21.401 Cond. No. 4.88e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.88e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
OLS Regression Results
==============================================================================
Dep. Variable: qdrt_n_clicks R-squared: 0.945
Model: OLS Adj. R-squared: 0.944
Method: Least Squares F-statistic: 1.590e+04
Date: Sat, 29 Jul 2023 Prob (F-statistic): 0.00
Time: 22:48:03 Log-Likelihood: 193.90
No. Observations: 936 AIC: -383.8
Df Residuals: 934 BIC: -374.1
Df Model: 1
Covariance Type: nonrobust
======================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------
Intercept 0.0717 0.017 4.171 0.000 0.038 0.106
qdrt_n_impressions 0.1115 0.001 126.108 0.000 0.110 0.113
==============================================================================
Omnibus: 11.447 Durbin-Watson: 0.568
Prob(Omnibus): 0.003 Jarque-Bera (JB): 10.637
Skew: -0.216 Prob(JB): 0.00490
Kurtosis: 2.707 Cond. No. 52.1
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# Print the coeff of determination for mdl_click_vs_impression_orig
print(f'The r-squared value for mdl_click_vs_impression_orig is:\n{mdl_click_vs_impression_orig.rsquared}')
print()
# Print the coeff of determination for mdl_click_vs_impression_trans
print(f'The r-squared value for mdl_click_vs_impression_trans is:\n{mdl_click_vs_impression_trans.rsquared}')
The r-squared value for mdl_click_vs_impression_orig is:
0.8916134973508041

The r-squared value for mdl_click_vs_impression_trans is:
0.9445272817143905
mdl_click_vs_impression_orig has a coefficient of determination of 0.89, meaning the number of impressions explains 89% of the variability in the number of clicks.
Additionally, the coefficient of determination suggests that mdl_click_vs_impression_trans gives a better fit.
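We can also verify the identity mentioned above: for simple linear regression, R-squared equals the squared correlation between the explanatory and response variables. This is a quick sketch using the untransformed model.
# Sanity check: R-squared equals the squared correlation (simple linear regression only)
corr = ad_conversion['n_impressions'].corr(ad_conversion['n_clicks'])
print(corr ** 2)                              # squared correlation
print(mdl_click_vs_impression_orig.rsquared)  # should match the value above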
Residual standard error¶
Residual standard error (RSE) is a measure of the typical size of the residuals. Equivalently, it’s a measure of how wrong you can expect predictions to be. Smaller numbers are better, with zero being a perfect fit to the data.
Again, we’ll look at the models from the advertising pipeline, mdl_click_vs_impression_orig and mdl_click_vs_impression_trans.
# Calculate mse_orig for mdl_click_vs_impression_orig
mse_orig = mdl_click_vs_impression_orig.mse_resid  # mean squared error of the residuals
print("MSE of original model: ",mse_orig)
# Calculate rse_orig for mdl_click_vs_impression_orig and print it
rse_orig = np.sqrt(mse_orig)
print("RSE of original model: ", rse_orig)
# Calculate mse_trans for mdl_click_vs_impression_trans
mse_trans = mdl_click_vs_impression_trans.mse_resid  # mean squared error of the residuals
print("MSE of transformed model: ",mse_trans)
# Calculate rse_trans for mdl_click_vs_impression_trans and print it
rse_trans = np.sqrt(mse_trans)
print("RSE of transformed model: ", rse_trans)
MSE of original model:  396.2424208189449
RSE of original model:  19.905838862478138
MSE of transformed model:  0.038772133892971475
RSE of transformed model:  0.19690640896875722
The difference between the predicted clicks and actual clicks is typically 19.9 clicks.
The difference between the predicted clicks^0.25 and actual clicks^0.25 is typically 0.2 clicks^0.25.
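To connect RSE back to its definition, here is a sketch that computes it directly from the residuals: the sum of squared residuals divided by the residual degrees of freedom (the number of observations minus two, since a slope and an intercept were estimated), then square-rooted. This is the same calculation that mse_resid performs internally.
# Manual RSE for the original model: sqrt(sum of squared residuals / (n - 2))
residuals = mdl_click_vs_impression_orig.resid
deg_freedom = len(residuals) - 2  # two estimated parameters: intercept and slope
rse_manual = np.sqrt(np.sum(residuals ** 2) / deg_freedom)
print(rse_manual)  # should match rse_orig above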
RSE is a measure of accuracy for regression models. It even works on other statistical model types, such as regression trees, so you can compare accuracy across different classes of models.
If a linear regression model is a good fit, then the residuals are approximately normally distributed, with mean zero.
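One way to eyeball that assumption is to plot the residuals themselves; for a good fit, the histogram should look roughly symmetric and centered on zero. A minimal sketch for the transformed model (assuming a seaborn version that provides histplot):
# Quick visual check of the residuals for the transformed model
fig = plt.figure()
sns.histplot(mdl_click_vs_impression_trans.resid, kde=True)
plt.title('Residuals of mdl_click_vs_impression_trans')
plt.show()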