Linear Regression of Taiwan Real Estate using StatsModels/Seaborn

Simple Linear Regression Modeling¶

We’ll learn the basics of this popular statistical model, what regression is, and how linear and logistic regressions differ. We’ll then learn how to fit simple linear regression models with numeric and categorical explanatory variables, and how to describe the relationship between the response and explanatory variables using model coefficients.

Regression lets you predict the values of a response variable from known values of explanatory variables. Which variable you use as the response variable depends on the question you are trying to answer, but in many datasets, there will be an obvious choice for variables that would be interesting to predict. Over the next few exercises, we’ll explore a Taiwan real estate dataset with four variables.

Variable                 Meaning
dist_to_mrt_station_m   Distance to nearest MRT metro station, in meters.
n_convenience           No. of convenience stores in walking distance.
house_age_years       The age of the house, in years, in three groups.
price_twd_msq           House price per unit area, in New Taiwan dollars per meter squared.

The price_twd_msq variable will make a good response variable.

This dataset can be found at the below machine learning archive:
Real estate valuation data set. (2018). UCI Machine Learning Repository.
https://archive.ics.uci.edu/dataset/477/real+estate+valuation+data+set

Visualizing two numeric variables¶

Before we can run any statistical models, it’s usually a good idea to visualize our dataset. Here, we’ll look at the relationship between house price per area and the number of nearby convenience stores using the Taiwan real estate dataset.

One challenge in this dataset is that the number of convenience stores contains integer data, causing points to overlap. To solve this, we will make the points transparent.

taiwan_real_estate is available as a pandas DataFrame.

In [79]:

# import
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
from statsmodels.formula.api import ols # ordinary least squares   

taiwan_real_estate = pd.read_excel('Real estate valuation data set.xlsx', index_col=0)  
taiwan_real_estate.rename(columns={'X1 transaction date': 'x_date', 
                                   'X2 house age': 'house_age_years', 
                                   'X3 distance to the nearest MRT station': 'dist_to_mrt_station_m', 
                                   'X4 number of convenience stores': 'n_convenience', 
                                   'X5 latitude': 'latitude', 
                                   'X6 longitude': 'longitude',
                                   'Y house price of unit area': 'price_twd_msq'}, inplace = True)

print(taiwan_real_estate.head())

         x_date  house_age_years  dist_to_mrt_station_m  n_convenience  \
No                                                                       
1   2012.916667             32.0               84.87882             10   
2   2012.916667             19.5              306.59470              9   
3   2013.583333             13.3              561.98450              5   
4   2013.500000             13.3              561.98450              5   
5   2012.833333              5.0              390.56840              5   

    latitude  longitude  price_twd_msq  
No                                      
1   24.98298  121.54024           37.9  
2   24.98034  121.53951           42.2  
3   24.98746  121.54391           47.3  
4   24.98746  121.54391           54.8  
5   24.97937  121.54245           43.1

In [80]:

# Increase size of plot in jupyter 
# (you will need to run the cell twice for the size change to take effect, not sure why)
plt.rcParams["figure.figsize"] = (18,12)

# house price per area and the number of nearby convenience stores
sns.scatterplot(data=taiwan_real_estate, y='price_twd_msq', x='n_convenience')
plt.title("House Price of Unit Area v. N of Convenience Stores")
plt.show()

In [81]:

# Set the style to display gridlines
sns.set_style('whitegrid')

# Draw a trend line on the scatter plot of price_twd_msq vs. n_convenience
# Regplots are used to plot trendlines
sns.regplot(
         x="n_convenience",
         y="price_twd_msq",
         data=taiwan_real_estate,
         ci=None,
         scatter_kws={'alpha': 0.5}) # makes the data points 50% transparent

# Show the plot
plt.title("House Price of Unit Area v. N of Convenience Stores")
plt.show()

Estimating the slope¶

To estimate the slope, we need two points, (35,3), and (40,5). We calculate the change in y values between the points. Now we do the same for the x axis. To estimate the slope we divide the y difference by the x difference. Let’s run a linear regression to check our guess.

(40 - 35) / (5 - 3) = 2.5

Running a model¶

To run a linear regression model, you import the ols function from statsmodels.formula.api. OLS stands for ordinary least squares, which is a type of regression, and is commonly used. The function ols takes two arguments. The first argument is a formula: the response variable is written to the left of the tilde ~, and the explanatory variable is written to the right. The data argument takes the DataFrame containing the variables. To actually fit the model, you add the .fit() method to your freshly created model object. When you print the resulting model, it’s helpful to use the params attribute, which contains the model’s parameters. This will result in two coefficients. These coefficients are the Intercept and slope of the straight line. It seems our guesses were pretty close. The intercept is very close to our estimate of 2.5. The slope, indicated here as n_convenience, is 2.64, slightly higher than what we estimated.

Interpreting the model coefficients¶

That means that we expect the total house price per unit area (in New Taiwan dollars per meter squared) to be 27 plus 2.6 times the number of convenience stores. So for every additional claim, we expect the total price to increase by 2.6.

Linear regression with ols()¶

While sns.regplot() can display a linear regression trend line, it doesn’t give you access to the intercept and slope as variables, or allow you to work with the model results as variables. That means that sometimes you’ll need to run a linear regression ourselves.

Time to run our first model!

In [82]:

# Import the ols function
from statsmodels.formula.api import ols

# Create the model object
mdl_price_vs_conv = ols('price_twd_msq ~ n_convenience', data=taiwan_real_estate) #TWD is an abbreviation for Taiwan dollars.

# Fit the model
mdl_price_vs_conv = mdl_price_vs_conv.fit()

# Print the parameters of the fitted model
print(mdl_price_vs_conv.params)

Intercept        27.181105
n_convenience     2.637653
dtype: float64

Visualizing numeric vs. categorical¶

If the explanatory variable is categorical, the scatter plot that we used before to visualize the data doesn’t make sense. Instead, a good option is to draw a histogram for each category. The Taiwan real estate dataset has a categorical variable in the form of the age of each house. The ages have been split into 3 groups: 0 to 15 years, 15 to 30 years, and 30 to 45 years. taiwan_real_estate is available

In [83]:

# Use pd.cut to create a categorical variable from a numeric column. 
# Define your boundaries (including np.inf) and category names, then apply pd.cut to the desired numeric column.
bins = [0, 15, 30, 45] # np.inf is not needed: no overflow bin needed
names = ['0 to 15', '15 to 30', '30 to 45']

taiwan_real_estate['house_age_years_group'] = pd.cut(taiwan_real_estate['house_age_years'], bins, labels=names)

print(taiwan_real_estate.dtypes)

x_date                    float64
house_age_years           float64
dist_to_mrt_station_m     float64
n_convenience               int64
latitude                  float64
longitude                 float64
price_twd_msq             float64
house_age_years_group    category
dtype: object

In [84]:

print(taiwan_real_estate.head())

         x_date  house_age_years  dist_to_mrt_station_m  n_convenience  \
No                                                                       
1   2012.916667             32.0               84.87882             10   
2   2012.916667             19.5              306.59470              9   
3   2013.583333             13.3              561.98450              5   
4   2013.500000             13.3              561.98450              5   
5   2012.833333              5.0              390.56840              5   

    latitude  longitude  price_twd_msq house_age_years_group  
No                                                            
1   24.98298  121.54024           37.9              30 to 45  
2   24.98034  121.53951           42.2              15 to 30  
3   24.98746  121.54391           47.3               0 to 15  
4   24.98746  121.54391           54.8               0 to 15  
5   24.97937  121.54245           43.1               0 to 15

In [85]:

# Histograms of price_twd_msq with 10 bins, split by the age of each house
disp = sns.displot(data=taiwan_real_estate,
         x='price_twd_msq',
         bins=10,
         col='house_age_years_group',
         col_wrap=2)

#move overall title up
disp.fig.subplots_adjust(top=.9)

# Show the plot
disp.fig.suptitle("House Price per Unit Area by The Age of The House")
plt.show()

Calculating means by category¶

A good way to explore categorical variables further is to calculate summary statistics for each category. For example, we can calculate the mean and median of our response variable, grouped by a categorical variable. As such, we can compare each category in more detail.

Here, we’ll look at grouped means for the house prices in the Taiwan real estate dataset. This will help you understand the output of a linear regression with a categorical variable.

In [86]:

# Calculate the mean of price_twd_msq, grouped by house age
mean_price_by_age = taiwan_real_estate.groupby(by=['house_age_years_group'])['price_twd_msq'].mean()

# Print the result
print(mean_price_by_age)

house_age_years_group
0 to 15     40.404000
15 to 30    32.643750
30 to 45    37.812766
Name: price_twd_msq, dtype: float64

Linear regression with a categorical explanatory variable¶

The means of each category will also be the coefficients of a linear regression model with one categorical variable. We’ll prove that in this exercise.

To run a linear regression model with categorical explanatory variables, we can use the same code as with numeric explanatory variables. The coefficients returned by the model are different, however. Here we’ll run a linear regression on the Taiwan real estate dataset.

In [87]:

# Create the model, fit it
mdl_price_vs_age = ols('price_twd_msq ~ house_age_years_group', data=taiwan_real_estate).fit()

# Print the parameters of the fitted model
print(mdl_price_vs_age.params)

Intercept                            40.404000
house_age_years_group[T.15 to 30]    -7.760250
house_age_years_group[T.30 to 45]    -2.591234
dtype: float64

In [88]:

# Update the model formula to remove the intercept
mdl_price_vs_age0 = ols("price_twd_msq ~ house_age_years_group + 0", data=taiwan_real_estate).fit()

# Print the parameters of the fitted model
print(mdl_price_vs_age0.params)

house_age_years_group[0 to 15]     40.404000
house_age_years_group[15 to 30]    32.643750
house_age_years_group[30 to 45]    37.812766
dtype: float64

Predicting house prices¶

Perhaps the most useful feature of statistical models like linear regression is that you can make predictions. That is, you specify values for each of the explanatory variables, feed them to the model, and get a prediction for the corresponding response variable. The code flow is as follows.

explanatory_data = pd.DataFrame({"explanatory_var": list_of_values})
predictions = model.predict(explanatory_data)
prediction_data = explanatory_data.assign(response_var=predictions)

Here, we’ll make predictions for the house prices in the Taiwan real estate dataset.

In [89]:

# Import numpy with alias np
import numpy as np

# Create the explanatory_data 
explanatory_data = pd.DataFrame({'n_convenience': np.arange(0,11)})

# Print it
print(explanatory_data)

    n_convenience
0               0
1               1
2               2
3               3
4               4
5               5
6               6
7               7
8               8
9               9
10             10

In [90]:

# Import numpy with alias np
import numpy as np

# Create explanatory_data 
explanatory_data = pd.DataFrame({'n_convenience': np.arange(0, 11)})

# Use mdl_price_vs_conv to predict with explanatory_data, call it price_twd_msq
price_twd_msq = mdl_price_vs_conv.predict(explanatory_data)

# Print it
print(price_twd_msq)

0     27.181105
1     29.818758
2     32.456412
3     35.094065
4     37.731719
5     40.369372
6     43.007026
7     45.644679
8     48.282332
9     50.919986
10    53.557639
dtype: float64

In [91]:

# Import numpy with alias np
import numpy as np

# Create explanatory_data 
explanatory_data = pd.DataFrame({'n_convenience': np.arange(0, 11)})

# Use mdl_price_vs_conv to predict with explanatory_data, call it price_twd_msq
price_twd_msq = mdl_price_vs_conv.predict(explanatory_data)

# Create prediction_data
prediction_data = explanatory_data.assign(
    price_twd_msq = price_twd_msq)

# Print the result
print(prediction_data)

    n_convenience  price_twd_msq
0               0      27.181105
1               1      29.818758
2               2      32.456412
3               3      35.094065
4               4      37.731719
5               5      40.369372
6               6      43.007026
7               7      45.644679
8               8      48.282332
9               9      50.919986
10             10      53.557639

In [93]:

# Create a new figure, fig
fig = plt.figure()

sns.regplot(x="n_convenience",
            y="price_twd_msq",
            data=taiwan_real_estate,
            ci=None,
            scatter_kws=({'alpha':0.5}))

# Add a scatter plot layer to the regplot
sns.scatterplot(data=prediction_data
                ,x="n_convenience"
                ,y="price_twd_msq"
                ,marker='s'
                ,color='red'
                ,s=200
                )

# Show the layered plot
plt.title("House Price of Unit Area v. N of Convenience Stores")
plt.show()

The limits of prediction¶

In the last exercise, we made predictions on some sensible, could-happen-in-real-life, situations. That is, the cases when the number of nearby convenience stores were between zero and ten. To test the limits of the model’s ability to predict, try some impossible situations.

Use the console to try predicting house prices from mdl_price_vs_conv when there are -1 convenience stores. Do the same for 2.5 convenience stores. What happens in each case? Will see that the model will successfully give us predictions about cases that are impossible in real life.

In [94]:

# Define a DataFrame impossible
impossible = pd.DataFrame({'n_convenience': (-1, 2.5)})

In [95]:

prediction_data = mdl_price_vs_conv.predict(impossible)

explan_predict_data = impossible.assign(price_twd_msq = prediction_data)

print(explan_predict_data)

sns.regplot(    y = 'price_twd_msq'
                ,x = 'n_convenience'
                ,data = taiwan_real_estate
                ,ci=None)

sns.scatterplot (y = 'price_twd_msq'
                ,x = 'n_convenience'
                ,data = explan_predict_data
                ,s=200
                ,color='pink'
                ,marker='s')

plt.title('Impossible Predictions: Price Twd v N of Convenience Stores')
plt.show()

   n_convenience  price_twd_msq
0           -1.0      24.543451
1            2.5      33.775238

Extracting model elements¶

The model object created by ols() contains many elements. In order to perform further analysis on the model results, you need to extract its useful bits. The model coefficients, the fitted values, and the residuals are perhaps the most important pieces of the linear model object.

In [96]:

# create the model object of price as dependent and covenience stores as the independent variable
mdl_price_vs_conv = ols('price_twd_msq ~ n_convenience', data=taiwan_real_estate).fit()
print(mdl_price_vs_conv.params)

Intercept        27.181105
n_convenience     2.637653
dtype: float64

In [97]:

print(mdl_price_vs_conv.fittedvalues)

No
1      53.557639
2      50.919986
3      40.369372
4      40.369372
5      40.369372
         ...    
410    27.181105
411    50.919986
412    45.644679
413    40.369372
414    50.919986
Length: 414, dtype: float64

In [98]:

# https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.RegressionResults.resid.html
print(mdl_price_vs_conv.resid)

No
1     -15.657639
2      -8.719986
3       6.930628
4      14.430628
5       2.730628
         ...    
410   -11.781105
411    -0.919986
412    -5.044679
413    12.130628
414    12.980014
Length: 414, dtype: float64

In [99]:

print(mdl_price_vs_conv.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:          price_twd_msq   R-squared:                       0.326
Model:                            OLS   Adj. R-squared:                  0.324
Method:                 Least Squares   F-statistic:                     199.3
Date:                Tue, 19 Dec 2023   Prob (F-statistic):           3.41e-37
Time:                        14:03:49   Log-Likelihood:                -1586.0
No. Observations:                 414   AIC:                             3176.
Df Residuals:                     412   BIC:                             3184.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept        27.1811      0.942     28.857      0.000      25.330      29.033
n_convenience     2.6377      0.187     14.118      0.000       2.270       3.005
==============================================================================
Omnibus:                      171.927   Durbin-Watson:                   1.993
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1417.242
Skew:                           1.553   Prob(JB):                    1.78e-308
Kurtosis:                      11.516   Cond. No.                         8.87
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Manually predicting house prices¶

You can manually calculate the predictions from the model coefficients. When making predictions in real life, it is better to use .predict(), but doing this manually is helpful to reassure yourself that predictions aren’t magic – they are simply arithmetic.

In fact, for a simple linear regression, the predicted value is just the intercept plus the slope times the explanatory variable.

response = intercept + slope * explanatory

In [100]:

coeffs = mdl_price_vs_conv.params
intercept = coeffs[0]
print(intercept)

27.18110478147242

In [101]:

slope = coeffs[1]
print(slope)

2.6376534634043725

In [102]:

explanatory_data = taiwan_real_estate['n_convenience']
df = pd.DataFrame()

response_prediction = intercept + (slope * explanatory_data)
df['response_prediction']      = response_prediction
df['fittedvalues']             = mdl_price_vs_conv.fittedvalues
df['predict_explanatory_data'] = mdl_price_vs_conv.predict(explanatory_data)

print(df)

     response_prediction  fittedvalues  predict_explanatory_data
No                                                              
1              53.557639     53.557639                 53.557639
2              50.919986     50.919986                 50.919986
3              40.369372     40.369372                 40.369372
4              40.369372     40.369372                 40.369372
5              40.369372     40.369372                 40.369372
..                   ...           ...                       ...
410            27.181105     27.181105                 27.181105
411            50.919986     50.919986                 50.919986
412            45.644679     45.644679                 45.644679
413            40.369372     40.369372                 40.369372
414            50.919986     50.919986                 50.919986

[414 rows x 3 columns]

In [103]:

df2 = pd.DataFrame()
df2['n_convenience'] = explanatory_data
df2['fitted_values'] = mdl_price_vs_conv.fittedvalues
df2['residuals'] = mdl_price_vs_conv.resid

print(df2.head())

    n_convenience  fitted_values  residuals
No                                         
1              10      53.557639 -15.657639
2               9      50.919986  -8.719986
3               5      40.369372   6.930628
4               5      40.369372  14.430628
5               5      40.369372   2.730628

In [104]:

# Graph Residuals
# reference: https://stackoverflow.com/questions/51220918/python-plot-residuals-on-a-fitted-model

y = df2['fitted_values'] 
x = df2['n_convenience']
dy = df2['residuals']

fig, ax = plt.subplots()
ax.plot(x,y)
ax.scatter(x,y+dy)

ax.vlines(x,y,y+dy)

plt.ylabel('Price TWD')
plt.xlabel('N of Convenience')
plt.title('Plot of Residuals for OLS of Price TWD v N of Convenience')
plt.show()

Transforming the explanatory variable¶

If there is no straight-line relationship between the response variable and the explanatory variable, it is sometimes possible to create one by transforming one or both of the variables. Here, you’ll look at transforming the explanatory variable.

You’ll take another look at the Taiwan real estate dataset, this time using the distance to the nearest MRT (metro) station as the explanatory variable. You’ll use code to make every commuter’s dream come true: shortening the distance to the metro station by taking the square root. Take that, geography!

In [110]:

# Plot using the transformed variable
sns.regplot(y = 'price_twd_msq'
            ,x= 'dist_to_mrt_station_m'
            ,data = taiwan_real_estate
            ,ci = None
            ,scatter_kws = ({'alpha':0.5}))

plt.show()

In [111]:

# Create sqrt_dist_to_mrt_m
taiwan_real_estate["sqrt_dist_to_mrt_m"] = np.sqrt(taiwan_real_estate["dist_to_mrt_station_m"])

plt.figure()

# Plot using the transformed variable
sns.regplot(y = 'price_twd_msq'
            ,x= 'sqrt_dist_to_mrt_m'
            ,data = taiwan_real_estate
            ,ci = None
            ,scatter_kws = ({'alpha':0.5}))
plt.show()

In [107]:

mdl_price_sqrt_dist = ols('price_twd_msq ~ sqrt_dist_to_mrt_m', data = taiwan_real_estate).fit()
print(mdl_price_sqrt_dist.params)

Intercept             55.225885
sqrt_dist_to_mrt_m    -0.604296
dtype: float64

In [114]:

explanatory_data = pd.DataFrame({'sqrt_dist_to_mrt_m' : np.arange(0,81,10),
                                 'dist_to_mrt_station_m' : np.arange(0,81,10) **2})
prediction_data  = explanatory_data.assign(price_twd_msq = mdl_price_sqrt_dist.predict(explanatory_data))

fig = plt.figure()

sns.regplot(y = 'price_twd_msq'
            ,x= 'sqrt_dist_to_mrt_m'
            ,data = taiwan_real_estate
            ,ci = None
            ,scatter_kws = ({'alpha':0.5}))

sns.scatterplot(y = 'price_twd_msq'
                ,x= 'sqrt_dist_to_mrt_m'
                ,data = prediction_data
                ,marker='s'
                ,color='coral' # https://matplotlib.org/stable/gallery/color/named_colors.html
                ,s=200)
plt.show()

In [115]:

fig = plt.figure()

sns.regplot(y = 'price_twd_msq'
            ,x= 'dist_to_mrt_station_m'
            ,data = taiwan_real_estate
            ,ci = None
            ,scatter_kws = ({'alpha':0.5}))

sns.scatterplot(y = 'price_twd_msq'
                ,x= 'dist_to_mrt_station_m'
                ,data = prediction_data
                ,marker='s'
                ,color='coral' # https://matplotlib.org/stable/gallery/color/named_colors.html
                ,s=200)
plt.show()

Drawing diagnostic plots¶

It’s time for us to draw these diagnostic plots ourselves and the model of house prices versus number of convenience stores.

Let’s create the residuals versus fitted values plot and add a lowess argument to visualize the trend of the residuals.

In [116]:

# Plot the residuals vs. fitted values
sns.residplot(x='n_convenience', y='price_twd_msq', data=taiwan_real_estate, lowess=True)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")

# Show the plot
plt.show()

This first diagnostic plot is of residuals versus fitted values. The blue line is a LOWESS trend line, which is a smooth curve following the data. These aren’t good for making predictions but are useful for visualizing trends. If residuals met the assumption that they are normally distributed with mean zero, then the trend line should closely follow the y equals zero line on the plot.

In [117]:

# Import qqplot
from statsmodels.api import qqplot

# Create the Q-Q plot of the residuals
qqplot(data=mdl_price_vs_conv.resid, fit=True, line="45")

# Show the plot
plt.show()

This second diagnostic plot is called a Q-Q plot. It shows whether or not the residuals follow a normal distribution. On the x-axis, the points are quantiles from the normal distribution. On the y-axis, you get the sample quantiles, which are the quantiles derived from your dataset. It sounds technical, but interpreting this plot is straightforward. If the points track along the straight line, they are normally distributed. If not, they aren’t.

In [118]:

# We first need to extract the normalized residuals from the model, 
# which you can get by using the get_influence method, then accessing the resid_studentized_internal attribute. 
model_norm_residuals = mdl_price_vs_conv.get_influence().resid_studentized_internal

# We then take the absolute values and take the square root of these normalized residuals to standardize them. 
model_norm_residuals_abs_sqrt = np.sqrt(np.abs(model_norm_residuals))

# Create the scale-location plot
sns.regplot(x=mdl_price_vs_conv.fittedvalues, y=model_norm_residuals_abs_sqrt, ci=None, lowess=True)
plt.xlabel("Fitted values")
plt.ylabel("Sqrt of abs val of stdized residuals")

# Show the plot
plt.show()

This third plot shows the square root of the standardized residuals versus the fitted values. It’s often called a scale-location plot, because that’s easier to say. Where the first plot showed whether or not the residuals go positive or negative as the fitted values change, this plot shows whether the size of the residuals gets bigger or smaller.

In [119]:

taiwan_real_estate_sorted = taiwan_real_estate.sort_values(by=['sqrt_dist_to_mrt_m'], ascending=False)
taiwan_real_estate_leverage = taiwan_real_estate_sorted.head()

print(taiwan_real_estate_leverage)

          x_date  house_age_years  dist_to_mrt_station_m  n_convenience  \
No                                                                        
348  2013.583333             17.4               6488.021              1   
117  2013.000000             30.9               6396.283              1   
250  2012.833333             18.0               6306.153              1   
256  2013.416667             31.5               5512.038              1   
9    2013.500000             31.7               5512.038              1   

     latitude  longitude  price_twd_msq house_age_years_group  \
No                                                              
348  24.95719  121.47353           11.2              15 to 30   
117  24.94375  121.47883           12.2              30 to 45   
250  24.95743  121.47516           15.0              15 to 30   
256  24.95095  121.48458           17.4              30 to 45   
9    24.95095  121.48458           18.8              30 to 45   

     sqrt_dist_to_mrt_m  
No                       
348           80.548253  
117           79.976765  
250           79.411290  
256           74.243101  
9             74.243101

In [120]:

# Create sqrt_dist_to_mrt_m
taiwan_real_estate["sqrt_dist_to_mrt_m"] = np.sqrt(taiwan_real_estate["dist_to_mrt_station_m"])

plt.figure()

# Plot using the transformed variable
sns.regplot(data=taiwan_real_estate,
            y='price_twd_msq',
            x='sqrt_dist_to_mrt_m',
            ci=None,
            scatter_kws=({'alpha':0.5}))

# Plot using the transformed variable
sns.scatterplot(data=taiwan_real_estate_leverage,
            y='price_twd_msq',
            x='sqrt_dist_to_mrt_m',
            color='red',
            s=200)


plt.title('House price per unit area in New Taiwan dollars per meter squared. vs. Square Root of Distance to nearest MRT metro station, in meters')
plt.show()

Leverage¶

Leverage measures how unusual or extreme the explanatory variables are for each observation. Very roughly, high leverage means that the explanatory variable has values that are different from other points in the dataset. In the case of simple linear regression, where there is only one explanatory value, this typically means values with a very high or very low explanatory value.

Observations with a large distance to the nearest MRT station have the highest leverage, because most of the observations have a short distance, so long distances are more extreme.

In [123]:

# Create summary_info
summary_info = mdl_price_sqrt_dist.get_influence().summary_frame()

# Add the hat_diag column to taiwan_real_estate, name it leverage
taiwan_real_estate["leverage"] = summary_info['hat_diag']

# Sort taiwan_real_estate by leverage in descending order and print the head
print(taiwan_real_estate.sort_values(by=['leverage'], ascending=False).head())

          x_date  house_age_years  dist_to_mrt_station_m  n_convenience  \
No                                                                        
348  2013.583333             17.4               6488.021              1   
117  2013.000000             30.9               6396.283              1   
250  2012.833333             18.0               6306.153              1   
256  2013.416667             31.5               5512.038              1   
9    2013.500000             31.7               5512.038              1   

     latitude  longitude  price_twd_msq house_age_years_group  \
No                                                              
348  24.95719  121.47353           11.2              15 to 30   
117  24.94375  121.47883           12.2              30 to 45   
250  24.95743  121.47516           15.0              15 to 30   
256  24.95095  121.48458           17.4              30 to 45   
9    24.95095  121.48458           18.8              30 to 45   

     sqrt_dist_to_mrt_m  leverage  
No                                 
348           80.548253  0.026665  
117           79.976765  0.026135  
250           79.411290  0.025617  
256           74.243101  0.021142  
9             74.243101  0.021142

Influence¶

Influence measures how much a model would change if each observation was left out of the model calculations, one at a time. That is, it measures how different the prediction line would look if we ran a linear regression on all data points except that point, compared to running a linear regression on the whole dataset.

The standard metric for influence is Cook’s distance, which calculates influence based on the residual size and the leverage of the point. We can see the same model as last time: house price versus the square root of distance from the nearest MRT station in the Taiwan real estate dataset.

In [125]:

# Create summary_info
summary_info = mdl_price_sqrt_dist.get_influence().summary_frame()

# Add the hat_diag column to taiwan_real_estate, name it leverage
taiwan_real_estate["leverage"] = summary_info["hat_diag"]

# Add the cooks_d column to taiwan_real_estate, name it cooks_dist
taiwan_real_estate["cooks_dist"] = summary_info['cooks_d']

# Sort taiwan_real_estate by cooks_dist in descending order and print the head.
print(taiwan_real_estate.sort_values(by=['cooks_dist'], ascending=False).head())

          x_date  house_age_years  dist_to_mrt_station_m  n_convenience  \
No                                                                        
271  2013.333333             10.8               252.5822              1   
149  2013.500000             16.4              3780.5900              0   
229  2013.416667             11.9              3171.3290              0   
221  2013.333333             37.2               186.5101              9   
114  2013.333333             14.8               393.2606              6   

     latitude  longitude  price_twd_msq house_age_years_group  \
No                                                              
271  24.97460  121.53046          117.5               0 to 15   
149  24.93293  121.51203           45.1              15 to 30   
229  25.00115  121.51776           46.6               0 to 15   
221  24.97703  121.54265           78.3              30 to 45   
114  24.96172  121.53812            7.6               0 to 15   

     sqrt_dist_to_mrt_m  leverage  cooks_dist  
No                                             
271           15.892835  0.003849    0.115549  
149           61.486503  0.012147    0.052440  
229           56.314554  0.009332    0.035384  
221           13.656870  0.004401    0.025123  
114           19.830799  0.003095    0.022813

Leverage and influence are important concepts for determining whether your model is overly affected by some unusual data points.