Fish Catch Data Set¶
Description: The Fish Catch data set contains measurements on 159 fish caught in Lake Laengelmavesi, Finland.
Usage: data(fish)
Format: A data frame with 159 observations on the following 7 variables.
Weight: Weight of the fish (in grams)
Length1: Length from the nose to the beginning of the tail (in cm)
Length2: Length from the nose to the notch of the tail (in cm)
Length3: Length from the nose to the end of the tail (in cm)
Height: Maximal height as % of Length3
Width: Maximal width as % of Length3
Species: Represents the grouping structure: the 7 species are 1=Bream, 2=Whitewish, 3=Roach, 4=Parkki, 5=Smelt,
6=Pike, 7=Perch.
Details
The Fish Catch data set contains measurements on 159 fish caught in Lake Laengelmavesi, Finland. For the 159 fish of 7 species, the weight, length, height, and width were measured. Three different length measurements are recorded: from the nose of the fish to the beginning of its tail, from the nose to the notch of its tail, and from the nose to the end of its tail. The height and width are calculated as percentages of the third length variable. This results in 6 observed variables: Weight, Length1, Length2, Length3, Height, Width. Observation 14 has a missing value in the variable Weight, so this observation is usually excluded from the analysis. The last variable, Species, represents the grouping structure: the 7 species are 1=Bream, 2=Whitewish, 3=Roach, 4=Parkki, 5=Smelt, 6=Pike, 7=Perch. This data set was also analyzed in the context of robust Linear Discriminant Analysis by Todorov (2007), Todorov and Pires (2007).
Source:
Journal of Statistics Education, Fish Catch Data Set, http://www.amstat.org/publications/jse/datasets/fishcatch.txt, accessed August 2006.
References:
Todorov, V. (2007) Robust selection of variables in linear discriminant analysis, Statistical Methods and Applications, 15, 395–407, doi:10.1007/s10260-006-0032-6.
Todorov, V. and Pires, A.M. (2007) Comparative performance of several robust linear discriminant analysis methods, REVSTAT Statistical Journal, 5, 63–83.
import pyreadr

# Data: https://github.com/valentint/rrcov/blob/master/data/fish.rda
# Definitions: https://search.r-project.org/CRAN/refmans/rrcov/html/fish.html
orderedDict = pyreadr.read_r('fish.rda')  # returns an OrderedDict of name -> DataFrame
print(orderedDict)
OrderedDict([('fish', Weight Length1 Length2 Length3 Height Width Species
rownames
1 242.0 23.2 25.4 30.0 38.4 13.4 1
2 290.0 24.0 26.3 31.2 40.0 13.8 1
3 340.0 23.9 26.5 31.1 39.8 15.1 1
4 363.0 26.3 29.0 33.5 38.0 13.3 1
5 430.0 26.5 29.0 34.0 36.6 15.1 1
... ... ... ... ... ... ... ...
155 1100.0 39.0 42.0 44.6 28.7 15.4 7
156 1000.0 39.8 43.0 45.2 26.4 16.1 7
157 1100.0 40.1 43.0 45.5 27.5 16.3 7
158 1000.0 40.2 43.5 46.0 27.4 17.7 7
159 1000.0 41.1 44.0 46.6 26.8 16.3 7
[159 rows x 7 columns])])
import pandas as pd

# Collect every DataFrame in the OrderedDict into a single pandas DataFrame.
# (Here there is only one entry, 'fish', so df = orderedDict['fish'] would also work.)
# https://stackoverflow.com/questions/44365209/generate-a-pandas-dataframe-from-ordereddict
df = pd.DataFrame()  # init empty df
for k, v in orderedDict.items():
    print(v.head())
    df = pd.concat([df, v])
          Weight  Length1  Length2  Length3  Height  Width Species
rownames
1          242.0     23.2     25.4     30.0    38.4   13.4       1
2          290.0     24.0     26.3     31.2    40.0   13.8       1
3          340.0     23.9     26.5     31.1    39.8   15.1       1
4          363.0     26.3     29.0     33.5    38.0   13.3       1
5          430.0     26.5     29.0     34.0    36.6   15.1       1
print(df)
          Weight  Length1  Length2  Length3  Height  Width Species
rownames
1          242.0     23.2     25.4     30.0    38.4   13.4       1
2          290.0     24.0     26.3     31.2    40.0   13.8       1
3          340.0     23.9     26.5     31.1    39.8   15.1       1
4          363.0     26.3     29.0     33.5    38.0   13.3       1
5          430.0     26.5     29.0     34.0    36.6   15.1       1
...          ...      ...      ...      ...     ...    ...     ...
155       1100.0     39.0     42.0     44.6    28.7   15.4       7
156       1000.0     39.8     43.0     45.2    26.4   16.1       7
157       1100.0     40.1     43.0     45.5    27.5   16.3       7
158       1000.0     40.2     43.5     46.0    27.4   17.7       7
159       1000.0     41.1     44.0     46.6    26.8   16.3       7

[159 rows x 7 columns]
SpeciesDict = {1:'Bream', 2:'Whitewish', 3:'Roach', 4:'Parkki', 5:'Smelt',
6:'Pike', 7:'Perch'}
# Only three columns actually change name; the rest keep their original names.
df = df.rename(columns={'Weight': 'mass_g',
                        'Length1': 'length_cm',
                        'Species': 'SpeciesCd'})
df['species'] = df['SpeciesCd'].map(SpeciesDict)
fish = df[df['SpeciesCd'].isin([1,3,6,7])]
print(fish.iloc[:14])
mass_g length_cm Length2 Length3 Height Width SpeciesCd \
rownames
1 242.0 23.2 25.4 30.0 38.4 13.4 1
2 290.0 24.0 26.3 31.2 40.0 13.8 1
3 340.0 23.9 26.5 31.1 39.8 15.1 1
4 363.0 26.3 29.0 33.5 38.0 13.3 1
5 430.0 26.5 29.0 34.0 36.6 15.1 1
6 450.0 26.8 29.7 34.7 39.2 14.2 1
7 500.0 26.8 29.7 34.5 41.1 15.3 1
8 390.0 27.6 30.0 35.0 36.2 13.4 1
9 450.0 27.6 30.0 35.1 39.9 13.8 1
10 500.0 28.5 30.7 36.2 39.3 13.7 1
11 475.0 28.4 31.0 36.2 39.4 14.1 1
12 500.0 28.7 31.0 36.2 39.7 13.3 1
13 500.0 29.1 31.5 36.4 37.8 12.0 1
14 NaN 29.5 32.0 37.3 37.3 13.6 1
species
rownames
1 Bream
2 Bream
3 Bream
4 Bream
5 Bream
6 Bream
7 Bream
8 Bream
9 Bream
10 Bream
11 Bream
12 Bream
13 Bream
14 Bream
Visualizing 1 numeric and 1 categorical variable¶
To visualize the data, scatter plots aren’t ideal because species is categorical. Instead, we can draw a histogram for each of the species. To give a separate panel to each species, we use seaborn’s displot function. This takes a DataFrame as the data argument, the variable of interest as x, and the variable you want to split on as col. It also takes an optional col_wrap argument to specify the number of plots per row. Because the dataset is fairly small, we also set the bins argument to nine. By default, displot creates histograms.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['figure.figsize'] = (18,12)
sns.set_style('whitegrid')
sns.displot( data=fish
,col_wrap=2
,bins=9
,kind='hist'
,x='mass_g'
,col='species')
plt.suptitle('N of Mass v. Species', y=1.02)
plt.show()

Summary statistics: mean mass by species¶
Let’s calculate some summary statistics. First we group by species, then we calculate the mean mass within each group. You can see that the mean mass of a bream is six hundred and twenty-six grams. The mean mass for a perch is three hundred and eighty-two grams, and so on.
summary_stats = fish.groupby('species')['mass_g'].mean()
print(summary_stats)
species
Bream    626.000000
Perch    382.239286
Pike     718.705882
Roach    152.050000
Name: mass_g, dtype: float64
Linear regression¶
Let’s run a linear regression using mass as the response variable and species as the explanatory variable. The syntax is the same as with a numeric explanatory variable: you call ols(), passing a formula with the response variable on the left and the explanatory variable on the right, and setting the data argument to the DataFrame. We fit the model using the fit method, and retrieve the parameters using dot params on the fitted model. This time we have four coefficients: an intercept, and one for three of the fish species. A coefficient for bream is missing, but the number for the intercept looks familiar. The intercept is the mean mass of the bream that you just calculated. You might wonder what the other coefficients are, and why perch has a negative coefficient, since fish masses can’t be negative.
from statsmodels.formula.api import ols
mdl_mass_v_species = ols('mass_g ~ species',data=fish).fit()
print(mdl_mass_v_species.params)
Intercept           626.000000
species[T.Perch]   -243.760714
species[T.Pike]      92.705882
species[T.Roach]   -473.950000
dtype: float64
Model with or without an intercept¶
The coefficients for each category are calculated relative to the intercept. This way of displaying results can be useful for models with multiple explanatory variables, but for simple linear regression, it’s just confusing. Fortunately, we can fix it. By changing the formula slightly to append “plus zero”, we specify that all the coefficients should be given relative to zero. Equivalently, it means we are fitting a linear regression without an intercept term. If you subtract two hundred and forty-three point seventy-six from six hundred and twenty-six, you get three hundred and eighty-two point twenty-four, which is the mean mass of a perch.
Now these coefficients make more sense. They are all just the mean masses for each species. This is a reassuringly boring result. When you only have a single, categorical explanatory variable, the linear regression coefficients are simply the means of each category.
mdl_mass_v_species = ols('mass_g ~ species + 0',data=fish).fit()
print(mdl_mass_v_species.params)
species[Bream]    626.000000
species[Perch]    382.239286
species[Pike]     718.705882
species[Roach]    152.050000
dtype: float64
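We can verify this claim directly. Here’s a minimal sanity check reusing the summary_stats Series computed above; both the params Series and the groupby result are ordered alphabetically by species, so an element-wise comparison is valid.
# Sanity check: the no-intercept coefficients should equal the group means.
# Both are ordered alphabetically by species (Bream, Perch, Pike, Roach).
print(np.allclose(mdl_mass_v_species.params.to_numpy(),
                  summary_stats.to_numpy()))  # expected: True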
The fish dataset: bream¶
Here’s the fish dataset again. This time, we’ll look only at the bream data. There’s a new explanatory variable too: the length of each fish, which we’ll use to predict the mass of the fish.
bream = fish[fish['species'] == 'Bream']
print(bream.head())
mass_g length_cm Length2 Length3 Height Width SpeciesCd \
rownames
1 242.0 23.2 25.4 30.0 38.4 13.4 1
2 290.0 24.0 26.3 31.2 40.0 13.8 1
3 340.0 23.9 26.5 31.1 39.8 15.1 1
4 363.0 26.3 29.0 33.5 38.0 13.3 1
5 430.0 26.5 29.0 34.0 36.6 15.1 1
species
rownames
1 Bream
2 Bream
3 Bream
4 Bream
5 Bream
Plotting mass vs. length¶
Here’s a scatter plot of mass versus length for the bream data, with a linear trend line.
sns.regplot(data=bream
,y='mass_g'
,x='length_cm'
,ci=None)
plt.title('Bream Mass v. Length')
plt.show()

Running the model¶
Before we can make predictions, we need a fitted model. As before, we call ols() with a formula and the dataset, after which we add .fit(). The response, mass in grams, goes on the left-hand side of the formula, and the explanatory variable, length in centimeters, goes on the right. We need to assign the result to a variable to reuse later on. To view the coefficients of the model, we use the params attribute in a print() call.
mdl_mass_v_len = ols('mass_g ~ length_cm', data=bream).fit()
mdl_bream = mdl_mass_v_len  # alias kept for the model-fit and summary sections below
print(mdl_mass_v_len.params)
Intercept   -1015.049819
length_cm      54.107539
dtype: float64
Data on explanatory values to predict¶
The principle behind predicting is to ask questions of the form “if I set the explanatory variables to these values, what value would the response variable have?”. That means that the next step is to choose some values for the explanatory variables. To create new explanatory data, we need to store our explanatory variables of choice in a pandas DataFrame. You can use a dictionary to specify the columns. For this model, the only explanatory variable is the length of the fish. You can specify an interval of values using the np dot arange function, taking the start and stop of the interval as arguments. Notice that the stop value itself is excluded, so passing twenty and forty-one gives lengths from twenty to forty centimeters.
explanatory_data = pd.DataFrame({"length_cm":np.arange(20,41)})
print(explanatory_data.head())
   length_cm
0         20
1         21
2         22
3         23
4         24
Call predict()¶
The next step is to call predict on the model, passing the DataFrame of explanatory variables as the argument. The predict function returns a Series of predictions, one for each row of the explanatory data.
prediction = mdl_mass_v_len.predict(explanatory_data)
print(prediction.head())
0     67.100954
1    121.208492
2    175.316031
3    229.423570
4    283.531108
dtype: float64
Predicting inside a DataFrame¶
Having a single column of predictions isn’t that helpful to work with. It’s easier if the predictions are in a DataFrame alongside the explanatory variables. To do this, you can use the pandas assign() method, which returns a new object with all the original columns plus the new ones. You start with the existing DataFrame, explanatory_data. Then, you use .assign() to add a new column, named after the response variable, mass_g, calculated with the same predict code as before. The resulting DataFrame contains both the explanatory variable and the predicted response. Now we can answer questions like “how heavy would we expect a bream with length twenty-three centimeters to be?”, even though the original dataset didn’t include a bream of that exact length. Looking at the prediction data, you can see that the predicted mass is about 229 grams.
predicted_response_data = explanatory_data.assign(mass_g=prediction)
print(predicted_response_data.head())
   length_cm      mass_g
0         20   67.100954
1         21  121.208492
2         22  175.316031
3         23  229.423570
4         24  283.531108
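Since this is simple linear regression, each prediction is just the intercept plus the slope times the length, so we can reproduce the 23 cm prediction by hand. A quick check, not part of the original workflow:
# Reproduce the prediction for a 23 cm bream directly from the coefficients.
intercept, slope = mdl_mass_v_len.params
print(intercept + slope * 23)  # ~229.42 g, matching the predict() output above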
Showing predictions¶
Let’s include the predictions we just made on the scatter plot. To plot multiple layers, we create a matplotlib figure before calling regplot and scatterplot; both draw on the same axes, so the plt.show call displays both on one figure. I’ve marked the prediction points as red squares to distinguish them from the actual data points. Notice that the predictions lie exactly on the trend line.
fig = plt.figure()
sns.regplot(data=bream
,y='mass_g'
,x='length_cm'
,ci=None)
sns.scatterplot(data=predicted_response_data
,y='mass_g'
,x='length_cm'
,color='red'
,marker='s'
,s=100)
plt.title('Predictive Plot of Bream Mass v Len')
plt.show()

Extrapolating¶
All the bream were between twenty-three and thirty-eight centimeters long, but the linear model allows us to make predictions outside that range. This is called extrapolating. Let’s see what prediction we get for a ten centimeter bream. To achieve this, you first create a DataFrame with a single observation of 10 cm. You then predict the corresponding mass as before. Wow. The predicted mass is almost minus five hundred grams! This is obviously not physically possible, so the model performs poorly here. Extrapolation is sometimes appropriate, but can lead to misleading or ridiculous results. You need to understand the context of your data in order to determine whether it is sensible to extrapolate.
lil_bream_data = pd.DataFrame({'length_cm':[10]})
lil_bream_prediction = mdl_mass_v_len.predict(lil_bream_data)
lil_bream_prediction_data = lil_bream_data.assign(mass_g = lil_bream_prediction)
print(lil_bream_prediction_data)
   length_cm      mass_g
0         10 -473.974433
fig = plt.figure()
sns.regplot(data=bream
,y='mass_g'
,x='length_cm'
,ci=None)
sns.scatterplot(data=lil_bream_prediction_data
,y='mass_g'
,x='length_cm'
,color='red'
,marker='s'
,s=250)
plt.title('Extrapolated Bream Mass v Length')
plt.show()

Working with model objects & the .params attribute¶
We already learned how to extract the coefficients or parameters from your fitted model. We add the dot params attribute, which will return a pandas Series including your intercept and slope.
from statsmodels.formula.api import ols
mdl_mass_vs_length = ols('mass_g ~ length_cm', data = bream).fit()
print(mdl_mass_vs_length.params)
Intercept   -1015.049819
length_cm      54.107539
dtype: float64
.fittedvalues attribute¶
“Fitted values” is jargon for predictions on the original dataset used to create the model. We access them with the fittedvalues attribute. The result is a pandas Series of length thirty-four: the bream data has thirty-five rows, but the observation with the missing weight is dropped when the model is fitted. The fittedvalues attribute is essentially a shortcut for taking the explanatory variable columns from the dataset, then feeding them to the predict function.
mdl_fitted = pd.DataFrame()
mdl_fitted['fittedval'] = mdl_mass_vs_length.fittedvalues
explanatory_data = bream['length_cm']
mdl_fitted['predicted_explanatory_data'] = mdl_mass_vs_length.predict(explanatory_data)
print(mdl_fitted)
           fittedval  predicted_explanatory_data
rownames
1         240.245077                  240.245077
2         283.531108                  283.531108
3         278.120354                  278.120354
4         407.978447                  407.978447
5         418.799955                  418.799955
6         435.032217                  435.032217
7         435.032217                  435.032217
8         478.318247                  478.318247
9         478.318247                  478.318247
10        527.015032                  527.015032
11        521.604278                  521.604278
12        537.836540                  537.836540
13        559.479555                  559.479555
15        575.711817                  575.711817
16        575.711817                  575.711817
17        629.819356                  629.819356
18        629.819356                  629.819356
19        656.873125                  656.873125
20        662.283879                  662.283879
21        678.516140                  678.516140
22        683.926894                  683.926894
23        689.337648                  689.337648
24        705.569910                  705.569910
25        710.980664                  710.980664
26        705.569910                  705.569910
27        716.391418                  716.391418
28        754.266695                  754.266695
29        759.677448                  759.677448
30        797.552725                  797.552725
31        878.714033                  878.714033
32        878.714033                  878.714033
33        943.643080                  943.643080
34       1008.572126                 1008.572126
35       1041.036649                 1041.036649
.resid attribute¶
“Residuals” are a measure of inaccuracy in the model fit, and are accessed with the resid attribute. Like fitted values, there is one residual for each row of the dataset. Each residual is the actual response value minus the predicted response value. In this case, the residuals are the masses of the bream minus the fitted values. The residuals are illustrated as vertical lines on the regression plot below; each vertical line represents a single residual. We’ll see more on how to use the fitted values and residuals to assess the quality of your model later.
residuals = pd.DataFrame()
residuals['mdl_resi'] = mdl_mass_vs_length.resid
response_data = bream['mass_g']
residuals['mdl_minus_fitted'] = response_data - mdl_mass_vs_length.fittedvalues
print(residuals)
            mdl_resi  mdl_minus_fitted
rownames
1           1.754923          1.754923
2           6.468892          6.468892
3          61.879646         61.879646
4         -44.978447        -44.978447
5          11.200045         11.200045
6          14.967783         14.967783
7          64.967783         64.967783
8         -88.318247        -88.318247
9         -28.318247        -28.318247
10        -27.015032        -27.015032
11        -46.604278        -46.604278
12        -37.836540        -37.836540
13        -59.479555        -59.479555
15         24.288183         24.288183
16         24.288183         24.288183
17         70.180644         70.180644
18         70.180644         70.180644
19        -46.873125        -46.873125
20        -12.283879        -12.283879
21       -103.516140       -103.516140
22          1.073106          1.073106
23        -69.337648        -69.337648
24        -25.569910        -25.569910
25        -10.980664        -10.980664
26         19.430090         19.430090
27          3.608582          3.608582
28        -40.266695        -40.266695
29         90.322552         90.322552
30        202.447275        202.447275
31         41.285967         41.285967
32         76.285967         76.285967
33        -18.643080        -18.643080
34        -33.572126        -33.572126
35        -91.036649        -91.036649
# Graph Residuals
# reference: https://stackoverflow.com/questions/51220918/python-plot-residuals-on-a-fitted-model
df = pd.DataFrame()
df['length_cm'] = explanatory_data
df['fitted_values'] = mdl_mass_vs_length.fittedvalues
df['residuals'] = mdl_mass_vs_length.resid
x = df['length_cm']
y = df['fitted_values']
dy = df['residuals']
fig, ax = plt.subplots()
ax.plot(x, y)            # fitted regression line
ax.scatter(x, y + dy)    # observed values (fitted + residual)
ax.vlines(x, y, y + dy)  # residuals drawn as vertical lines
plt.ylabel('Mass (g)')
plt.xlabel('Length (cm)')
plt.title('Plot of Residuals for OLS of Mass v Length')
plt.show()

.summary() method¶
The summary method shows a more extended printout of the details of the model. Let’s step through this piece by piece.
mdl_mass_vs_length.summary()
| Dep. Variable: | mass_g | R-squared: | 0.911 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.908 |
| Method: | Least Squares | F-statistic: | 328.4 |
| Date: | Mon, 23 Oct 2023 | Prob (F-statistic): | 2.18e-18 |
| Time: | 20:33:20 | Log-Likelihood: | -187.82 |
| No. Observations: | 34 | AIC: | 379.6 |
| Df Residuals: | 32 | BIC: | 382.7 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -1015.0498 | 91.184 | -11.132 | 0.000 | -1200.786 | -829.314 |
| length_cm | 54.1075 | 2.986 | 18.123 | 0.000 | 48.026 | 60.189 |
| Omnibus: | 9.859 | Durbin-Watson: | 1.368 |
|---|---|---|---|
| Prob(Omnibus): | 0.007 | Jarque-Bera (JB): | 9.210 |
| Skew: | 0.946 | Prob(JB): | 0.0100 |
| Kurtosis: | 4.709 | Cond. No. | 260. |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
First, we see the dependent variable that was used in the model, in addition to the type of regression. We also see some metrics on the performance of the model. These will be discussed later.
In the second part of the summary, we see details of the coefficients. The numbers in the first column (coef) are the ones contained in the params attribute. The numbers in the fourth column P>|t| are the p-values, which refer to statistical significance.
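If you want these numbers programmatically rather than by reading the printed summary, the fitted results object exposes them directly as attributes. A small sketch using the bream model from above:
# Pull summary numbers out as pandas objects instead of parsing text.
print(mdl_mass_vs_length.bse)         # standard errors (the 'std err' column)
print(mdl_mass_vs_length.pvalues)     # p-values (the 'P>|t|' column)
print(mdl_mass_vs_length.conf_int())  # 95% confidence intervals ([0.025, 0.975])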
Transforming variables¶
Sometimes, the relationship between the explanatory variable and the response variable may not be a straight line. To fit a linear regression model, you may need to transform the explanatory variable or the response variable, or both of them.
Perch dataset¶
Consider the perch in the fish dataset.
It’s not a linear relationship¶
The upward curve in the mass versus length data prevents us from drawing a straight line that follows it closely.
perch = fish[fish['species']=='Perch'].copy()  # .copy() avoids a SettingWithCopyWarning when we add a column later
print(perch.head())
mass_g length_cm Length2 Length3 Height Width SpeciesCd \
rownames
104 5.9 7.5 8.4 8.8 24.0 16.0 7
105 32.0 12.5 13.7 14.7 24.0 13.6 7
106 40.0 13.8 15.0 16.0 23.9 15.2 7
107 51.5 15.0 16.2 17.2 26.7 15.3 7
108 70.0 15.7 17.4 18.5 24.8 15.9 7
species
rownames
104 Perch
105 Perch
106 Perch
107 Perch
108 Perch
sns.regplot(x='length_cm'
,y='mass_g'
,data=perch
,ci=None)
plt.ylabel('Mass (g)')
plt.xlabel('Length (cm)')
plt.title('Perch Mass v Length')
plt.show()

Bream vs. perch¶
To understand why the bream had a strong linear relationship between mass and length, but the perch didn’t, you need to understand your data. I’m not a fish expert, but looking at the picture of the bream on the left, it has a very narrow body. I guess that as bream get bigger, they mostly get longer and not wider. By contrast, the perch on the right has a round body, so I guess that as it grows, it gets fatter and taller as well as longer. Since perch grow in three directions at once, maybe the length cubed will give a better fit.
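We can put a rough number on that hunch before fitting anything by comparing correlations on the perch data. A quick exploratory check (exact values depend on the data, but the cubed length should correlate more strongly with mass):
# Compare how strongly mass correlates with length vs. length cubed for perch.
print(perch['mass_g'].corr(perch['length_cm']))       # linear length
print(perch['mass_g'].corr(perch['length_cm'] ** 3))  # cubed length: expect higher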

Plotting mass vs. length cubed¶
Here’s an update to the previous plot. The only change is that the x-axis is now length to the power of three. To do this, first create an additional column holding the length cubed. Then use this newly created column as x in your regplot call. The data points fit the line much better now, so we’re ready to run a model.
perch['length_cm_cubed'] = perch['length_cm'] ** 3
sns.regplot(x='length_cm_cubed'
,y='mass_g'
,data=perch
,ci=None)
plt.ylabel('Mass (g)')
plt.xlabel('Length (cm^3)')
plt.title('Perch Mass v Length Cubed')
plt.show()

Modeling mass vs. length cubed¶
To model this transformation, we replace the original length variable with the cubed length variable. We then fit the model and extract its coefficients.
mdl_perch = ols('mass_g ~ length_cm_cubed', data = perch).fit()
print(mdl_perch.params)
Intercept         -0.117478
length_cm_cubed    0.016796
dtype: float64
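As an aside, patsy formulas also accept inline transformations wrapped in I(), so the cubed column isn’t strictly necessary. A sketch (mdl_inline is a name introduced here for illustration) that should reproduce the same coefficients:
# Equivalent fit using an inline patsy transformation instead of a
# precomputed column; the params should match mdl_perch's.
mdl_inline = ols('mass_g ~ I(length_cm ** 3)', data=perch).fit()
print(mdl_inline.params)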
Predicting mass vs. length cubed¶
We create the explanatory DataFrame in the same way as usual. Notice that you specify the lengths cubed. We can also add the untransformed lengths column for reference. The code for adding predictions is the same assign and predict combination as you’ve seen before.
import numpy as np
explanatory_perch_data = pd.DataFrame({'length_cm' : np.arange(10,41,5)
,'length_cm_cubed' : np.arange(10,41,5) ** 3})
predicted_response = mdl_perch.predict(explanatory_perch_data)
prediction_data = explanatory_perch_data.assign(mass_g = predicted_response)
print(prediction_data.head())
   length_cm  length_cm_cubed      mass_g
0         10             1000   16.678135
1         15             3375   56.567717
2         20             8000  134.247429
3         25            15625  262.313982
4         30            27000  453.364084
Plotting mass vs. length cubed (1/2)¶
The predictions have been added to the plot of mass versus length cubed as red points. As you might expect, they follow the line drawn by regplot.
fig = plt.figure()
sns.regplot(x = 'length_cm_cubed'
,y = 'mass_g'
,data = perch
,ci = None
)
sns.scatterplot(x = 'length_cm_cubed'
,y = 'mass_g'
,data = prediction_data
,marker = 's'
,s = 200
,color = 'red')
plt.ylabel('Mass (g)')
plt.xlabel('Length (cm^3)')
plt.title('Perch Mass v Length Cubed')
plt.show()

Plotting mass vs. length cubed (2/2)¶
This gets more interesting on the original x-axis. Notice how the red points curve upwards to follow the data. Your linear model has non-linear predictions, after the transformation is undone.
fig = plt.figure()
sns.scatterplot(x = 'length_cm'
,y = 'mass_g'
,data = perch
)
sns.scatterplot(x = 'length_cm'
,y = 'mass_g'
,data = prediction_data
,marker = 's'
,s = 200
,color = 'red')
plt.ylabel('Mass (g)')
plt.xlabel('Length (cm)')
plt.title('Perch Mass v Length (predictions from the cubed model)')
plt.show()

Quantifying model fit¶
It’s usually essential to know whether or not predictions from your model are nonsense. In this chapter, we’ll look at ways of quantifying how good your model is.
Bream and perch models¶
Previously, you ran models on mass versus length for bream and perch. By merely looking at these scatter plots, you can get a sense that there is a linear relationship between mass and length for bream but not for perch. It would be useful to quantify how strong that linear relationship is.
Coefficient of determination¶
The first metric we’ll discuss is the coefficient of determination. This is sometimes called “r-squared”. For boring historical reasons, it’s written with a lower case r for simple linear regression and an upper case R when you have more than one explanatory variable. It is defined as the proportion of the variance in the response variable that is predictable from the explanatory variable. We’ll get to a human-readable explanation shortly. A score of one means you have a perfect fit, and a score of zero means your model is no better than randomness. What constitutes a good score depends on your dataset. A score of zero-point five on a psychological experiment may be exceptionally high because humans are inherently hard to predict, but in other cases, a score of zero-point nine may be considered a poor fit.
- The proportion of the variance in the response variable that is predictable from the explanatory variable.
- 1 means a perfect fit
- 0 means the worst possible fit
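We can check this definition directly on the bream model: compute one minus the variance of the residuals over the variance of the response. A minimal sketch; with an intercept in the model, this reproduces r-squared exactly:
# r-squared from its definition: the proportion of variance in the response
# explained by the model (pandas .var() skips the missing mass value).
print(1 - mdl_bream.resid.var() / bream['mass_g'].var())  # ~0.911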
.summary()¶
The dot summary method shows several performance metrics in its output. The coefficient of determination is written in the first line and titled “R-squared”. Its value is about zero-point-nine-one.
print(mdl_bream.summary())
OLS Regression Results
==============================================================================
Dep. Variable: mass_g R-squared: 0.911
Model: OLS Adj. R-squared: 0.908
Method: Least Squares F-statistic: 328.4
Date: Tue, 19 Dec 2023 Prob (F-statistic): 2.18e-18
Time: 17:42:23 Log-Likelihood: -187.82
No. Observations: 34 AIC: 379.6
Df Residuals: 32 BIC: 382.7
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -1015.0498 91.184 -11.132 0.000 -1200.786 -829.314
length_cm 54.1075 2.986 18.123 0.000 48.026 60.189
==============================================================================
Omnibus: 9.859 Durbin-Watson: 1.368
Prob(Omnibus): 0.007 Jarque-Bera (JB): 9.210
Skew: 0.946 Prob(JB): 0.0100
Kurtosis: 4.709 Cond. No. 260.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
.rsquared attribute¶
Since the output of dot summary isn’t easy to work with, a better way to extract the metric is to use the rsquared attribute, which contains the r-squared value as a float.
print('Rsquared:',mdl_bream.rsquared)
Rsquared: 0.911218550754384
It’s just correlation squared¶
For simple linear regression, the interpretation of the coefficient of determination (r-squared) is straightforward. It is simply the correlation between the explanatory and response variables, squared.
# coeff_determination = independent_var.corr(dependent_var) ** 2
coeff_determination = bream['length_cm'].corr(bream['mass_g']) ** 2
print('Coefficient of determination:',coeff_determination)
Coefficient of determination: 0.9112185507543842
Residual standard error (RSE)¶
The second metric we’ll look at is the residual standard error, or RSE. Recall that each residual is the difference between a predicted value and an observed value. The RSE is, very roughly speaking, a measure of the typical size of the residuals; that is, how much the predictions are typically wrong. It has the same unit as the response variable; in the fish models, the response unit is grams. A related, but less commonly used metric is the mean squared error, or MSE, which is simply the RSE squared (see the by-hand check after the list below).
- A “typical” difference between a predicted value and an observed value
- It has the same unit as the response variable
- $MSE = RSE^{2}$
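The relationship can be verified from first principles: square the residuals, sum them, divide by the residual degrees of freedom, and take the square root. A sketch using the bream model (the .mse_resid shortcut appears below):
# RSE by hand: sqrt of the residual sum of squares over the residual
# degrees of freedom (n - 2 for simple linear regression).
rss = (mdl_bream.resid ** 2).sum()
print((rss / mdl_bream.df_resid) ** 0.5)  # should match mdl_bream.mse_resid ** 0.5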
x = df['length_cm']
y = df['fitted_values']
dy = df['residuals']
fig, ax = plt.subplots()
ax.plot(x,y)
ax.scatter(x,y+dy)
ax.vlines(x,y,y+dy)
plt.ylabel('Mass (g)')
plt.xlabel('Length (cm)')
plt.title('Plot of Residuals for OLS of Mass v Length')
plt.show()

.mse_resid attribute¶
The summary method unfortunately doesn’t contain the RSE. However, it can be retrieved indirectly from the .mse_resid attribute, which contains the mean squared error of the residuals. We calculate the RSE by taking the square root of the MSE, so the RSE has the same unit as the response variable. The RSE for the bream model is about sixty-three grams.
print('mse:', mdl_bream.mse_resid)
print('rse:', mdl_bream.mse_resid ** 0.5)
mse: 3908.1039073862407
rse: 62.51482949977742