27 Linear Models in Python
- The Python package statsmodels covers most of the frequently used statistical models, including linear models, generalised linear models, mixed effects models, multivariate models, etc.
- The statsmodels package works seamlessly with the pandas library.
27.1 Python: Linear Model
The Python package statsmodels fits a comprehensive range of statistical models and performs statistical tests (statsmodels: statistical models, hypothesis tests, and data exploration).
statsmodels covers a comprehensive range of statistical models: linear models, generalised linear models, linear mixed models, generalised additive models, generalised linear mixed models, time series analysis, survival analysis, nonparametric methods, and multivariate statistics. Check the statsmodels user guide for further details.
statsmodels supports specifying models using R-style formulas and pandas DataFrames. The approach to plotting is very similar to R scripts.
Install statsmodels and its dependencies from the PyPI repository:
pip install statsmodels
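To confirm that the installation succeeded, import the package and print its version (the exact number depends on your environment):

import statsmodels
print(statsmodels.__version__)  # e.g. '0.14.x'; confirms the package is importable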
27.1.1 Steps of model fitting
- Import statsmodels: import statsmodels.api as sm
- Import other relevant libraries: pandas, numpy
- Read the data into a pandas DataFrame
- Use the appropriate model function
- Include the model formula: y ~ x1 + x2 + x3
- Assign the fitted model to a Python object
- Explore the fitted model
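Putting these steps together, here is a minimal end-to-end sketch; the DataFrame df, its columns, and the coefficients are simulated purely for illustration.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate illustrative data (hypothetical; any DataFrame with these columns works)
rng = np.random.default_rng(42)
df = pd.DataFrame({'x1': rng.normal(size=100),
                   'x2': rng.normal(size=100),
                   'x3': rng.normal(size=100)})
df['y'] = 1.0 + 2.0 * df['x1'] - 0.5 * df['x2'] + rng.normal(size=100)

# Describe and fit in a single call, then explore the result
fm = smf.ols('y ~ x1 + x2 + x3', data=df).fit()
print(fm.summary())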
27.1.2 Load Libraries & Read Data
Import statsmodels
import statsmodels.api as sm
import statsmodels.formula.api as smf
Import other libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Set the working directory to the data folder
Read the iris dataset into a Python object DF
DF = pd.read_csv('iris.csv')
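A quick sanity check after loading is worthwhile; this assumes iris.csv contains the columns used later in this section (SepalLength, PetalLength, and Species, among others):

print(DF.shape)               # the classic iris data has 150 rows
print(DF.columns.tolist())    # confirm the column names match the model formula
print(DF.head())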
27.1.3 Fit the model
Note:
- First describe the model
- The formula interface is enclosed in quote marks
- C() indicates a categorical variable
- Fit the model using the fit method; you can also combine describing and fitting in a single call
# Describe the model
model = smf.ols('SepalLength ~ PetalLength + C(Species)', data = DF)

# Fit the model
fm_ols = model.fit()

# Describe and fit
# fm_ols = smf.ols('SepalLength ~ PetalLength + C(Species)', data = DF).fit()
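By default, C() uses treatment (dummy) coding with the first level in alphabetical order (here setosa) as the reference category. The formula interface also accepts an explicit Treatment contrast if you want a different baseline; a minimal sketch, with virginica as an illustrative choice of reference level:

# Hypothetical variant: make 'virginica' the reference level for Species
model_ref = smf.ols("SepalLength ~ PetalLength + C(Species, Treatment(reference='virginica'))",
                    data = DF)
fm_ref = model_ref.fit()
print(fm_ref.params)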
27.1.4 Class ols components
dir(fm_ols)
['HC0_se', 'HC1_se', 'HC2_se', 'HC3_se', '_HCCM', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abat_diagonal', '_cache', '_data_attr', '_data_in_cache', '_get_robustcov_results', '_is_nested', '_use_t', '_wexog_singular_values', 'aic', 'bic', 'bse', 'centered_tss', 'compare_f_test', 'compare_lm_test', 'compare_lr_test', 'condition_number', 'conf_int', 'conf_int_el', 'cov_HC0', 'cov_HC1', 'cov_HC2', 'cov_HC3', 'cov_kwds', 'cov_params', 'cov_type', 'df_model', 'df_resid', 'eigenvals', 'el_test', 'ess', 'f_pvalue', 'f_test', 'fittedvalues', 'fvalue', 'get_influence', 'get_prediction', 'get_robustcov_results', 'info_criteria', 'initialize', 'k_constant', 'llf', 'load', 'model', 'mse_model', 'mse_resid', 'mse_total', 'nobs', 'normalized_cov_params', 'outlier_test', 'params', 'predict', 'pvalues', 'remove_data', 'resid', 'resid_pearson', 'rsquared', 'rsquared_adj', 'save', 'scale', 'ssr', 'summary', 'summary2', 't_test', 't_test_pairwise', 'tvalues', 'uncentered_tss', 'use_t', 'wald_test', 'wald_test_terms', 'wresid']
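Each name returned by dir() can be accessed directly on the fitted model, which is often more convenient than parsing the printed summary; for example:

print(fm_ols.params)       # estimated coefficients
print(fm_ols.rsquared)     # coefficient of determination
print(fm_ols.conf_int())   # 95% confidence intervals for the coefficients
print(fm_ols.aic)          # Akaike information criterion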
27.1.5 Explore the model
print(fm_ols.summary())
OLS Regression Results
==============================================================================
Dep. Variable: SepalLength R-squared: 0.837
Model: OLS Adj. R-squared: 0.833
Method: Least Squares F-statistic: 249.4
Date: Sat, 07 Oct 2023 Prob (F-statistic): 3.10e-57
Time: 15:25:59 Log-Likelihood: -48.116
No. Observations: 150 AIC: 104.2
Df Residuals: 146 BIC: 116.3
Df Model: 3
Covariance Type: nonrobust
============================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------
Intercept 3.6835 0.106 34.719 0.000 3.474 3.893
C(Species)[T.versicolor] -1.6010 0.193 -8.275 0.000 -1.983 -1.219
C(Species)[T.virginica] -2.1177 0.273 -7.744 0.000 -2.658 -1.577
PetalLength 0.9046 0.065 13.962 0.000 0.777 1.033
==============================================================================
Omnibus: 0.578 Durbin-Watson: 1.802
Prob(Omnibus): 0.749 Jarque-Bera (JB): 0.672
Skew: 0.140 Prob(JB): 0.715
Kurtosis: 2.830 Cond. No. 54.0
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
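The fitted model can also be used for prediction with the predict method; column names must match those used in the formula. A minimal sketch with made-up observations:

# Hypothetical new observations
new_data = pd.DataFrame({'PetalLength': [1.5, 4.5],
                         'Species': ['setosa', 'versicolor']})
print(fm_ols.predict(new_data))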
27.1.6 Check model assumptions
Note
- It is not trivial to produce the residual plots with a single function call
- You need to compute the residuals and then draw each residual plot yourself
- You can run the following script to create the plots
- A full discussion of the script is beyond the scope of this tutorial, but detailed instructions and code can be found here.
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
def graph(formula, x_range, label=None):
"""
Helper function for plotting cook's distance lines
"""
    x = x_range
    y = formula(x)
    plt.plot(x, y, label=label, lw=1, ls='--', color='red')

# Convert categorical string to codes
DF['Class'] = pd.Categorical(DF.Species).codes

# create dataframe from X, y for easier plot handling
y = DF['SepalLength']
X = DF[['PetalLength', 'Class']]
dataframe = pd.concat([X, y], axis=1)
# print(dataframe)

# model values
model_fitted_y = fm_ols.fittedvalues
# model residuals
model_residuals = fm_ols.resid
# normalized residuals
model_norm_residuals = fm_ols.get_influence().resid_studentized_internal
# absolute squared normalized residuals
model_norm_residuals_abs_sqrt = np.sqrt(np.abs(model_norm_residuals))
# absolute residuals
model_abs_resid = np.abs(model_residuals)
# leverage, from statsmodels internals
model_leverage = fm_ols.get_influence().hat_matrix_diag
# cook's distance, from statsmodels internals
model_cooks = fm_ols.get_influence().cooks_distance[0]

# Plot 1: residuals vs fitted values
plot_lm_1 = plt.figure()
plot_lm_1.axes[0] = sns.residplot(x=model_fitted_y, y=dataframe.columns[-1], data=dataframe,
                                  lowess=True,
                                  scatter_kws={'alpha': 0.5},
                                  line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})

plot_lm_1.axes[0].set_title('Residuals vs Fitted')
plot_lm_1.axes[0].set_xlabel('Fitted values')
plot_lm_1.axes[0].set_ylabel('Residuals');

# annotations: label the three largest absolute residuals
abs_resid = model_abs_resid.sort_values(ascending=False)
abs_resid_top_3 = abs_resid[:3]
for i in abs_resid_top_3.index:
    plot_lm_1.axes[0].annotate(i,
                               xy=(model_fitted_y[i],
                                   model_residuals[i]));

# Plot 2: normal Q-Q plot of the standardized residuals
QQ = sm.ProbPlot(model_norm_residuals)
plot_lm_2 = QQ.qqplot(line='45', alpha=0.5, color='#4C72B0', lw=1)
plot_lm_2.axes[0].set_title('Normal Q-Q')
plot_lm_2.axes[0].set_xlabel('Theoretical Quantiles')
plot_lm_2.axes[0].set_ylabel('Standardized Residuals');

# annotations
abs_norm_resid = np.flip(np.argsort(np.abs(model_norm_residuals)), 0)
abs_norm_resid_top_3 = abs_norm_resid[:3]
for r, i in enumerate(abs_norm_resid_top_3):
    plot_lm_2.axes[0].annotate(i,
                               xy=(np.flip(QQ.theoretical_quantiles, 0)[r],
                                   model_norm_residuals[i]));

# Plot 3: scale-location
plot_lm_3 = plt.figure()
plt.scatter(model_fitted_y, model_norm_residuals_abs_sqrt, alpha=0.5);
sns.regplot(x=model_fitted_y, y=model_norm_residuals_abs_sqrt,
            scatter=False,
            ci=False,
            lowess=True,
            line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8});
plot_lm_3.axes[0].set_title('Scale-Location')
plot_lm_3.axes[0].set_xlabel('Fitted values')
plot_lm_3.axes[0].set_ylabel(r'$\sqrt{|Standardized Residuals|}$');

# annotations
abs_sq_norm_resid = np.flip(np.argsort(model_norm_residuals_abs_sqrt), 0)
abs_sq_norm_resid_top_3 = abs_sq_norm_resid[:3]
for i in abs_sq_norm_resid_top_3:
    plot_lm_3.axes[0].annotate(i,
                               xy=(model_fitted_y[i],
                                   model_norm_residuals_abs_sqrt[i]));

# Plot 4: residuals vs leverage
plot_lm_4 = plt.figure();
plt.scatter(model_leverage, model_norm_residuals, alpha=0.5);
sns.regplot(x=model_leverage, y=model_norm_residuals,
            scatter=False,
            ci=False,
            lowess=True,
            line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8});
plot_lm_4.axes[0].set_xlim(0, max(model_leverage) + 0.01)
plot_lm_4.axes[0].set_ylim(-3, 5)
plot_lm_4.axes[0].set_title('Residuals vs Leverage')
plot_lm_4.axes[0].set_xlabel('Leverage')
plot_lm_4.axes[0].set_ylabel('Standardized Residuals');

# annotations: the three most influential points by Cook's distance
leverage_top_3 = np.flip(np.argsort(model_cooks), 0)[:3]
for i in leverage_top_3:
    plot_lm_4.axes[0].annotate(i,
                               xy=(model_leverage[i],
                                   model_norm_residuals[i]));

# Cook's distance contours
p = len(fm_ols.params)  # number of model parameters
graph(lambda x: np.sqrt((0.5 * p * (1 - x)) / x),
      np.linspace(0.001, max(model_leverage), 50),
      'Cook\'s distance')  # 0.5 line
graph(lambda x: np.sqrt((1 * p * (1 - x)) / x),
      np.linspace(0.001, max(model_leverage), 50))  # 1 line
plot_lm_4.legend(loc='upper right');
Check the statsmodels help file for further details on linear regression diagnostics.