27 Linear Models in Python

  • Python package statmodels cover most of frequently used statistical models including linear model, generalised linear model, mixed effects model, mutivariate models etc.

  • The statmodels package works seamlessly with pandas library


27.1 Python: Linear Model

  • Python package statmodels fits a comprehensive range of statistical models and statistical tests

  • statmodels: statistical models, hypothesis tests, and data exploration

  • statmodels covers a comprehensive range of statistical models: Linear Models, Generalised Linear Models, Linear Mixed Models, Generalised Additive Models, Generalised Linear Mixed Models, Time Series analysis, Survival analysis, Nonparametric methods, Multivariate Statistics.

  • Check statmodels user guide for further details

  • statsmodels supports specifying models using R-style formulas and pandas DataFrames.

  • The approach to the plotting is very similar to R scripts

  • Install statmodels and its dependencies from PyPI repository: pip install statmodels


27.1.1 Steps of model fitting

  • Import statmodels; import statsmodels.api as sm

  • Import other relevant libraries: pandas, numpy

  • Read the pandas DataFrame

  • Use the specified model function

  • Include the model formula: y ~ x1 + x2 + x3

  • Assign the fitted model to a Python object

  • Explore the fitted model


27.1.2 Laod Libraries & Read Data

Import statmodels

import statsmodels.api as sm

import statsmodels.formula.api as smf

Import other libraries

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

Set the working directory to the data folder

Read the iris dataset as a Python object DF

DF = pd.read_csv('iris.csv')


27.1.3 Fit the model

Note:

  • First describe the model

  • The formula interface is enclose with a quote marks

  • C() indicates a categorical variable`

  • Fit the model using the fit method; you can also include fit with the describe script


# Describe the model
model = smf.ols('SepalLength ~ PetalLength + C(Species)', data = DF)

# Fit the model
fm_ols = model.fit()

# Describe and fit
# fm_fit = smf.ols('SepalLength ~ PetalLength + C(Species)', data = DF).fit()

27.1.4 Class ols components


dir(fm_ols)
['HC0_se', 'HC1_se', 'HC2_se', 'HC3_se', '_HCCM', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abat_diagonal', '_cache', '_data_attr', '_data_in_cache', '_get_robustcov_results', '_is_nested', '_use_t', '_wexog_singular_values', 'aic', 'bic', 'bse', 'centered_tss', 'compare_f_test', 'compare_lm_test', 'compare_lr_test', 'condition_number', 'conf_int', 'conf_int_el', 'cov_HC0', 'cov_HC1', 'cov_HC2', 'cov_HC3', 'cov_kwds', 'cov_params', 'cov_type', 'df_model', 'df_resid', 'eigenvals', 'el_test', 'ess', 'f_pvalue', 'f_test', 'fittedvalues', 'fvalue', 'get_influence', 'get_prediction', 'get_robustcov_results', 'info_criteria', 'initialize', 'k_constant', 'llf', 'load', 'model', 'mse_model', 'mse_resid', 'mse_total', 'nobs', 'normalized_cov_params', 'outlier_test', 'params', 'predict', 'pvalues', 'remove_data', 'resid', 'resid_pearson', 'rsquared', 'rsquared_adj', 'save', 'scale', 'ssr', 'summary', 'summary2', 't_test', 't_test_pairwise', 'tvalues', 'uncentered_tss', 'use_t', 'wald_test', 'wald_test_terms', 'wresid']

27.1.5 Explore the model


print(fm_ols.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:            SepalLength   R-squared:                       0.837
Model:                            OLS   Adj. R-squared:                  0.833
Method:                 Least Squares   F-statistic:                     249.4
Date:                Sat, 07 Oct 2023   Prob (F-statistic):           3.10e-57
Time:                        15:25:59   Log-Likelihood:                -48.116
No. Observations:                 150   AIC:                             104.2
Df Residuals:                     146   BIC:                             116.3
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
============================================================================================
                               coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------
Intercept                    3.6835      0.106     34.719      0.000       3.474       3.893
C(Species)[T.versicolor]    -1.6010      0.193     -8.275      0.000      -1.983      -1.219
C(Species)[T.virginica]     -2.1177      0.273     -7.744      0.000      -2.658      -1.577
PetalLength                  0.9046      0.065     13.962      0.000       0.777       1.033
==============================================================================
Omnibus:                        0.578   Durbin-Watson:                   1.802
Prob(Omnibus):                  0.749   Jarque-Bera (JB):                0.672
Skew:                           0.140   Prob(JB):                        0.715
Kurtosis:                       2.830   Cond. No.                         54.0
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

27.1.6 Check model assumptions

Note

  • It is not trivial to produce residuals plots with a call to a simple function

  • You need to to compute the residuals and then plot each of the residuals plots.

  • You can run the following script to create the plots

  • The full discussion of the script is beyond the scope of this tutorial but detailed instructions and code can be found here.

Click to toggle script window
import statsmodels.api as sm

import matplotlib.pyplot as plt
import seaborn as sns

def graph(formula, x_range, label=None):
    """
    Helper function for plotting cook's distance lines
    """
    x = x_range
    y = formula(x)
    plt.plot(x, y, label=label, lw=1, ls='--', color='red')



# Convert categorical string to codes
DF['Class'] = pd.Categorical(DF.Species).codes

# create dataframe from X, y for easier plot handling
X = DF['SepalLength']
y = DF[['PetalLength', 'Class']]
dataframe = pd.concat([X, y], axis=1)

# print(dataframe)

# model values
model_fitted_y = fm_ols.fittedvalues
# model residuals
model_residuals = fm_ols.resid
# normalized residuals
model_norm_residuals = fm_ols.get_influence().resid_studentized_internal
# absolute squared normalized residuals
model_norm_residuals_abs_sqrt = np.sqrt(np.abs(model_norm_residuals))
# absolute residuals
model_abs_resid = np.abs(model_residuals)
# leverage, from statsmodels internals
model_leverage = fm_ols.get_influence().hat_matrix_diag
# cook's distance, from statsmodels internals
model_cooks = fm_ols.get_influence().cooks_distance[0]

plot_lm_1 = plt.figure()
plot_lm_1.axes[0] = sns.residplot(model_fitted_y, dataframe.columns[-1], data=dataframe,
                          lowess=True,
                          scatter_kws={'alpha': 0.5},
                          line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})

plot_lm_1.axes[0].set_title('Residuals vs Fitted')
plot_lm_1.axes[0].set_xlabel('Fitted values')
plot_lm_1.axes[0].set_ylabel('Residuals');

# annotations
abs_resid = model_abs_resid.sort_values(ascending=False)
abs_resid_top_3 = abs_resid[:3]
for i in abs_resid_top_3.index:
    plot_lm_1.axes[0].annotate(i,
                                xy=(model_fitted_y[i],
                                    model_residuals[i]));

QQ = sm.ProbPlot(model_norm_residuals)
plot_lm_2 = QQ.qqplot(line='45', alpha=0.5, color='#4C72B0', lw=1)
plot_lm_2.axes[0].set_title('Normal Q-Q')
plot_lm_2.axes[0].set_xlabel('Theoretical Quantiles')
plot_lm_2.axes[0].set_ylabel('Standardized Residuals');

# annotations
abs_norm_resid = np.flip(np.argsort(np.abs(model_norm_residuals)), 0)
abs_norm_resid_top_3 = abs_norm_resid[:3]
for r, i in enumerate(abs_norm_resid_top_3):
    plot_lm_2.axes[0].annotate(i,
                                xy=(np.flip(QQ.theoretical_quantiles, 0)[r],
                                    model_norm_residuals[i]));

plot_lm_3 = plt.figure()
plt.scatter(model_fitted_y, model_norm_residuals_abs_sqrt, alpha=0.5);
sns.regplot(model_fitted_y, model_norm_residuals_abs_sqrt,
            scatter=False,
            ci=False,
            lowess=True,
            line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8});
plot_lm_3.axes[0].set_title('Scale-Location')
plot_lm_3.axes[0].set_xlabel('Fitted values')
plot_lm_3.axes[0].set_ylabel('$\sqrt{|Standardized Residuals|}$');

# annotations
abs_sq_norm_resid = np.flip(np.argsort(model_norm_residuals_abs_sqrt), 0)
abs_sq_norm_resid_top_3 = abs_sq_norm_resid[:3]
for i in abs_norm_resid_top_3:
    plot_lm_3.axes[0].annotate(i,
                                xy=(model_fitted_y[i],
                                    model_norm_residuals_abs_sqrt[i]));


plot_lm_4 = plt.figure();
plt.scatter(model_leverage, model_norm_residuals, alpha=0.5);
sns.regplot(model_leverage, model_norm_residuals,
            scatter=False,
            ci=False,
            lowess=True,
            line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8});
plot_lm_4.axes[0].set_xlim(0, max(model_leverage)+0.01)
plot_lm_4.axes[0].set_ylim(-3, 5)
plot_lm_4.axes[0].set_title('Residuals vs Leverage')
plot_lm_4.axes[0].set_xlabel('Leverage')
plot_lm_4.axes[0].set_ylabel('Standardized Residuals');

# annotations
leverage_top_3 = np.flip(np.argsort(model_cooks), 0)[:3]
for i in leverage_top_3:
    plot_lm_4.axes[0].annotate(i,
                                xy=(model_leverage[i],
                                    model_norm_residuals[i]));

p = len(fm_ols.params) # number of model parameters
graph(lambda x: np.sqrt((0.5 * p * (1 - x)) / x),
      np.linspace(0.001, max(model_leverage), 50),
      'Cook\'s distance') # 0.5 line
graph(lambda x: np.sqrt((1 * p * (1 - x)) / x),
      np.linspace(0.001, max(model_leverage), 50)) # 1 line
plot_lm_4.legend(loc='upper right');


Check statsmodels help file for further details on linear regression diagnostics