I ran a simple GLM experiment in statsmodels and was perplexed to find that the GLM results object does not contain any R^2 attribute.
I feel like I'm missing something simple about why GLM has no R^2 calculation, and about how I can compute it myself.
Thanks!
In [2]: import pandas as pd
In [3]: import numpy as np
In [4]: import statsmodels.api as sm
In [5]: x = np.arange(0,10,0.5)
In [6]:
In [6]: y = np.zeros(len(x))
In [7]: y[0] = 0
In [8]: for i in range(1,len(x)):
...: y[i] = 0.5*x[i] + 2.5*y[i-1] + 10*np.random.rand()
...:
In [9]: print(y)
[ 0.00000000e+00 9.35177024e-01 8.18487881e+00 2.95126464e+01
8.08584645e+01 2.11423251e+02 5.38685230e+02 1.35653420e+03
3.39564225e+03 8.49234338e+03 2.12377817e+04 5.31015961e+04
1.32764789e+05 3.31924691e+05 8.29818265e+05 2.07455796e+06
5.18640343e+06 1.29660216e+07 3.24150658e+07 8.10376747e+07]
In [10]: X = pd.DataFrame({'x1':x[1:],'y-Lag1':y[:-1]})
In [11]: m1 = sm.GLM(y[1:],X).fit()
In [12]: m1.summary()
Out[12]:
<class 'statsmodels.iolib.summary.Summary'>
"""
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 19
Model: GLM Df Residuals: 17
Model Family: Gaussian Df Model: 1
Link Function: identity Scale: 12.9022715725
Method: IRLS Log-Likelihood: -50.199
Date: Thu, 23 Oct 2014 Deviance: 219.34
Time: 13:44:22 Pearson chi2: 219.
No. Iterations: 3
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1 1.5746 0.175 8.999 0.000 1.232 1.918
y-Lag1 2.5000 1.23e-07 2.03e+07 0.000 2.500 2.500
==============================================================================
"""
In [13]: m1.
m1.aic m1.llf m1.remove_data
m1.bic m1.load m1.resid_anscombe
m1.bse m1.model m1.resid_deviance
m1.conf_int m1.mu m1.resid_pearson
m1.cov_params m1.nobs m1.resid_response
m1.deviance m1.norm_resid m1.resid_working
m1.df_model m1.normalized_cov_params m1.save
m1.df_resid m1.null m1.scale
m1.f_test m1.null_deviance m1.summary
m1.family m1.params m1.summary2
m1.fit_history m1.pearson_chi2 m1.t_test
m1.fittedvalues m1.pinv_wexog m1.tvalues
m1.initialize m1.predict
m1.k_constant m1.pvalues
For GLM with Gaussian errors and the identity link, R^2 makes sense (if the model has a constant), but it doesn't make sense as a general goodness of fit measure for GLM. You can file an enhancement request (or better yet a pull request) for including some better, more general goodness of fit statistics in the GLM results.
You can read some more about this in the context of R here.
Not sure why it's implemented this way, but the SO answer above and some wikipedia-ing allowed me to calculate the R^2 by hand easily. Note that the model was fit on y[1:], so the total sum of squares has to use y[1:] as well, to match the length of resid_response:
sst_val = np.sum((y[1:] - np.mean(y[1:]))**2)  # total sum of squares of the fitted response
sse_val = np.sum(m1.resid_response**2)         # residual sum of squares
r2 = 1.0 - sse_val/sst_val