
Why doesn't statsmodels GLM have R^2 in results?

I ran a simple GLM experiment in statsmodels and was perplexed to find that the GLM results did not contain any R^2 attribute.

I suspect there is a simple reason why GLM does not report an R^2, and that there are ways I can calculate it myself.

Thanks!

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: import statsmodels.api as sm

In [4]: x = np.arange(0, 10, 0.5)

In [5]: y = np.zeros(len(x))

In [6]: y[0] = 0

In [7]: for i in range(1, len(x)):
   ...:     y[i] = 0.5*x[i] + 2.5*y[i-1] + 10*np.random.rand()
   ...:

In [8]: print y
[  0.00000000e+00   9.35177024e-01   8.18487881e+00   2.95126464e+01
   8.08584645e+01   2.11423251e+02   5.38685230e+02   1.35653420e+03
   3.39564225e+03   8.49234338e+03   2.12377817e+04   5.31015961e+04
   1.32764789e+05   3.31924691e+05   8.29818265e+05   2.07455796e+06
   5.18640343e+06   1.29660216e+07   3.24150658e+07   8.10376747e+07]

In [9]: X = pd.DataFrame({'x1': x[1:], 'y-Lag1': y[:-1]})

In [10]: m1 = sm.GLM(y[1:], X).fit()

In [11]: m1.summary()
Out[11]: 
<class 'statsmodels.iolib.summary.Summary'>
"""
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                      y   No. Observations:                   19
Model:                            GLM   Df Residuals:                       17
Model Family:                Gaussian   Df Model:                            1
Link Function:               identity   Scale:                   12.9022715725
Method:                          IRLS   Log-Likelihood:                -50.199
Date:                Thu, 23 Oct 2014   Deviance:                       219.34
Time:                        13:44:22   Pearson chi2:                     219.
No. Iterations:                     3                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             1.5746      0.175      8.999      0.000         1.232     1.918
y-Lag1         2.5000   1.23e-07   2.03e+07      0.000         2.500     2.500
==============================================================================
"""

In [12]: m1.<TAB>
m1.aic                    m1.llf                    m1.remove_data
m1.bic                    m1.load                   m1.resid_anscombe
m1.bse                    m1.model                  m1.resid_deviance
m1.conf_int               m1.mu                     m1.resid_pearson
m1.cov_params             m1.nobs                   m1.resid_response
m1.deviance               m1.norm_resid             m1.resid_working
m1.df_model               m1.normalized_cov_params  m1.save
m1.df_resid               m1.null                   m1.scale
m1.f_test                 m1.null_deviance          m1.summary
m1.family                 m1.params                 m1.summary2
m1.fit_history            m1.pearson_chi2           m1.t_test
m1.fittedvalues           m1.pinv_wexog             m1.tvalues
m1.initialize             m1.predict                
m1.k_constant             m1.pvalues         
asked Oct 24 '14 by Max Song


2 Answers

For a GLM with Gaussian errors and the identity link, R^2 makes sense (if the model has a constant), but it doesn't make sense as a general goodness-of-fit measure for GLMs. You can file an enhancement request (or, better yet, a pull request) to include some better, more general goodness-of-fit statistics in the GLM results.

You can read some more about this in the context of R here.
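As one example of a more general measure (this sketch is my addition, not part of the original answer): a deviance-based pseudo-R^2 can be computed from attributes that the GLM results object already exposes, deviance and null_deviance. For the Gaussian family with an identity link it reduces to the ordinary centered R^2, since the deviance is the residual sum of squares and the null deviance is the total sum of squares around the mean:

# fraction of the null (intercept-only) deviance explained by the model
pseudo_r2 = 1.0 - m1.deviance / m1.null_deviance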

answered by jseabold


Not sure why it's implemented this way, but the answer above and some Wikipedia reading allowed me to calculate the R^2 by hand easily:

# the model endog is y[1:], so compute the total sum of squares over the same values
sst_val = np.sum((y[1:] - np.mean(y[1:]))**2)
sse_val = np.sum(m1.resid_response**2)
r2 = 1.0 - sse_val / sst_val
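As a cross-check (my addition, not part of the original answer): a Gaussian GLM with the identity link is the same fit as ordinary least squares, so the coefficients can be compared against sm.OLS. Note that since X contains no constant, the rsquared reported by OLS is the uncentered version (1 - SSE / sum(y**2)) and will not equal the centered value computed above:

# same fit via OLS; ols_fit is an illustrative name
ols_fit = sm.OLS(y[1:], X).fit()
assert np.allclose(ols_fit.params, m1.params)
ols_r2 = ols_fit.rsquared  # uncentered R^2, because the model has no constant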
answered by Max Song