Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to get R-squared for robust regression (RLM) in Statsmodels?

When it comes to measuring goodness of fit - R-Squared seems to be a commonly understood (and accepted) measure for "simple" linear models. But in case of statsmodels (as well as other statistical software) RLM does not include R-squared together with regression results. Is there a way to get it calculated "manually", perhaps in a way similar to how it is done in Stata?

Or is there another measure that can be used / calculated from the results produced by sm.RLS?

This is what Statsmodels is producing:

import numpy as np
import statsmodels.api as sm

# Sample Data with outliers
nsample = 50
x = np.linspace(0, 20, nsample)
x = sm.add_constant(x)
sig = 0.3
beta = [5, 0.5]
y_true = np.dot(x, beta)
y = y_true + sig * 1. * np.random.normal(size=nsample)
y[[39,41,43,45,48]] -= 5   # add some outliers (10% of nsample)

# Regression with Robust Linear Model
res = sm.RLM(y, x).fit()
print(res.summary())

Which outputs:

                    Robust linear Model Regression Results                    
==============================================================================
Dep. Variable:                      y   No. Observations:                   50
Model:                            RLM   Df Residuals:                       48
Method:                          IRLS   Df Model:                            1
Norm:                          HuberT                                         
Scale Est.:                       mad                                         
Cov Type:                          H1                                         
Date:                 Mo, 27 Jul 2015                                         
Time:                        10:00:00                                         
No. Iterations:                    17                                         
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          5.0254      0.091     55.017      0.000         4.846     5.204
x1             0.4845      0.008     61.555      0.000         0.469     0.500
==============================================================================
like image 765
Primer Avatar asked Jul 27 '15 14:07

Primer


People also ask

Does robust regression have r squared?

Robust regression uses Iteratively Reweighted Least Squares(IRLS) for Maximum Likelihood Estimation(MLE) whereas linear regression uses Ordinary Least Squares(OLS), which is the reason R-squared(coefficient of determination) is returned by lm() and not by rlm().

Is R2 robust to outliers?

In section 3 we have introduced an alternative robust version of R2 based on LTS estimators. The influence function of R2 is discussed in section 4. are much more robust than others with both vertical and leverage points outliers.

What does it mean for a regression to be robust?

Robust regression is an alternative to least squares regression when data are contaminated with outliers or influential observations, and it can also be used for the purpose of detecting influential observations.


2 Answers

Since an OLS return the R2, I would suggest regressing the actual values against the fitted values using simple linear regression. Regardless where the fitted values come from, such an approach would provide you an indication of the corresponding R2.

like image 184
majeed simaan Avatar answered Sep 24 '22 01:09

majeed simaan


R2 is not a good measure of goodness of fit for RLM models. The problem is that the outliers have a huge effect on the R2 value, to the point where it is completely determined by outliers. Using weighted regression afterwards is an attractive alternative, but it is better to look at the p-values, standard errors and confidence intervals of the estimated coefficients.

Comparing the OLS summary to RLM (results are slightly different to yours due to different randomization):

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.726
Model:                            OLS   Adj. R-squared:                  0.721
Method:                 Least Squares   F-statistic:                     127.4
Date:                Wed, 03 Nov 2021   Prob (F-statistic):           4.15e-15
Time:                        09:33:40   Log-Likelihood:                -87.455
No. Observations:                  50   AIC:                             178.9
Df Residuals:                      48   BIC:                             182.7
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.7071      0.396     14.425      0.000       4.912       6.503
x1             0.3848      0.034     11.288      0.000       0.316       0.453
==============================================================================
Omnibus:                       23.499   Durbin-Watson:                   2.752
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               33.906
Skew:                          -1.649   Prob(JB):                     4.34e-08
Kurtosis:                       5.324   Cond. No.                         23.0
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                    Robust linear Model Regression Results                    
==============================================================================
Dep. Variable:                      y   No. Observations:                   50
Model:                            RLM   Df Residuals:                       48
Method:                          IRLS   Df Model:                            1
Norm:                          HuberT                                         
Scale Est.:                       mad                                         
Cov Type:                          H1                                         
Date:                Wed, 03 Nov 2021                                         
Time:                        09:34:24                                         
No. Iterations:                    17                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.1857      0.111     46.590      0.000       4.968       5.404
x1             0.4790      0.010     49.947      0.000       0.460       0.498
==============================================================================

If the model instance has been used for another fit with different fit parameters, then the fit options might not be the correct ones anymore .

You can see that the standard errors and size of the confidence interval decreases in going from OLS to RLM for both the intercept and the slope term. This suggests that the estimates are closer to the real values.

like image 33
Rob Avatar answered Sep 24 '22 01:09

Rob