I have accidentally noticed that OLS models implemented by sklearn and statsmodels yield different values of R^2 when the intercept is not fitted. Otherwise they seem to work fine. The following code:
import numpy as np
import sklearn
import statsmodels
import sklearn.linear_model as sl
import statsmodels.api as sm

# synthetic data: Y = 2*X + 4 + noise
np.random.seed(42)
N = 1000
X = np.random.normal(loc=1, size=(N, 1))
Y = 2 * X.flatten() + 4 + np.random.normal(size=N)

# fit with and without an intercept in both libraries
sklearnIntercept = sl.LinearRegression(fit_intercept=True).fit(X, Y)
sklearnNoIntercept = sl.LinearRegression(fit_intercept=False).fit(X, Y)
statsmodelsIntercept = sm.OLS(Y, sm.add_constant(X))
statsmodelsNoIntercept = sm.OLS(Y, X)

print(sklearnIntercept.score(X, Y), statsmodelsIntercept.fit().rsquared)
print(sklearnNoIntercept.score(X, Y), statsmodelsNoIntercept.fit().rsquared)
print(sklearn.__version__, statsmodels.__version__)
prints:
0.78741906105 0.78741906105
-0.950825182861 0.783154483028
0.19.1 0.8.0
Where does the difference come from?
This question differs from Different Linear Regression Coefficients with statsmodels and sklearn: there, sklearn.linear_model.LinearRegression (with intercept) was fit on an X prepared as for statsmodels.api.OLS.
It also differs from Statsmodels: Calculate fitted values and R squared: this question addresses a difference between two Python packages (statsmodels and scikit-learn), while the linked question is about statsmodels and the common R^2 definition. Both happen to be answered by the same answer, but whether that makes them duplicates has already been discussed here: Does the same answer imply that the questions should be closed as duplicate?
A key difference between the two libraries is how they handle the constant term. scikit-learn's LinearRegression (which fits an ordinary least squares model) lets you request an intercept through the fit_intercept parameter, while statsmodels' OLS() expects you to add a constant column to the design matrix yourself, via sm.add_constant(). The constant accounts for the bias in the data (a constant offset present in all observations). A short sketch of the two conventions follows below.
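For illustration, a minimal sketch of the two conventions, reusing X and Y from the question's snippet; both fits recover the same coefficients:

import sklearn.linear_model as sl
import statsmodels.api as sm

# scikit-learn: the intercept is requested via a constructor parameter
sk_fit = sl.LinearRegression(fit_intercept=True).fit(X, Y)

# statsmodels: the intercept is an explicit column of ones added to X
sm_fit = sm.OLS(Y, sm.add_constant(X)).fit()

print(sk_fit.intercept_, sk_fit.coef_)  # roughly 4 and [2]
print(sm_fit.params)                    # roughly [4, 2], constant first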
To your other two points: linear regression is in its basic form the same in statsmodels and in scikit-learn. However, the implementations differ, which might produce different results in edge cases, and scikit-learn has in general more support for larger models. For example, statsmodels currently uses sparse matrices in very few parts.
When you need a variety of linear regression models, mixed linear models, regression with discrete dependent variables, and more, statsmodels has options. It also has a syntax much closer to R, so for those transitioning to Python, statsmodels is a good choice.
As pointed out by @user333700 in the comments, the definition of R^2 differs between statsmodels' OLS implementation and scikit-learn's.
From the documentation of the RegressionResults class (emphasis mine):
rsquared
R-squared of a model with an intercept. This is defined here as 1 - ssr/centered_tss if the constant is included in the model and 1 - ssr/uncentered_tss if the constant is omitted.
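To make that formula concrete, here is a minimal check reusing Y and statsmodelsNoIntercept from the question's snippet; with no constant, statsmodels uses the uncentered total sum of squares:

res = statsmodelsNoIntercept.fit()
ssr = (res.resid ** 2).sum()        # residual sum of squares
uncentered_tss = (Y ** 2).sum()     # total sum of squares about zero, no mean-centering
print(1 - ssr / uncentered_tss)     # ~0.7832, same as res.rsquared above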
From the documentation of LinearRegression.score():
score(X, y, sample_weight=None)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual
sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
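In other words, scikit-learn always centers the total sum of squares, even when fit_intercept=False, which is why the score can go negative. A sketch reproducing the negative value, reusing X, Y, and sklearnNoIntercept from the question's snippet:

y_pred = sklearnNoIntercept.predict(X)
u = ((Y - y_pred) ** 2).sum()       # residual sum of squares
v = ((Y - Y.mean()) ** 2).sum()     # centered total sum of squares
print(1 - u / v)                    # ~-0.9508, same as sklearnNoIntercept.score(X, Y)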