Scikit-learn is returning coefficient of determination (R^2) values less than -1

Tags: python, statistics, scikit-learn

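I'm fitting a simple linear model. I have

from sklearn import linear_model
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation before 0.18

fire = load_data()  # loader not shown; provides .data and .target
regr = linear_model.LinearRegression()
scores = cross_val_score(regr, fire.data, fire.target, cv=10, scoring='r2')
print(scores)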

which yields

[  0.00000000e+00   0.00000000e+00  -8.27299054e+02  -5.80431382e+00
  -1.04444147e-01  -1.19367785e+00  -1.24843536e+00  -3.39950443e-01
   1.95018287e-02  -9.73940970e-02]

How is this possible? When I do the same thing with the built-in diabetes data, it works perfectly fine, but for my data it returns these seemingly absurd results. Have I done something wrong?

425

asked Apr 12 '14 22:04

rhombidodecahedron

3 Answers

There is no reason r^2 can't be negative (despite the ^2 in its name). The scikit-learn documentation states this as well. You can see r^2 as a comparison of your model's fit (in the context of linear regression, a model of order 1, i.e. affine) with the fit of a model of order 0 (a single fitted constant), both obtained by minimizing a squared loss. The constant that minimizes the squared error is the mean. Since you are doing cross-validation with left-out data, the mean of your test set can be wildly different from the mean of your training set. This alone can make the squared error of your predictions much larger than the error of simply predicting the mean of the test data, which results in a negative r^2 score.
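To make the mean-shift argument concrete, here is a minimal sketch with made-up numbers. The arrays and the helper name r2 are illustrative assumptions, not taken from the question. The "model" is just the order-0 constant fitted on the training fold, and the test fold happens to sit far away from it:

```python
def r2(y_true, y_pred):
    """R^2 = 1 - RSS / TSS, with TSS taken around the mean of y_true."""
    mean_true = sum(y_true) / len(y_true)
    rss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    tss = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - rss / tss

# Hypothetical folds: the test targets are far from the training targets.
y_train = [1.0, 2.0, 3.0]
y_test = [10.0, 11.0, 12.0]

# An order-0 model fitted on the training fold predicts the training mean.
train_mean = sum(y_train) / len(y_train)
y_pred = [train_mean] * len(y_test)

score = r2(y_test, y_pred)
print(score)  # -121.5
```

Here RSS is 245 while TSS is only 2, so the score is far below -1 even though nothing is wrong with the arithmetic.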

In the worst case, if your data do not explain your target at all, these scores can become very strongly negative. Try

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation before 0.18

rng = np.random.RandomState(42)
X = rng.randn(100, 80)
y = rng.randn(100)  # y has nothing to do with X whatsoever
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')

This should result in negative r^2 values.

In [23]: scores
Out[23]: 
array([-240.17927358,   -5.51819556,  -14.06815196,  -67.87003867,
    -64.14367035])

The important question now is whether this is because linear models simply do not find anything in your data, or because of something else that can be fixed in the preprocessing of your data. Have you tried scaling your columns to have mean 0 and variance 1? You can do this using sklearn.preprocessing.StandardScaler. As a matter of fact, you should create a new estimator by chaining a StandardScaler and the LinearRegression into a pipeline using sklearn.pipeline.Pipeline. Next you may want to try Ridge regression.
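A rough sketch of what that could look like, reusing the pure-noise data from the example above (so neither model is expected to score well). The step names 'scale' and 'regr' and the value alpha=10.0 are arbitrary choices made for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same pure-noise data as in the example above.
rng = np.random.RandomState(42)
X = rng.randn(100, 80)
y = rng.randn(100)

# Inside cross_val_score the whole pipeline is refit on each training split,
# so the scaler never sees the held-out fold.
ols = Pipeline([('scale', StandardScaler()), ('regr', LinearRegression())])
ridge = Pipeline([('scale', StandardScaler()), ('regr', Ridge(alpha=10.0))])  # alpha chosen arbitrarily

ols_scores = cross_val_score(ols, X, y, cv=5, scoring='r2')
ridge_scores = cross_val_score(ridge, X, y, cv=5, scoring='r2')
print(ols_scores.mean(), ridge_scores.mean())
```

On a real dataset the regularization strength would normally be tuned rather than fixed, for instance with RidgeCV.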

160

answered Oct 04 '22 07:10

eickenberg

Just because R^2 can be negative does not mean it should be.

Possibility 1: a bug in your code.

A common bug that you should double check is that you are passing in parameters correctly:

r2_score(y_true, y_pred) # Correct!
r2_score(y_pred, y_true) # Incorrect!!!!
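The order matters because the total sum of squares is computed around the mean of the first argument. A small check with made-up values (chosen only to keep the arithmetic easy to follow) shows the two calls disagreeing:

```python
from sklearn.metrics import r2_score

# Made-up values for illustration.
y_true = [1, 2, 3, 4, 5]
y_pred = [2, 2, 3, 4, 4]

# RSS is 2 in both calls; only the TSS in the denominator changes.
correct = r2_score(y_true, y_pred)  # TSS of y_true is 10, so 1 - 2/10
swapped = r2_score(y_pred, y_true)  # TSS of y_pred is 4, so 1 - 2/4

print(correct, swapped)
```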

Possibility 2: small datasets

If you are getting a negative R^2, you could also check for overfitting. Keep in mind that cross_val_score() does not randomly shuffle your inputs, so if your samples are inadvertently sorted (by date, for example), you might build models on each fold that are not predictive for the other folds.
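One way to test for this is to pass an explicitly shuffled KFold splitter as cv and compare the scores. The sketch below uses synthetic data sorted by its single feature; the data, the noise level and the seeds are assumptions made purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data, sorted by x: a clear linear trend plus noise.
rng = np.random.RandomState(0)
x = np.linspace(0.0, 1.0, 100)
y = x + 0.1 * rng.randn(100)
X = x.reshape(-1, 1)

# Contiguous folds: each test fold covers a narrow slice of x, so its
# total sum of squares is small and R^2 is dominated by the noise.
sorted_cv = KFold(n_splits=10)
sorted_scores = cross_val_score(LinearRegression(), X, y, cv=sorted_cv, scoring='r2')

# Shuffled folds: each test fold is drawn from the whole range of x.
shuffled_cv = KFold(n_splits=10, shuffle=True, random_state=0)
shuffled_scores = cross_val_score(LinearRegression(), X, y, cv=shuffled_cv, scoring='r2')

print(sorted_scores.mean(), shuffled_scores.mean())
```

If the shuffled scores are much better than the unshuffled ones, the ordering of the samples is a likely culprit. For genuinely time-ordered data, shuffling may leak information, so treat this as a diagnostic rather than a fix.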

Try reducing the number of features, increasing the number of samples, and decreasing the number of folds (if you are using cross-validation). While there is no official rule here, your m x n dataset (where m is the number of samples and n is the number of features) should be of a shape where

m > n^2

and, when you are using cross-validation with f as the number of folds, you should aim for

m/f > n^2
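As a quick sanity check, this rule of thumb can be wrapped in a tiny helper. The function name meets_rule_of_thumb is hypothetical, and the rule itself is only the heuristic stated above, not a guarantee:

```python
def meets_rule_of_thumb(m, n, f=1):
    """Return True if m / f > n ** 2 (m samples, n features, f folds).

    With f=1 this reduces to the plain m > n^2 check.
    """
    return m / f > n ** 2

# 100 samples, 80 features, 5 folds: 20 is nowhere near 6400.
print(meets_rule_of_thumb(100, 80, f=5))    # False
# 10,000 samples, 5 features, 10 folds: 1000 > 25.
print(meets_rule_of_thumb(10000, 5, f=10))  # True
```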

answered Oct 04 '22 08:10

mgoldwasser

R² = 1 - RSS / TSS, where RSS is the residual sum of squares ∑(y - f(x))² and TSS is the total sum of squares ∑(y - mean(y))². Now for R² ≥ -1, it is required that RSS/TSS ≤ 2, but it's easy to construct a model and dataset for which this is not true:

>>> import numpy as np
>>> x = np.arange(50, dtype=float)
>>> y = x
>>> def f(x): return -100
...
>>> rss = np.sum((y - f(x)) ** 2)
>>> tss = np.sum((y - y.mean()) ** 2)
>>> 1 - rss / tss
-74.430972388955581

answered Oct 04 '22 09:10

Fred Foo

