I'm training machine learning models in Python and using the R squared metric from scikit-learn to evaluate them. I decided to play around with scikit-learn's r2_score function, feeding it an array of identical values as y_true and a slightly different, near-constant array as y_pred. I was getting arbitrarily large (negative) values when the input arrays had 10 or more elements, and 0 when they had fewer than 10.
>>> from sklearn.metrics import r2_score
>>> r2_score([213.91666667, 213.91666667, 213.91666667, 213.91666667, 213.91666667,
...           213.91666667, 213.91666667, 213.91666667, 213.91666667, 213.91666667],
...          [213, 214, 214, 214, 214, 214, 214, 214, 214, 214])
-1.1175847590636849e+26
>>> r2_score([213.91666667, 213.91666667, 213.91666667, 213.91666667,
...           213.91666667, 213.91666667, 213.91666667, 213.91666667, 213.91666667],
...          [213, 214, 214, 214, 214, 214, 214, 214, 214])
0.0
You're right that the r2_score output looks wrong. However, this is the result of a floating-point precision issue in the computation rather than a problem with the scikit-learn package itself.
Try running
>>> input_list = [213.91666667, 213.91666667, 213.91666667, 213.91666667, 213.91666667,
...               213.91666667, 213.91666667, 213.91666667, 213.91666667, 213.91666667]
>>> sum(input_list)/len(input_list)
As you can see, the output is not exactly 213.91666667 (a floating-point precision error; you can read more about it here). Why does this matter?
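To make the discrepancy concrete, here is a small check you can run yourself (a sketch; the exact digits printed depend on your platform's floating-point rounding):

input_list = [213.91666667] * 10
mean = sum(input_list) / len(input_list)
print(repr(mean))            # full repr of the computed mean
print(mean == 213.91666667)  # False here: the accumulated sum does not divide back exactly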
Well, the R² section of the scikit-learn User Guide gives the specific formula used to calculate r2_score:

R²(y, ŷ) = 1 - Σᵢ (yᵢ - ŷᵢ)² / Σᵢ (yᵢ - ȳ)²

As you can see, r2_score is simply 1 - (residual sum of squares) / (total sum of squares).
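A quick sanity check of that formula, using some made-up, non-degenerate numbers (the values below are purely illustrative):

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

rss = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
tss = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares

print(1 - rss / tss)             # manual R^2
print(r2_score(y_true, y_pred))  # scikit-learn gives the same value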
In the first case you specify, the residual sum of squares is a number that doesn't really matter on its own. You can calculate it easily; it's about 0.9, which doesn't seem particularly high. However, due to the floating-point error described above, the total sum of squares isn't exactly 0, but rather some very, very small number (on the order of 10^-27, as you'll see below).
Thus, when you divide the residual sum of squares (around 0.9) by the total sum of squares (a very small number), you're left with a very large number. Since that large number is subtracted from 1, you end up with a negative number of huge magnitude as your r2_score output.
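You can reproduce those two magnitudes, and the resulting score, directly; this is a rough sketch, and the exact value of the tiny denominator will vary slightly with your platform:

import numpy as np

y_true = np.array([213.91666667] * 10)
y_pred = np.array([213, 214, 214, 214, 214, 214, 214, 214, 214, 214], dtype=float)

rss = np.sum((y_true - y_pred) ** 2)          # about 0.9
tss = np.sum((y_true - y_true.mean()) ** 2)   # tiny but nonzero, not exactly 0
print(1 - rss / tss)                          # a huge negative number, like the one you saw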
This imprecision in the calculation of the total sum of squares does not occur in the second case, so the denominator is exactly 0 and the function, rather than attempting the undefined division, returns 0.
Looking at the source code of r2_score, we can see essentially the following lines (with the default weights filled in):

import numpy as np

# defaults used when no sample_weight is passed
weight = 1
sample_weight = None

y_true = np.array([213.91666667, 213.91666667, 213.91666667, 213.91666667, 213.91666667,
                   213.91666667, 213.91666667, 213.91666667, 213.91666667, 213.91666667]).reshape(-1, 1)
y_pred = np.array([213, 214, 214, 214, 214, 214, 214, 214, 214, 214]).reshape(-1, 1)

# residual sum of squares
numerator = (weight * (y_true - y_pred) ** 2).sum(axis=0, dtype=np.float64)
# total sum of squares
denominator = (weight * (y_true - np.average(y_true, axis=0,
                                             weights=sample_weight)) ** 2).sum(axis=0, dtype=np.float64)

nonzero_denominator = denominator != 0
nonzero_numerator = numerator != 0
valid_score = nonzero_denominator & nonzero_numerator

output_scores = np.ones([y_true.shape[1]])
output_scores[valid_score] = 1 - (numerator[valid_score] /
                                  denominator[valid_score])
# constant y_true: the score is set to 0.0 instead of dividing by zero
output_scores[nonzero_numerator & ~nonzero_denominator] = 0.0

return np.average(output_scores, weights=None)  # final score returned by the function
The problematic line in your case is the denominator calculation.
For the first case:
denominator = (weight * (y_true - np.average(
y_true, axis=0, weights=sample_weight)) ** 2).sum(axis=0,
dtype=np.float64)
print(denominator)
[ 8.07793567e-27]
It's pretty small, but not 0.
For the second case, it's exactly 0.
Since the denominator is 0, the ratio in r2_score is undefined and the function returns 0 instead. Hope this clears it up.
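You can verify the second case the same way; this is a sketch mirroring the lines above, where the denominator is expected to come out exactly 0:

import numpy as np

y_true = np.array([213.91666667] * 9).reshape(-1, 1)
denominator = ((y_true - np.average(y_true, axis=0)) ** 2).sum(axis=0, dtype=np.float64)
print(denominator)  # [0.] -- with nine elements the computed mean happens to come back exact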