I need to calculate the coefficient of determination for a linear regression model.
I got a strange result: computing it from the definition with numpy functions gives a different value than sklearn.metrics.r2_score.
This code shows the difference:
import numpy as np
from sklearn.metrics import r2_score
y_true = np.array([2, -0.5, 2.5, 3, 0])
y_pred = np.array([2.5, 0.0, 3, 8, 0])
r2_score(y_true, y_pred)
>>> -1.6546391752577323
def my_r2_score(y_true, y_pred):
    return 1 - np.sum((y_true - y_pred) ** 2) / np.sum((np.average(y_true) - y_true) ** 2)

def my_r2_score_var(y_true, y_pred):
    return 1 - np.var(y_true - y_pred) / np.var(y_true)
print(my_r2_score(y_true, y_pred))
print(my_r2_score_var(y_true, y_pred))
>>> -1.6546391752577323
>>> -0.7835051546391754
Can anybody explain this difference?
my_r2_score_var is wrong because np.sum((y_true - y_pred) ** 2)/5 is not equal to np.var(y_true - y_pred):
>>> np.sum((y_true - y_pred) ** 2)/5
5.15
>>> np.var(y_true - y_pred)
3.46
np.var subtracts the mean of its argument before squaring, so what np.var(y_true - y_pred) actually computes is:
>>> np.sum(((y_true - y_pred) - np.average(y_true - y_pred))**2)/5
3.46
np.sum((y_true - y_pred) ** 2) is the correct RSS.
You assumed np.var(y_true - y_pred) to be the mean RSS (RSS/5 here), but it isn't: the two agree only when the residuals have zero mean, and here their mean is -1.3.
np.var(y_true), on the other hand, is exactly the mean TSS (TSS/5). So only the RSS part of the 1 - RSS/TSS formula is wrong.
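To make the gap concrete, here is a minimal sketch using the same data as the question. It checks that the mean squared residual equals the variance of the residuals plus their squared mean, and then shows one way to repair the variance-style function by using np.mean of the squared residuals in the numerator. The name my_r2_score_mean is made up for this illustration and is not part of numpy or sklearn.

```python
import numpy as np

y_true = np.array([2, -0.5, 2.5, 3, 0])
y_pred = np.array([2.5, 0.0, 3, 8, 0])

residuals = y_true - y_pred

mean_sq = np.mean(residuals ** 2)   # RSS / n, what the R^2 formula needs
var_res = np.var(residuals)         # squared deviations from the residual mean
bias_sq = np.mean(residuals) ** 2   # squared mean of the residuals

# Identity: mean(e^2) = var(e) + mean(e)^2
print(mean_sq, var_res + bias_sq)

def my_r2_score_mean(y_true, y_pred):
    # Hypothetical corrected variant: RSS/n divided by TSS/n equals RSS/TSS.
    return 1 - np.mean((y_true - y_pred) ** 2) / np.var(y_true)

print(my_r2_score_mean(y_true, y_pred))
```

With this change the result matches my_r2_score and r2_score for this data. The original my_r2_score_var is the quantity that sklearn.metrics.explained_variance_score computes, which ignores any constant offset in the predictions and therefore differs from R^2 whenever the residual mean is nonzero.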