
My r-squared score is coming out negative, but my accuracy score using k-fold cross-validation is about 92%

For the code below, my r-squared score comes out negative, but my accuracy score from k-fold cross-validation comes out to about 92%. How is this possible? I'm using the random forest regression algorithm to predict some data. The dataset is available here: https://www.kaggle.com/ludobenistant/hr-analytics

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder,OneHotEncoder

dataset = pd.read_csv("HR_comma_sep.csv")
x = dataset.iloc[:,:-1].values   ##Independent variable
y = dataset.iloc[:,9].values     ##Dependent variable

##Encoding the categorical variables

le_x1 = LabelEncoder()
x[:,7] = le_x1.fit_transform(x[:,7])
le_x2 = LabelEncoder()
x[:,8] = le_x2.fit_transform(x[:,8])
ohe = OneHotEncoder(categorical_features = [7,8])
x = ohe.fit_transform(x).toarray()


##splitting the dataset in training and testing data

from sklearn.cross_validation import train_test_split
y = pd.factorize(dataset['left'].values)[0].reshape(-1, 1)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)

from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train)

from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 10, random_state = 0)
regressor.fit(x_train, y_train)

y_pred = regressor.predict(x_test)
print(y_pred)
from sklearn.metrics import r2_score
r2_score(y_test , y_pred)

from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = regressor, X = x_train, y = y_train, cv = 10)
accuracies.mean()
accuracies.std()
asked Oct 21 '17 by Anant Vikram Singh

People also ask

What does a negative R2 score mean?

R² is negative when the chosen model does not follow the trend of the data, i.e. it fits worse than a horizontal line at the mean of the target. An apparently good score elsewhere can also be a symptom of over-fitting, which can happen for various reasons, such as a small dataset or noise in the data.

What if cross validation score is in negative?

Generally it means that the model you have fit is worse than the null model: a horizontal line with a slope of 0 would fit the data better than the model you created.

What is a good cross validation R2 score?

For k-fold CV on a regression problem (e.g. the Boston Housing data set), cross_val_score calculates the R-squared metric for the applied model; an R-squared close to 1 implies a better fit and less error.


1 Answer

There are several issues with your question...

For starters, you are making a very basic mistake: you think you are using accuracy as a metric, but you are in a regression setting, and the actual metric used underneath is the mean squared error (MSE).

Accuracy is a metric used in classification, and it has to do with the percentage of correctly classified examples - check the Wikipedia entry for more details.
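
To make the distinction concrete, here is a minimal, self-contained illustration with toy labels (not the HR data): accuracy counts correctly predicted discrete labels, while a regression metric such as MSE measures the magnitude of continuous errors.

# toy labels, for illustration only
from sklearn.metrics import accuracy_score, mean_squared_error

y_true = [0, 1, 1, 0]
accuracy_score(y_true, [0, 1, 0, 0])               # 0.75 -- 3 of 4 discrete labels correct
mean_squared_error(y_true, [0.1, 0.9, 0.4, 0.2])   # 0.105 -- an error magnitude, not a percentage
# accuracy_score(y_true, [0.1, 0.9, 0.4, 0.2]) would raise a ValueError, since continuous values are not class labels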

The metric used internally in your chosen regressor (Random Forest) is included in the verbose output of your regressor.fit(x_train, y_train) command - notice the criterion='mse' argument:

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=0,
           verbose=0, warm_start=False)

MSE is a positive continuous quantity, and it is not upper-bounded by 1, i.e. if you got a value of 0.92, this means... well, 0.92, and not 92%.
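
To see that MSE is tied to the scale of the target rather than to a 0-1 range, consider a quick sketch with hypothetical numbers:

from sklearn.metrics import mean_squared_error
mean_squared_error([100, 200], [110, 190])   # 100.0 -- squared target units, certainly not a percentage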

Knowing that, it is good practice to explicitly request MSE as the scoring function of your cross-validation:

cv_mse = cross_val_score(estimator = regressor, X = x_train, y = y_train, cv = 10, scoring='neg_mean_squared_error')
cv_mse.mean()
# -2.433430574463703e-28

For all practical purposes, this is zero - you fit the training set almost perfectly; for confirmation, here is the (perfect again) R-squared score on your training set:

train_pred = regressor.predict(x_train)
r2_score(y_train , train_pred)
# 1.0

But, as always, the moment of truth comes when you apply your model to the test set; your second mistake is that, since you trained your regressor on a scaled y_train, you should also scale y_test before evaluating:

y_test = sc_y.fit_transform(y_test)
r2_score(y_test , y_pred)
# 0.9998476914664215

and you get a very nice R-squared in the test set (close to 1).
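
As a side note on why your original R-squared came out negative: R-squared is 1 - SS_res/SS_tot, so it drops below zero whenever the model's squared errors exceed those of simply predicting the mean of y_test. Comparing predictions that live on the standardized scale against an unscaled y_test produces exactly that situation. A minimal sketch with toy numbers (not the HR data):

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([0., 1., 1., 0., 1.])                   # unscaled targets
y_pred_scaled = (y_true - y_true.mean()) / y_true.std()   # "predictions" living on the standardized scale

r2_score(y_true, y_pred_scaled)
# approx. -1.58 -- perfectly correlated with y_true, yet negative purely because of the scale mismatch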

What about the MSE?

from sklearn.metrics import mean_squared_error
mse_test = mean_squared_error(y_test, y_pred)
mse_test
# 0.00015230853357849051
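
As a small follow-up (not part of the original answer): if you would rather report the error in the original units of y instead of standardized units, StandardScaler.inverse_transform can map the arrays back with the scaler that produced them. A sketch, assuming the sc_y, y_test and y_pred from the snippets above:

from sklearn.metrics import mean_squared_error

# undo the target scaling before computing the error (StandardScaler expects 2-D input)
y_test_orig = sc_y.inverse_transform(y_test.reshape(-1, 1))
y_pred_orig = sc_y.inverse_transform(y_pred.reshape(-1, 1))
mean_squared_error(y_test_orig, y_pred_orig)   # the same fit quality, expressed in the original units of y
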
answered Sep 27 '22 by desertnaut