My R-squared score is negative, but my accuracy score from k-fold cross-validation is about 92%

With the code below, my R-squared score comes out negative, but my accuracy score from k-fold cross-validation comes out at about 92%. How is this possible? I'm using a random forest regressor to predict some data. The dataset is available here: https://www.kaggle.com/ludobenistant/hr-analytics

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder,OneHotEncoder

dataset = pd.read_csv("HR_comma_sep.csv")
x = dataset.iloc[:,:-1].values   ##Independent variable
y = dataset.iloc[:,9].values     ##Dependent variable

## Encoding the categorical variables (one label encoder per column)

le_x1 = LabelEncoder()
x[:,7] = le_x1.fit_transform(x[:,7])
le_x2 = LabelEncoder()
x[:,8] = le_x2.fit_transform(x[:,8])
ohe = OneHotEncoder(categorical_features = [7,8])
x = ohe.fit_transform(x).toarray()


## Splitting the dataset into training and testing data

from sklearn.model_selection import train_test_split
y = pd.factorize(dataset['left'].values)[0].reshape(-1, 1)   ## target: the 'left' column (replaces y above)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)

from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train)

from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 10, random_state = 0)
regressor.fit(x_train, y_train)

y_pred = regressor.predict(x_test)
print(y_pred)
from sklearn.metrics import r2_score
print(r2_score(y_test, y_pred))   ## this comes out negative

from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = regressor, X = x_train, y = y_train, cv = 10)
print(accuracies.mean())
print(accuracies.std())

asked Oct 21 '17 15:10

Anant Vikram Singh

1 Answer

There are several issues with your question...

For starters, you are making a very basic mistake: you think you are using accuracy as a metric, but you are in a regression setting, where accuracy is meaningless; the criterion your regressor actually optimizes underneath is the mean squared error (MSE).

Accuracy is a metric used in classification: it is the percentage of correctly classified examples - check the Wikipedia entry for more details.
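To make the definition concrete, here is a minimal sketch of how accuracy is computed. The labels below are made up for illustration and have nothing to do with the HR dataset; the variable names are likewise hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical class labels, invented purely for illustration
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_hat = np.array([0, 1, 0, 0, 1, 0, 1, 1])

# Accuracy is the fraction of predictions that match the true label exactly
acc_manual = np.mean(y_true == y_hat)
acc_sklearn = accuracy_score(y_true, y_hat)

print(acc_manual, acc_sklearn)
```

A regressor outputs continuous values that will almost never match the target exactly, which is why accuracy is not a sensible metric in a regression setting.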

The criterion used internally by your chosen regressor (Random Forest) to grow its trees is included in the verbose output of your regressor.fit(x_train, y_train) command - notice the criterion='mse' argument:

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=0,
           verbose=0, warm_start=False)

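MSE is a non-negative continuous quantity, and it is not upper-bounded by 1, i.e. a value of 0.92 would mean an MSE of 0.92, and not 92%. (Strictly speaking, when no scoring argument is given, cross_val_score falls back to the estimator's own score method, which for a regressor returns R-squared rather than MSE; either way, the number is not an accuracy.)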
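A minimal sketch of what cross_val_score reports for a regressor, assuming synthetic data from make_regression as a stand-in for the HR dataset (the data and the names reg, default_scores, r2_scores and mse_scores are illustrative assumptions, not part of the question). It compares the default score with scoring='r2' and recovers the MSE from 'neg_mean_squared_error':

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data (assumption: stands in for the real dataset)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

reg = RandomForestRegressor(n_estimators=10, random_state=0)

# No scoring argument: the estimator's own .score() is used (R-squared for regressors)
default_scores = cross_val_score(reg, X, y, cv=5)

# The same folds, with the metric requested explicitly
r2_scores = cross_val_score(reg, X, y, cv=5, scoring="r2")
neg_mse_scores = cross_val_score(reg, X, y, cv=5, scoring="neg_mean_squared_error")

# scikit-learn negates error metrics so that "higher is better"; flip the sign back
mse_scores = -neg_mse_scores

print("default:", default_scores.mean())
print("r2:     ", r2_scores.mean())
print("mse:    ", mse_scores.mean())
```

On unscaled targets like these, the MSE is far above 1, which shows that it cannot be read as a percentage.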

Since the quantity of interest here is the MSE, it is good practice to pass it explicitly as the scoring function of your cross-validation:

cv_mse = cross_val_score(estimator = regressor, X = x_train, y = y_train, cv = 10, scoring='neg_mean_squared_error')
cv_mse.mean()
# -2.433430574463703e-28

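For all practical purposes, this is zero (the minus sign only reflects that scikit-learn reports the negated MSE, so that higher is always better) - you fit the training set almost perfectly; for confirmation, here is the (again perfect) R-squared score on your training set: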

train_pred = regressor.predict(x_train)
r2_score(y_train , train_pred)
# 1.0

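But, as always, the moment of truth comes when you apply your model to the test set. Your second mistake here is that, since you trained your regressor on the scaled y_train, you must also scale y_test before evaluating (strictly speaking this should be done with sc_y.transform, which reuses the statistics learned from y_train; the figures below come from fit_transform, exactly as run here):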

y_test = sc_y.fit_transform(y_test)
r2_score(y_test , y_pred)
# 0.9998476914664215

and you get a very nice R-squared on the test set (close to 1).
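The effect of the scale mismatch can be reproduced in a small sketch. It assumes synthetic data from make_regression (shifted by an arbitrary constant so the mismatch is obvious) instead of the HR dataset, and all variable names are illustrative. It also shows an alternative that is not in the code above: mapping the predictions back to the original units with inverse_transform.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption); the +500 shift is arbitrary, for illustration
X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)
y = y + 500.0
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit the target scaler on the training target only
sc_y = StandardScaler()
y_train_scaled = sc_y.fit_transform(y_train.reshape(-1, 1)).ravel()

reg = RandomForestRegressor(n_estimators=10, random_state=0)
reg.fit(X_train, y_train_scaled)
pred_scaled = reg.predict(X_test)  # predictions live on the scaled target

# Mistake: comparing scaled predictions with the unscaled test target
r2_mismatched = r2_score(y_test, pred_scaled)

# Option 1: bring y_test onto the same scale, reusing the fitted scaler
y_test_scaled = sc_y.transform(y_test.reshape(-1, 1)).ravel()
r2_scaled = r2_score(y_test_scaled, pred_scaled)

# Option 2: bring the predictions back to the original units
pred_original = sc_y.inverse_transform(pred_scaled.reshape(-1, 1)).ravel()
r2_original = r2_score(y_test, pred_original)

print(r2_mismatched, r2_scaled, r2_original)
```

The mismatched comparison yields a negative R-squared, while the two consistent options agree with each other, since R-squared does not change when the same linear rescaling is applied to both the target and the predictions.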

What about the MSE?

from sklearn.metrics import mean_squared_error
mse_test = mean_squared_error(y_test, y_pred)
mse_test
# 0.00015230853357849051
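The two test-set figures above are linked: R-squared equals 1 - MSE / Var(y), so when the target has unit variance (as y_test has after standardization) the two numbers add up to 1, which is what the reported values do to the digits shown. A minimal sketch of this identity, using made-up random data rather than the HR dataset:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up data for illustration: a target and slightly noisy predictions
rng = np.random.default_rng(0)
y_true = rng.normal(size=100)
y_hat = y_true + rng.normal(scale=0.1, size=100)

mse = mean_squared_error(y_true, y_hat)
r2 = r2_score(y_true, y_hat)

# R^2 = 1 - SS_res / SS_tot = 1 - MSE / Var(y), with the population variance (ddof=0)
r2_from_mse = 1.0 - mse / np.var(y_true)

print(r2, r2_from_mse)
```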

answered Sep 27 '22 21:09

desertnaut

Tags: python, machine-learning, scikit-learn, random-forest
