How to measure xgboost regressor accuracy using accuracy_score (or other suggested function)

I'm writing code to solve a simple problem: predicting the probability that an item is missing from an inventory.

I'm using the XGBoost prediction model to do this.

I have the data split into two .csv files, one with the train data and the other with the test data.

Here is the code:

    import pandas as pd
    import numpy as np


    train = pd.read_csv('C:/Users/pedro/Documents/Pedro/UFMG/8o periodo/Python/Trabalho Final/train.csv', index_col='sku').fillna(-1)
    test = pd.read_csv('C:/Users/pedro/Documents/Pedro/UFMG/8o periodo/Python/Trabalho Final/test.csv', index_col='sku').fillna(-1)


    X_train, y_train = train.drop('isBackorder', axis=1), train['isBackorder']

    import xgboost as xgb
    xg_reg = xgb.XGBRegressor(objective ='reg:linear', colsample_bytree = 0.3, learning_rate = 0.1,
                    max_depth = 10, alpha = 10, n_estimators = 10)
    xg_reg.fit(X_train,y_train)


    y_pred = xg_reg.predict(test)

    # Create file for the competition submission
    test['isBackorder'] = y_pred
    pred = test['isBackorder'].reset_index()
    pred.to_csv('competitionsubmission.csv',index=False)

And here are the functions I use to try to measure the accuracy of the model (RMSE, the accuracy_score function, and a KFold cross-validation):

    #RMSE
    from sklearn.metrics import mean_squared_error

    rmse = np.sqrt(mean_squared_error(y_train, y_pred))
    print("RMSE: %f" % (rmse))


    #Accuracy
    from sklearn.metrics import accuracy_score

    # make predictions for test data
    predictions = [round(value) for value in y_pred]

    # evaluate predictions
    accuracy = accuracy_score(y_test, y_pred)
    print("Accuracy: %.2f%%" % (accuracy * 100.0))


    #KFold
    from sklearn.model_selection import KFold
    from sklearn.model_selection import cross_val_score

    # CV model
    kfold = KFold(n_splits=10, random_state=7)
    results = cross_val_score(xg_reg, X_train, y_train, cv=kfold)
    print("Accuracy: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

But I'm having some problems.

None of the accuracy tests above work.

When using the RMSE function and the accuracy function, the following error appears: ValueError: Found input variables with inconsistent numbers of samples: [1350955, 578982]

I guess the way I split the train and test data is not correct.

Since I don't have a y_test (and I don't know how to create one for my problem), I can't pass it to the functions above.

The KFold validation isn't working either.

Can someone please help me?

Asked by Pedro Nader on Dec 03 '19 at 22:12



1 Answer

Your only issue is that you need validation data. You can't measure accuracy by comparing predict(X_test) against a y_test that doesn't exist. Use sklearn.model_selection.train_test_split to carve a validation set out of your training data. You will then have a train, a validation, and a test set, and you can evaluate the performance of your model on the validation set, where the true targets are known.

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(x, y)
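As a minimal sketch of why the split removes the "inconsistent numbers of samples" error, the snippet below uses small synthetic data instead of the real CSV files. The array shapes, the 0/1 target, and the mean-value "predictions" are all assumptions made for illustration; in practice the predictions would come from the fitted model.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 200 rows, 3 features, a 0/1 target like isBackorder.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0.5).astype(float)

# The default test_size is 0.25, so 150 rows are kept for training
# and 50 rows (with their true targets) are held out for validation.
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Stand-in for model output: predict the training mean for every validation row.
y_val_pred = np.full(len(y_val), y_train.mean())

# Both arrays have 50 entries, so the metric can be computed.
rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))
print("Validation RMSE: %f" % rmse)

# Comparing training targets with validation predictions mixes two
# different sets of rows and reproduces the error from the question.
raised = False
try:
    mean_squared_error(y_train, y_val_pred)
except ValueError as err:
    raised = True
    print("ValueError:", err)
```

The point is only that the true values and the predictions passed to a metric must refer to the same rows.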

Other remarks:

Accuracy makes no sense here because a regressor predicts continuous values. Only use accuracy when both the targets and the predictions are categorical.
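If the target really is a 0/1 flag, one possible workaround is to turn the continuous predictions into class labels before calling accuracy_score. The sketch below uses made-up arrays, and the 0.5 cut-off is an assumption rather than something taken from the question; a classifier such as xgb.XGBClassifier would be the more usual choice for this kind of target.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical validation targets and continuous regressor outputs.
y_val = np.array([0, 0, 1, 1, 0, 1])
y_val_pred = np.array([0.1, 0.6, 0.8, 0.4, 0.2, 0.9])

# Passing the raw continuous predictions is rejected by accuracy_score.
raised = False
try:
    accuracy_score(y_val, y_val_pred)
except ValueError as err:
    raised = True
    print("ValueError:", err)

# Threshold at 0.5 (an assumed cut-off) to obtain discrete labels.
labels = (y_val_pred >= 0.5).astype(int)

# Now both arguments are class labels, so accuracy is well defined.
accuracy = accuracy_score(y_val, labels)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
```

Note that the labels passed to accuracy_score are the thresholded ones, not the raw predictions.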

At a minimum, this could work:

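    import numpy as np
    import pandas as pd
    import xgboost as xgb
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import KFold, cross_val_score, train_test_split

    train = pd.read_csv('C:/Users/pedro/Documents/Pedro/UFMG/8o periodo/Python/Trabalho Final/train.csv', index_col='sku').fillna(-1)
    test_data = pd.read_csv('C:/Users/pedro/Documents/Pedro/UFMG/8o '
                        'periodo/Python/Trabalho Final/test.csv', index_col='sku').fillna(-1)

    x, y = train.drop('isBackorder', axis=1), train['isBackorder']
    # Hold out part of the labelled training data as a validation set
    X_train, X_test, y_train, y_test = train_test_split(x, y)

    # 'reg:squarederror' is the current name of the deprecated 'reg:linear' objective
    xg_reg = xgb.XGBRegressor(objective ='reg:squarederror', colsample_bytree = 0.3, learning_rate = 0.1,
                    max_depth = 10, alpha = 10, n_estimators = 10)

    xg_reg.fit(X_train,y_train)

    # Recent scikit-learn versions require shuffle=True when random_state is set
    kfold = KFold(n_splits=10, shuffle=True, random_state=7)
    # For a regressor, the default cross_val_score metric is R^2, not accuracy
    results = cross_val_score(xg_reg, X_train, y_train, cv=kfold)
    print("CV R^2: %.2f (%.2f)" % (results.mean(), results.std()))

    # Evaluate on the held-out validation set, where the true targets are known
    y_test_pred = xg_reg.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
    print("RMSE: %f" % rmse)

    # Predict the unlabelled competition data and write the submission file
    y_pred = xg_reg.predict(test_data)
    submission = pd.DataFrame({'isBackorder': y_pred}, index=test_data.index)
    submission.reset_index().to_csv('competitionsubmission.csv',index=False)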
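A note on the KFold and cross_val_score calls above: for a regressor, cross_val_score reports R² per fold unless a scoring argument is given, so the result should not be printed as "accuracy". The sketch below shows one way to get a cross-validated RMSE instead. It runs on synthetic data and uses scikit-learn's GradientBoostingRegressor as a stand-in so that it has no dependency on xgboost; both choices are assumptions for illustration, and an XGBRegressor can be passed to cross_val_score in the same way.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical stand-in data: 200 rows, 3 features, a 0/1 target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0.5).astype(float)

# Stand-in estimator (not XGBoost); any scikit-learn compatible regressor works.
model = GradientBoostingRegressor(n_estimators=10, max_depth=3, random_state=0)

# shuffle=True is needed for random_state to have an effect.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

# scikit-learn scorers follow "higher is better", so MSE is returned negated.
neg_mse = cross_val_score(model, X, y, cv=kfold, scoring="neg_mean_squared_error")

# Flip the sign and take the square root to get one RMSE per fold.
rmse_per_fold = np.sqrt(-neg_mse)
print("CV RMSE: %.4f (+/- %.4f)" % (rmse_per_fold.mean(), rmse_per_fold.std()))
```

Reporting the mean and standard deviation over folds gives a rough sense of how stable the error is across different subsets of the training data.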

Answered by Nicolas Gervais on Oct 16 '22 at 13:10