I'm writing code to solve a simple problem: predicting the probability of an item being missing from an inventory.
I'm using XGBoost for this.
I have the data split into two .csv files, one with the train data and the other with the test data.
Here is the code:
import pandas as pd
import numpy as np
train = pd.read_csv('C:/Users/pedro/Documents/Pedro/UFMG/8o periodo/Python/Trabalho Final/train.csv', index_col='sku').fillna(-1)
test = pd.read_csv('C:/Users/pedro/Documents/Pedro/UFMG/8o periodo/Python/Trabalho Final/test.csv', index_col='sku').fillna(-1)
X_train, y_train = train.drop('isBackorder', axis=1), train['isBackorder']
import xgboost as xgb
xg_reg = xgb.XGBRegressor(objective='reg:linear', colsample_bytree=0.3, learning_rate=0.1,
                          max_depth=10, alpha=10, n_estimators=10)
xg_reg.fit(X_train,y_train)
y_pred = xg_reg.predict(test)
# Create file for the competition submission
test['isBackorder'] = y_pred
pred = test['isBackorder'].reset_index()
pred.to_csv('competitionsubmission.csv',index=False)
And here are the functions where I try to measure the accuracy of the model (using RMSE, the accuracy_score function, and K-Fold cross-validation):
#RMSE
from sklearn.metrics import mean_squared_error
rmse = np.sqrt(mean_squared_error(y_train, y_pred))
print("RMSE: %f" % (rmse))
#Accuracy
from sklearn.metrics import accuracy_score
# make predictions for test data
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
#KFold
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
# CV model
kfold = KFold(n_splits=10, random_state=7)
results = cross_val_score(xg_reg, X_train, y_train, cv=kfold)
print("Accuracy: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
But I'm having some problems.
None of the accuracy tests above work.
When using the RMSE function and the accuracy function, the following error appears: ValueError: Found input variables with inconsistent numbers of samples: [1350955, 578982]
I guess the train/test split structure I'm using is not correct.
Since I don't have a y_test (and I don't know how to create one for my problem), I can't pass it to the functions above.
The K-Fold validation isn't working either.
Can someone help me PLEASE?
XGBoost's accuracy can be improved by searching for good hyperparameter values before settling on a final model. After initializing XGBoost, we train it on the training set: the model learns from this dataset, stores what it has learned, and uses that knowledge when making predictions.
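For example, here is a minimal hyperparameter-search sketch using scikit-learn's GridSearchCV, assuming the X_train and y_train defined above; the grid values are illustrative assumptions, not tuned recommendations:
from sklearn.model_selection import GridSearchCV
import xgboost as xgb

# Illustrative grid only; these values are assumptions, not recommendations.
param_grid = {'max_depth': [3, 6, 10], 'learning_rate': [0.05, 0.1, 0.3]}
search = GridSearchCV(
    xgb.XGBRegressor(objective='reg:squarederror', n_estimators=10),
    param_grid, scoring='neg_root_mean_squared_error', cv=3)
search.fit(X_train, y_train)
print(search.best_params_)  # best combination found across the folds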
Another way to perform cross-validation with XGBoost is to use XGBoost's own, non-scikit-learn-compatible API. "Non-scikit-learn compatible" means that instead of scikit-learn's cross_val_score() function, we use XGBoost's cv() function with explicitly created DMatrix objects.
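A minimal sketch of that native API, assuming the X_train and y_train defined above (the parameter values mirror the question's and are not tuned):
import xgboost as xgb

# Build the native data structure and cross-validate with xgb.cv()
dtrain = xgb.DMatrix(X_train, label=y_train)
params = {'objective': 'reg:squarederror', 'max_depth': 10, 'learning_rate': 0.1}
cv_results = xgb.cv(params=params, dtrain=dtrain, num_boost_round=10,
                    nfold=5, metrics='rmse', seed=7)
print(cv_results['test-rmse-mean'].iloc[-1])  # mean RMSE on the held-out folds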
We can make predictions with the fitted model on the test dataset via the scikit-learn-style model.predict(). Note that with the scikit-learn wrapper, XGBClassifier.predict() returns hard class labels; to get probabilities, call predict_proba() (the native Booster API does return probabilities directly from predict() for a binary:logistic objective).
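A hedged sketch, assuming the train/test frames loaded above and that isBackorder is a 0/1 label (i.e. treating the task as binary classification rather than regression):
import xgboost as xgb

# Sketch only: a classifier exposes both labels and probabilities
clf = xgb.XGBClassifier(n_estimators=10, max_depth=10, learning_rate=0.1)
clf.fit(X_train, y_train)
labels = clf.predict(test)               # hard 0/1 class labels
probs = clf.predict_proba(test)[:, 1]    # probability of a backorder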
XGBoost can be used directly for regression predictive modeling.
Your only issue is that you need validation data. You can't measure accuracy between predict(X_test) and a non-existent y_test. Use sklearn.model_selection.train_test_split to make a validation set based on your training data. You will then have a train set, a validation set, and a test set, and you can evaluate the performance of your model on the validation set.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y)
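By default this holds out 25% of the rows. If you want a reproducible split, you can pass an explicit fraction and seed; the values below are illustrative choices, not requirements:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=7)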
Other remarks:
Accuracy makes no sense here because you're trying to predict continuous values. Only use accuracy for categorical variables, after thresholding probabilities into discrete labels.
Also, in recent scikit-learn versions KFold(n_splits=10, random_state=7) raises a ValueError unless you also pass shuffle=True, because random_state has no effect without shuffling.
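If you do want an accuracy number, binarize first. A minimal sketch, assuming xg_reg has been fit on the split training data as in the full example below and that isBackorder is a 0/1 label (the 0.5 threshold is an assumption):
from sklearn.metrics import accuracy_score

# Threshold the regressor's continuous output into 0/1 labels first
y_val_pred = xg_reg.predict(X_test)
predictions = (y_val_pred >= 0.5).astype(int)
print("Accuracy: %.2f%%" % (accuracy_score(y_test, predictions) * 100))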
At a minimum, this could work:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.metrics import mean_squared_error
import xgboost as xgb

train = pd.read_csv('C:/Users/pedro/Documents/Pedro/UFMG/8o periodo/Python/Trabalho Final/train.csv', index_col='sku').fillna(-1)
test_data = pd.read_csv('C:/Users/pedro/Documents/Pedro/UFMG/8o periodo/Python/Trabalho Final/test.csv', index_col='sku').fillna(-1)

x, y = train.drop('isBackorder', axis=1), train['isBackorder']

# Hold out part of the training data as a validation set
X_train, X_test, y_train, y_test = train_test_split(x, y)

# 'reg:linear' is deprecated; 'reg:squarederror' is its current name
xg_reg = xgb.XGBRegressor(objective='reg:squarederror', colsample_bytree=0.3, learning_rate=0.1,
                          max_depth=10, alpha=10, n_estimators=10)
xg_reg.fit(X_train, y_train)

# Cross-validate on the training portion; shuffle=True is required when random_state is set
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
results = cross_val_score(xg_reg, X_train, y_train, cv=kfold)  # default scoring for a regressor is R^2
print("CV R^2: %.4f (+/- %.4f)" % (results.mean(), results.std()))

# Evaluate on the held-out validation set
y_test_pred = xg_reg.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
print("Validation RMSE: %f" % rmse)

# Predict on the competition test set and write the submission file
y_pred = xg_reg.predict(test_data)
pd.DataFrame({'sku': test_data.index, 'isBackorder': y_pred}).to_csv('competitionsubmission.csv', index=False)
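With this setup the model is trained on X_train, scored against the held-out y_test (the validation set), and the submission file is built from predictions on the untouched competition test_data.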