I'm writing code to solve a simple problem: predicting the probability of an item being missing from an inventory.
I'm using XGBoost for this.
I have the data split into two .csv files, one with the train data and the other with the test data.
Here is the code:
import pandas as pd
import numpy as np
train = pd.read_csv('C:/Users/pedro/Documents/Pedro/UFMG/8o periodo/Python/Trabalho Final/train.csv', index_col='sku').fillna(-1)
test = pd.read_csv('C:/Users/pedro/Documents/Pedro/UFMG/8o periodo/Python/Trabalho Final/test.csv', index_col='sku').fillna(-1)
X_train, y_train = train.drop('isBackorder', axis=1), train['isBackorder']
import xgboost as xgb
xg_reg = xgb.XGBRegressor(objective='reg:linear', colsample_bytree=0.3, learning_rate=0.1,
                          max_depth=10, alpha=10, n_estimators=10)
xg_reg.fit(X_train,y_train)
y_pred = xg_reg.predict(test)
# Create file for the competition submission
test['isBackorder'] = y_pred
pred = test['isBackorder'].reset_index()
pred.to_csv('competitionsubmission.csv',index=False)
And here are the functions where I try to measure the accuracy of the model (using RMSE, the accuracy_score function, and K-Fold cross-validation):
#RMSE
from sklearn.metrics import mean_squared_error
rmse = np.sqrt(mean_squared_error(y_train, y_pred))
print("RMSE: %f" % (rmse))
#Accuracy
from sklearn.metrics import accuracy_score
# make predictions for test data
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
#KFold
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
# CV model
kfold = KFold(n_splits=10, random_state=7)
results = cross_val_score(xg_reg, X_train, y_train, cv=kfold)
print("Accuracy: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
But I'm having some problems.
None of the accuracy tests above work.
When using the RMSE function and the accuracy function, the following error appears: ValueError: Found input variables with inconsistent numbers of samples: [1350955, 578982]
I guess the train/test split structure I'm using is not correct.
Since I don't have a y_test (and I don't know how to create one for my problem), I can't pass it to the functions above.
The K-Fold validation isn't working either.
Can someone help me PLEASE?
XGBoost's accuracy can be improved by searching for good hyperparameter values before settling on a final model. After initializing XGBoost, we train it on the training set: the model learns from this dataset, stores what it has learned, and uses that knowledge when making predictions.
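For example, here is a minimal hyperparameter-search sketch using scikit-learn's GridSearchCV, assuming the X_train and y_train defined above; the grid values are illustrative assumptions, not tuned recommendations:
from sklearn.model_selection import GridSearchCV
import xgboost as xgb

# Illustrative grid only; these values are assumptions, not recommendations.
param_grid = {'max_depth': [3, 6, 10], 'learning_rate': [0.05, 0.1, 0.3]}
search = GridSearchCV(
    xgb.XGBRegressor(objective='reg:squarederror', n_estimators=10),
    param_grid, scoring='neg_root_mean_squared_error', cv=3)
search.fit(X_train, y_train)
print(search.best_params_)  # best combination found across the folds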
Another way to perform cross-validation with XGBoost is to use XGBoost's own, non-scikit-learn-compatible API. "Non-scikit-learn compatible" means that instead of scikit-learn's cross_val_score() function, we use XGBoost's cv() function with explicitly created DMatrix objects.
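A minimal sketch of that native API, assuming the X_train and y_train defined above (the parameter values mirror the question's and are not tuned):
import xgboost as xgb

# Build the native data structure and cross-validate with xgb.cv()
dtrain = xgb.DMatrix(X_train, label=y_train)
params = {'objective': 'reg:squarederror', 'max_depth': 10, 'learning_rate': 0.1}
cv_results = xgb.cv(params=params, dtrain=dtrain, num_boost_round=10,
                    nfold=5, metrics='rmse', seed=7)
print(cv_results['test-rmse-mean'].iloc[-1])  # mean RMSE on the held-out folds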
We can make predictions with the fitted model on the test dataset via the scikit-learn-style model.predict(). Note that with the scikit-learn wrapper, XGBClassifier.predict() returns hard class labels; to get probabilities, call predict_proba() (the native Booster API does return probabilities directly from predict() for a binary:logistic objective).
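A hedged sketch, assuming the train/test frames loaded above and that isBackorder is a 0/1 label (i.e. treating the task as binary classification rather than regression):
import xgboost as xgb

# Sketch only: a classifier exposes both labels and probabilities
clf = xgb.XGBClassifier(n_estimators=10, max_depth=10, learning_rate=0.1)
clf.fit(X_train, y_train)
labels = clf.predict(test)               # hard 0/1 class labels
probs = clf.predict_proba(test)[:, 1]    # probability of a backorder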
XGBoost can be used directly for regression predictive modeling.
Your only issue is that you need validation data. You can't measure accuracy between predict(X_test) and a non-existent y_test. Use sklearn.model_selection.train_test_split to make a validation set based on your training data. You will then have a train set, a validation set, and a test set, and you can evaluate the performance of your model on the validation set.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y)
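By default this holds out 25% of the rows. If you want a reproducible split, you can pass an explicit fraction and seed; the values below are illustrative choices, not requirements:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=7)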
Other remarks:
Accuracy makes no sense here because you're trying to predict continuous values. Only use accuracy for categorical variables, after thresholding probabilities into discrete labels.
Also, in recent scikit-learn versions KFold(n_splits=10, random_state=7) raises a ValueError unless you also pass shuffle=True, because random_state has no effect without shuffling.
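If you do want an accuracy number, binarize first. A minimal sketch, assuming xg_reg has been fit on the split training data as in the full example below and that isBackorder is a 0/1 label (the 0.5 threshold is an assumption):
from sklearn.metrics import accuracy_score

# Threshold the regressor's continuous output into 0/1 labels first
y_val_pred = xg_reg.predict(X_test)
predictions = (y_val_pred >= 0.5).astype(int)
print("Accuracy: %.2f%%" % (accuracy_score(y_test, predictions) * 100))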
At a minimum, this could work:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.metrics import mean_squared_error
import xgboost as xgb

train = pd.read_csv('C:/Users/pedro/Documents/Pedro/UFMG/8o periodo/Python/Trabalho Final/train.csv', index_col='sku').fillna(-1)
test_data = pd.read_csv('C:/Users/pedro/Documents/Pedro/UFMG/8o periodo/Python/Trabalho Final/test.csv', index_col='sku').fillna(-1)

x, y = train.drop('isBackorder', axis=1), train['isBackorder']

# Hold out part of the training data as a validation set
X_train, X_test, y_train, y_test = train_test_split(x, y)

# 'reg:linear' is deprecated; 'reg:squarederror' is its current name
xg_reg = xgb.XGBRegressor(objective='reg:squarederror', colsample_bytree=0.3, learning_rate=0.1,
                          max_depth=10, alpha=10, n_estimators=10)
xg_reg.fit(X_train, y_train)

# Cross-validate on the training portion; shuffle=True is required when random_state is set
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
results = cross_val_score(xg_reg, X_train, y_train, cv=kfold)  # default scoring for a regressor is R^2
print("CV R^2: %.4f (+/- %.4f)" % (results.mean(), results.std()))

# Evaluate on the held-out validation set
y_test_pred = xg_reg.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
print("Validation RMSE: %f" % rmse)

# Predict on the competition test set and write the submission file
y_pred = xg_reg.predict(test_data)
pd.DataFrame({'sku': test_data.index, 'isBackorder': y_pred}).to_csv('competitionsubmission.csv', index=False)
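With this setup the model is trained on X_train, scored against the held-out y_test (the validation set), and the submission file is built from predictions on the untouched competition test_data.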