Trouble training xgboost on categorical column

Tags: python, categorical-data, xgboost

I am trying to run a Python notebook (link). At the cell below (In [446]), where the author trains XGBoost, I get this error:

ValueError: DataFrame.dtypes for data must be int, float or bool. Did not expect the data types in fields StateHoliday, Assortment

# XGB with xgboost library
dtrain = xgb.DMatrix(X_train[predictors], y_train)
dtest = xgb.DMatrix(X_test[predictors], y_test)

watchlist = [(dtrain, 'train'), (dtest, 'test')]

xgb_model = xgb.train(params, dtrain, 300, evals = watchlist,
                      early_stopping_rounds = 50, feval = rmspe_xg, verbose_eval = True)

Here is minimal code to reproduce the error:

import pickle
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

with open('train_store', 'rb') as f:
    train_store = pickle.load(f)

train_store.shape

predictors = ['Store', 'DayOfWeek', 'Open', 'Promo', 'StateHoliday', 'SchoolHoliday', 'Year', 'Month', 'Day', 
              'WeekOfYear', 'StoreType', 'Assortment', 'CompetitionDistance', 'CompetitionOpenSinceMonth', 
              'CompetitionOpenSinceYear', 'Promo2', 'Promo2SinceWeek', 'Promo2SinceYear', 'CompetitionOpen', 
              'PromoOpen']

y = np.log(train_store.Sales) # log transformation of Sales
X = train_store

# split the data into train/test set
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = 0.3, # 30% for the evaluation set
                                                    random_state = 42)

# base parameters
params = {
    'booster': 'gbtree', 
    'objective': 'reg:linear', # regression task
    'subsample': 0.8,          # 80% of data to grow trees and prevent overfitting
    'colsample_bytree': 0.85,  # 85% of features used
    'eta': 0.1, 
    'max_depth': 10, 
    'seed': 42} # for reproducible results

num_round = 60 # default 300

dtrain = xgb.DMatrix(X_train[predictors], y_train)
dtest  = xgb.DMatrix(X_test[predictors],  y_test)

watchlist = [(dtrain, 'train'), (dtest, 'test')]

xgb_model = xgb.train(params, dtrain, num_round, evals = watchlist,
                      early_stopping_rounds = 50, feval = rmspe_xg, verbose_eval = True)

Link to train_store data file: Link 1

asked May 11 '19 07:05

arush1836

1 Answer

I ran into exactly the same issue while working on the Rossmann Sales Prediction project. It seems that newer versions of xgboost do not accept the data types of StateHoliday, Assortment, and StoreType. You can check the data types, as Mykhailo Lisovyi suggested, with:

print(test_train.dtypes)

Replace test_train here with your X_train.

You might get something like this:

DayOfWeek                      int64
Promo                          int64
StateHoliday                   int64
SchoolHoliday                  int64
StoreType                     object
Assortment                    object
CompetitionDistance          float64
CompetitionOpenSinceMonth    float64
CompetitionOpenSinceYear     float64
Promo2                         int64
Promo2SinceWeek              float64
Promo2SinceYear              float64
Year                           int64
Month                          int64
Day                            int64
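
Rather than reading the listing by eye, the object columns can be picked out programmatically. This is a minimal sketch on a small made-up frame; the values are illustrative and are not the Rossmann data, and the string columns are created with an explicit object dtype to mirror the output above.

```python
import pandas as pd

# Toy frame standing in for X_train; the values are invented for illustration.
X_train = pd.DataFrame({
    "DayOfWeek": [1, 2, 3],
    "StoreType": pd.Series(["a", "c", "a"], dtype=object),
    "Assortment": pd.Series(["a", "b", "c"], dtype=object),
    "CompetitionDistance": [1270.0, 570.0, 14130.0],
})

# Columns with object dtype are the ones the ValueError complains about.
object_cols = X_train.select_dtypes(include="object").columns.tolist()
print(object_cols)
```

Any column that shows up in object_cols needs to be converted before it is passed to xgb.DMatrix.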

The error is raised by the columns of object type. You can convert them with:

from sklearn import preprocessing

# fit_transform refits the encoder on each call, so lbl keeps only the last column's mapping
lbl = preprocessing.LabelEncoder()
test_train['StoreType'] = lbl.fit_transform(test_train['StoreType'].astype(str))
test_train['Assortment'] = lbl.fit_transform(test_train['Assortment'].astype(str))

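After this conversion the columns are integer-typed, and xgb.DMatrix should accept the data. If your error message names other columns (StateHoliday in the question), convert them the same way.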
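
If the data is already split into train and test sets, both need the same label-to-integer mapping. One way to get that, sketched below on made-up frames, is to keep one encoder per column, fit it on the training set, and reuse it on the test set. The encoders dict and the toy values are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frames standing in for X_train / X_test; values are invented.
X_train = pd.DataFrame({
    "StoreType": pd.Series(["a", "c", "a", "d"], dtype=object),
    "Assortment": pd.Series(["a", "b", "c", "a"], dtype=object),
    "Promo": [1, 0, 1, 0],
})
X_test = pd.DataFrame({
    "StoreType": pd.Series(["c", "a"], dtype=object),
    "Assortment": pd.Series(["b", "b"], dtype=object),
    "Promo": [0, 1],
})

encoders = {}  # hypothetical helper dict: column name -> fitted encoder
for col in ["StoreType", "Assortment"]:
    enc = LabelEncoder()
    # Learn the mapping on the training data only ...
    X_train[col] = enc.fit_transform(X_train[col].astype(str))
    # ... and apply the same mapping to the test data.
    X_test[col] = enc.transform(X_test[col].astype(str))
    encoders[col] = enc

print(X_train.dtypes)
```

Note that transform raises a ValueError if the test set contains a label that was not seen during fitting, so this sketch assumes every test category also appears in the training set.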

answered Oct 05 '22 06:10

Zhi Yuan
