
Trouble training xgboost on categorical column

I am trying to run a Python notebook (link). At the cell shown below, In [446], where the author trains XGBoost, I get the following error:

ValueError: DataFrame.dtypes for data must be int, float or bool. Did not expect the data types in fields StateHoliday, Assortment

# XGB with xgboost library
dtrain = xgb.DMatrix(X_train[predictors], y_train)
dtest = xgb.DMatrix(X_test[predictors], y_test)

watchlist = [(dtrain, 'train'), (dtest, 'test')]

xgb_model = xgb.train(params, dtrain, 300, evals = watchlist,
                      early_stopping_rounds = 50, feval = rmspe_xg, verbose_eval = True)

Here is a minimal example that reproduces the error:

import pickle
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

with open('train_store', 'rb') as f:
    train_store = pickle.load(f)

train_store.shape

predictors = ['Store', 'DayOfWeek', 'Open', 'Promo', 'StateHoliday', 'SchoolHoliday', 'Year', 'Month', 'Day', 
              'WeekOfYear', 'StoreType', 'Assortment', 'CompetitionDistance', 'CompetitionOpenSinceMonth', 
              'CompetitionOpenSinceYear', 'Promo2', 'Promo2SinceWeek', 'Promo2SinceYear', 'CompetitionOpen', 
              'PromoOpen']

y = np.log(train_store.Sales) # log transformation of Sales
X = train_store

# split the data into train/test set
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = 0.3, # 30% for the evaluation set
                                                    random_state = 42)

# base parameters
params = {
    'booster': 'gbtree', 
    'objective': 'reg:linear', # regression task
    'subsample': 0.8,          # 80% of data to grow trees and prevent overfitting
    'colsample_bytree': 0.85,  # 85% of features used
    'eta': 0.1, 
    'max_depth': 10, 
    'seed': 42} # for reproducible results

num_round = 60 # default 300

dtrain = xgb.DMatrix(X_train[predictors], y_train)
dtest  = xgb.DMatrix(X_test[predictors],  y_test)

watchlist = [(dtrain, 'train'), (dtest, 'test')]

xgb_model = xgb.train(params, dtrain, num_round, evals = watchlist,
                      early_stopping_rounds = 50, feval = rmspe_xg, verbose_eval = True)
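
The snippet above passes rmspe_xg as feval but does not define it. For anyone trying to reproduce the problem, here is a minimal sketch of what such a function could look like. It assumes the metric is root mean squared percentage error and that the labels are log(Sales), as in the code above; the notebook's actual definition may differ, and the helper name rmspe is only illustrative.

```python
import numpy as np

def rmspe(y_true, y_pred):
    """Root mean squared percentage error; assumes y_true contains no zeros."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(((y_true - y_pred) / y_true) ** 2)))

def rmspe_xg(preds, dmatrix):
    # xgboost calls a custom metric with (predictions, DMatrix)
    # and expects a (name, value) pair back.
    # Labels were trained as log(Sales), so undo the log before scoring.
    y_true = np.exp(dmatrix.get_label())
    y_pred = np.exp(preds)
    return "rmspe", rmspe(y_true, y_pred)
```

This does not affect the dtype error; it only makes the example complete enough to run.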

Link to train_store data file: Link 1

arush1836 asked May 11 '19 07:05



1 Answer

I ran into exactly the same issue while working on the Rossmann Sales Prediction project. It seems that newer versions of xgboost do not accept the data types of StateHoliday, Assortment, and StoreType. You can check the data types, as Mykhailo Lisovyi suggested, with

print(test_train.dtypes)

Replace test_train here with your X_train.

You might get something like this:

DayOfWeek                      int64
Promo                          int64
StateHoliday                   int64
SchoolHoliday                  int64
StoreType                     object
Assortment                    object
CompetitionDistance          float64
CompetitionOpenSinceMonth    float64
CompetitionOpenSinceYear     float64
Promo2                         int64
Promo2SinceWeek              float64
Promo2SinceYear              float64
Year                           int64
Month                          int64
Day                            int64
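
With many columns it can be easier to list only the offending ones instead of reading the whole dtypes output. A small sketch, using a made-up toy frame in place of X_train[predictors] (the values are illustrative only):

```python
import pandas as pd

# Toy frame standing in for X_train[predictors]; the values are made up
df = pd.DataFrame({
    "DayOfWeek": [1, 2, 3],
    "StoreType": ["a", "b", "a"],
    "Assortment": ["a", "c", "c"],
    "CompetitionDistance": [1270.0, 570.0, 14130.0],
})

# Keep only the columns that are neither numeric nor bool
non_numeric = df.select_dtypes(exclude=["number", "bool"]).columns.tolist()
print(non_numeric)
```

Any column name that shows up in this list needs to be converted before the frame is passed to xgb.DMatrix.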

The error is raised because of the columns with object dtype. You can convert them with

from sklearn import preprocessing
lbl = preprocessing.LabelEncoder()
test_train['StoreType'] = lbl.fit_transform(test_train['StoreType'].astype(str))
test_train['Assortment'] = lbl.fit_transform(test_train['Assortment'].astype(str))

After these steps the columns are numeric and training should run without the dtype error.
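
If you have separate train and test frames, the same mapping should be applied to both. A minimal sketch of that, assuming toy frames named train and test (made up for illustration) and one encoder per column kept in a dict:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data standing in for the real train/test split; values are made up
train = pd.DataFrame({"StoreType": ["a", "b", "c", "a"],
                      "Assortment": ["a", "c", "b", "c"]})
test = pd.DataFrame({"StoreType": ["c", "a"],
                     "Assortment": ["b", "a"]})

encoders = {}
for col in ["StoreType", "Assortment"]:
    enc = LabelEncoder()
    # Learn the label-to-integer mapping on the training data only
    train[col] = enc.fit_transform(train[col].astype(str))
    # Reuse the same mapping on the test data
    test[col] = enc.transform(test[col].astype(str))
    encoders[col] = enc  # keep the encoder so codes can be decoded later

print(train.dtypes)
```

Note that transform raises a ValueError if the test data contains a label that was not present in the training data, so this only works when both frames share the same categories.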

Zhi Yuan answered Oct 05 '22 06:10