
XGBoost Categorical Variables: Dummification vs encoding

When using XGBoost, we need to convert categorical variables into numeric ones.

Would there be any difference in performance/evaluation metrics between the methods of:

  1. dummifying your categorical variables
  2. encoding your categorical variables from e.g. (a,b,c) to (1,2,3)

Also, would there be any reason not to go with method 2, for example by using LabelEncoder?

ishido asked Dec 14 '15 10:12




1 Answer

XGBoost only deals with numeric columns.

Suppose you have a feature [a,b,b,c] that describes a categorical variable (i.e. its values have no numeric relationship to one another).

Using LabelEncoder you will simply get this:

array([0, 1, 1, 2]) 

XGBoost will wrongly interpret this feature as having a numeric relationship (as if a < b < c)! LabelEncoder just maps each string ('a', 'b', 'c') to an integer, nothing more.
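For reference, a minimal sketch of how the array above might be produced, assuming scikit-learn is installed. The variable names (feature, encoder, encoded) are illustrative and not from the original answer.

```python
from sklearn.preprocessing import LabelEncoder

# Toy categorical feature from the answer: no order is implied between a, b, c.
feature = ["a", "b", "b", "c"]

encoder = LabelEncoder()
# fit_transform sorts the distinct labels and maps each one to its index.
encoded = encoder.fit_transform(feature)

print(encoded)           # [0 1 1 2]
print(encoder.classes_)  # ['a' 'b' 'c']
```

The integers only record which label is which; any ordering a model reads into them is an artifact of the sort order.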

Proper way

Using OneHotEncoder you will eventually get to this:

array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])

This is the proper representation of a categorical variable for XGBoost or any other machine learning tool.
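A minimal sketch of the one-hot version, assuming a scikit-learn release whose OneHotEncoder accepts string input directly (0.20 or later). The names feature, encoder and one_hot are illustrative.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder expects a 2-D input: one row per sample, one column per feature.
feature = np.array(["a", "b", "b", "c"]).reshape(-1, 1)

encoder = OneHotEncoder()
# The default result is a sparse matrix; toarray() makes it dense for display.
one_hot = encoder.fit_transform(feature).toarray()

print(one_hot)
print(encoder.categories_)  # the category that each output column stands for
```

Each row contains exactly one 1, in the column belonging to that row's category, so no ordering between categories is implied.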

pandas' get_dummies is a nice tool for creating dummy variables (and is easier to use, in my opinion).
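A minimal sketch of the get_dummies route, assuming pandas is installed. The column name "letter" is made up for illustration. Depending on the pandas version, the dummy columns come back as booleans or as small integers, so the sketch casts them to int before printing.

```python
import pandas as pd

# Same toy feature as above, held in a DataFrame column (name is hypothetical).
df = pd.DataFrame({"letter": ["a", "b", "b", "c"]})

# One indicator column per category, named <column>_<category>.
dummies = pd.get_dummies(df, columns=["letter"])

print(list(dummies.columns))  # ['letter_a', 'letter_b', 'letter_c']
print(dummies.astype(int))
```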

Method #2 in the question above will not represent the data properly.

T. Scharf answered Nov 03 '22 23:11
