When using <code>XGBoost</code> we need to convert categorical variables into numeric values. Would there be any difference in performance or evaluation metrics between these two methods: <ol> <li>dummifying your categorical variables</li> <li>encoding your categorical variables from e.g. (a,b,c) to (1,2,3)</li> </ol> Also, would there be any reason not to go with method 2, for example by using <code>LabelEncoder</code>?


XGBoost Categorical Variables: Dummification vs encoding

1 Answer

xgboost only deals with numeric columns.

If you have a feature [a,b,b,c] that describes a categorical variable (i.e. its values have no numeric relationship), LabelEncoder will simply give you this:

array([0, 1, 1, 2])

LabelEncoder just maps each string ('a','b','c') to an integer, nothing more. Xgboost will then wrongly interpret this feature as having a numeric relationship, as if a < b < c!
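To make the ordering problem concrete, here is a minimal pure-Python sketch. It is a hand-rolled stand-in for label encoding, not scikit-learn itself, and it assumes categories are sorted before being numbered. The `split` helper is hypothetical and only imitates the kind of threshold test a tree applies to a numeric column.

```python
# Hand-rolled stand-in for label encoding (assumption: sorted unique
# categories are mapped to 0, 1, 2, ...).
feature = ["a", "b", "b", "c"]

categories = sorted(set(feature))
code_of = {cat: i for i, cat in enumerate(categories)}
encoded = [code_of[value] for value in feature]
print(encoded)  # [0, 1, 1, 2]


def split(threshold):
    """Hypothetical helper: categories sent left/right by `code < threshold`."""
    left = [c for c in categories if code_of[c] < threshold]
    right = [c for c in categories if code_of[c] >= threshold]
    return left, right


# A single threshold always keeps neighbouring codes together.
for threshold in (0.5, 1.5):
    print(threshold, split(threshold))
```

The threshold 0.5 separates a from {b, c}, and 1.5 separates {a, b} from c. No single threshold isolates b, because the integer codes place it "between" a and c even though the categories have no such order.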

Proper way

Using OneHotEncoder you will eventually get to this:

array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])

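This is the proper representation of a categorical variable for xgboost or for any other machine learning tool that expects numeric input: each category gets its own 0/1 column, so no ordering between categories is implied.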
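As a rough illustration of what one-hot encoding does, the following pure-Python sketch builds the same matrix by hand. It is not the OneHotEncoder implementation, and it assumes the columns follow the sorted category order.

```python
feature = ["a", "b", "b", "c"]

# Assumption: columns are laid out in sorted category order.
categories = sorted(set(feature))

# One column per category; each row has exactly one 1.0.
one_hot = [
    [1.0 if value == cat else 0.0 for cat in categories]
    for value in feature
]
for row in one_hot:
    print(row)

# The column for "b" on its own is enough to isolate that category.
b_column = [row[categories.index("b")] for row in one_hot]
print(b_column)
```

Here the column for b is `[0.0, 1.0, 1.0, 0.0]`, so a test on that one column picks out b directly, which no single threshold on the integer codes can do.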

Pandas get_dummies is a nice tool for creating dummy variables, and in my opinion it is easier to use.
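A short sketch of the get_dummies route, assuming pandas is installed. The DataFrame, the column name `feature`, and the `target` column are made up for illustration. `dtype=int` is passed so the dummy columns come out as 0/1 integers regardless of the pandas version's default.

```python
import pandas as pd

# Hypothetical toy data: one categorical column and a made-up target.
df = pd.DataFrame({
    "feature": ["a", "b", "b", "c"],
    "target": [0, 1, 1, 0],
})

# Replace "feature" with one 0/1 column per category.
dummies = pd.get_dummies(df, columns=["feature"], dtype=int)
print(dummies)
```

The resulting frame keeps `target` and adds `feature_a`, `feature_b`, and `feature_c`, which can then be passed to xgboost like any other numeric columns.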

Method #2 in the question above will not represent the data properly.

153

answered Nov 03 '22 23:11

T. Scharf


Tags: python, categorical-data, xgboost

ishido
