There are several posts about how to encode categorical data for Sklearn decision trees, but the Sklearn documentation says this:
Some advantages of decision trees are:
(...)
Able to handle both numerical and categorical data. Other techniques are usually specialized in analyzing datasets that have only one type of variable. See the algorithms for more information.
But running the following script
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame()
data['A'] = ['a', 'a', 'b', 'a']
data['B'] = ['b', 'b', 'a', 'b']
data['C'] = [0, 0, 1, 0]
data['Class'] = ['n', 'n', 'y', 'n']

tree = DecisionTreeClassifier()
tree.fit(data[['A', 'B', 'C']], data['Class'])
outputs the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/sklearn/tree/tree.py", line 154, in fit
    X = check_array(X, dtype=DTYPE, accept_sparse="csc")
  File "/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 377, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: b
I know that in R it is possible to pass categorical data; is it possible with Sklearn?
A categorical-variable decision tree is one whose target variable is divided into discrete categories, for example yes or no. Every stage of the decision process falls into exactly one category; there are no in-betweens.
If the feature is categorical, the split is done on the elements belonging to a particular class. If the feature is continuous, the split is done on the elements above a threshold. At every split, the decision tree takes the best variable available at that moment.
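As a rough sketch (not sklearn's actual implementation; the helper names here are made up for illustration), the two split types look like this:

# Hypothetical illustration of the two split types; `rows` is a list of
# dicts mapping feature names to values.
def categorical_split(rows, feature, category):
    # Categorical split: rows in the given category go left, the rest right.
    left = [r for r in rows if r[feature] == category]
    right = [r for r in rows if r[feature] != category]
    return left, right

def continuous_split(rows, feature, threshold):
    # Continuous split: rows at or below the threshold go left, the rest right.
    left = [r for r in rows if r[feature] <= threshold]
    right = [r for r in rows if r[feature] > threshold]
    return left, right

At each node the tree tries candidate splits of both kinds and keeps the one that best separates the classes.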
You can feed categorical variables directly to a random forest using the following approach: first, convert the categories of the feature to numbers using sklearn's LabelEncoder; second, convert the label-encoded feature's dtype to string (object).
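A minimal sketch of that first step, reusing the toy DataFrame from the question (and note the caveat raised in the answer below: the tree will treat these integers as ordered):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = pd.DataFrame({'A': ['a', 'a', 'b', 'a'],
                     'B': ['b', 'b', 'a', 'b']})

# Map each column's categories to integers (e.g. 'a' -> 0, 'b' -> 1).
for col in ['A', 'B']:
    data[col] = LabelEncoder().fit_transform(data[col])

# Caveat: sklearn trees still require numeric input, so casting these
# columns back to string/object would reproduce the original ValueError.
print(data)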
The general regression tree building methodology allows input variables to be a mixture of continuous and categorical variables. A decision tree is generated in which each decision node contains a test on some input variable's value, and the terminal nodes contain the predicted output values.
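To see that structure concretely, sklearn can print a fitted tree; a small sketch on made-up data (export_text is available in sklearn 0.21+):

from sklearn.tree import DecisionTreeRegressor, export_text

# Made-up data: one continuous input and one already-encoded categorical input.
X = [[1.0, 0], [2.0, 0], [3.0, 1], [4.0, 1]]
y = [10.0, 10.0, 20.0, 20.0]

reg = DecisionTreeRegressor(max_depth=2).fit(X, y)

# Decision nodes print as tests on input values; leaves show predicted values.
print(export_text(reg, feature_names=['x_cont', 'x_cat']))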
(This is just a reformat of my comment above from 2016...it still holds true.)
The accepted answer for this question is misleading.
As it stands, sklearn decision trees do not handle categorical data - see issue #5442.
The recommended approach of using Label Encoding converts to integers, which the DecisionTreeClassifier() will treat as numeric. If your categorical data is not ordinal, this is not good - you'll end up with splits that do not make sense.
Using a OneHotEncoder is the only current valid way, allowing arbitrary splits not dependent on the label ordering, but it is computationally expensive.
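For instance, a sketch using sklearn's OneHotEncoder on the question's data (a reasonably recent sklearn/pandas is assumed):

import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({'A': ['a', 'a', 'b', 'a'],
                     'B': ['b', 'b', 'a', 'b'],
                     'C': [0, 0, 1, 0],
                     'Class': ['n', 'n', 'y', 'n']})

# Each category becomes its own 0/1 column, so splits no longer depend
# on an arbitrary integer ordering of the labels.
enc = OneHotEncoder(handle_unknown='ignore')
encoded = enc.fit_transform(data[['A', 'B']]).toarray()

# Stack the encoded columns next to the already-numeric column C.
X = np.hstack([encoded, data[['C']].to_numpy()])

tree = DecisionTreeClassifier().fit(X, data['Class'])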
(...)
Able to handle both numerical and categorical data.
This only means that you can use the DecisionTreeClassifier class for classification problems and the DecisionTreeRegressor class for regression.
In any case you need to one-hot encode categorical variables before you fit a tree with sklearn, like so:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame()
data['A'] = ['a', 'a', 'b', 'a']
data['B'] = ['b', 'b', 'a', 'b']
data['C'] = [0, 0, 1, 0]
data['Class'] = ['n', 'n', 'y', 'n']

tree = DecisionTreeClassifier()
one_hot_data = pd.get_dummies(data[['A', 'B', 'C']], drop_first=True)
tree.fit(one_hot_data, data['Class'])
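One practical caveat (my addition, not part of the original answer): at prediction time pd.get_dummies must yield the same columns the tree was trained on, which you can enforce by reindexing against the training columns:

# Hypothetical unseen sample; reindex so its dummy columns line up with
# the columns produced during training (missing ones are filled with 0).
new_data = pd.DataFrame({'A': ['b'], 'B': ['a'], 'C': [1]})
new_encoded = pd.get_dummies(new_data).reindex(columns=one_hot_data.columns,
                                               fill_value=0)
print(tree.predict(new_encoded))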