There are several posts about how to encode categorical data for Sklearn decision trees, but the Sklearn documentation says this:
Some advantages of decision trees are:
(...)
Able to handle both numerical and categorical data. Other techniques are usually specialized in analyzing datasets that have only one type of variable. See the algorithms for more information.
But running the following script
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame()
data['A'] = ['a', 'a', 'b', 'a']
data['B'] = ['b', 'b', 'a', 'b']
data['C'] = [0, 0, 1, 0]
data['Class'] = ['n', 'n', 'y', 'n']

tree = DecisionTreeClassifier()
tree.fit(data[['A', 'B', 'C']], data['Class'])
outputs the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/sklearn/tree/tree.py", line 154, in fit
    X = check_array(X, dtype=DTYPE, accept_sparse="csc")
  File "/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 377, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: b
I know that in R it is possible to pass categorical data; is it possible with Sklearn?
A categorical-variable decision tree is one whose target variable is divided into discrete categories, for example yes or no. Every stage of the decision process falls into exactly one category; there are no in-betweens.
If the feature is categorical, the split is done on the elements belonging to a particular class. If the feature is continuous, the split is done on the elements above a threshold. At every split, the decision tree takes the best variable available at that moment.
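As a rough sketch (not sklearn's actual implementation; the helper names here are made up for illustration), the two split types look like this:

# Hypothetical illustration of the two split types; `rows` is a list of
# dicts mapping feature names to values.
def categorical_split(rows, feature, category):
    # Categorical split: rows in the given category go left, the rest right.
    left = [r for r in rows if r[feature] == category]
    right = [r for r in rows if r[feature] != category]
    return left, right

def continuous_split(rows, feature, threshold):
    # Continuous split: rows at or below the threshold go left, the rest right.
    left = [r for r in rows if r[feature] <= threshold]
    right = [r for r in rows if r[feature] > threshold]
    return left, right

At each node the tree tries candidate splits of both kinds and keeps the one that best separates the classes.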
You can feed categorical variables directly to a random forest using the following approach: first, convert the categories of the feature to numbers using sklearn's LabelEncoder; second, convert the label-encoded feature's dtype to string (object).
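A minimal sketch of that first step, reusing the toy DataFrame from the question (and note the caveat raised in the answer below: the tree will treat these integers as ordered):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = pd.DataFrame({'A': ['a', 'a', 'b', 'a'],
                     'B': ['b', 'b', 'a', 'b']})

# Map each column's categories to integers (e.g. 'a' -> 0, 'b' -> 1).
for col in ['A', 'B']:
    data[col] = LabelEncoder().fit_transform(data[col])

# Caveat: sklearn trees still require numeric input, so casting these
# columns back to string/object would reproduce the original ValueError.
print(data)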
The general regression tree building methodology allows input variables to be a mixture of continuous and categorical variables. A decision tree is generated in which each decision node contains a test on some input variable's value, and the terminal nodes contain the predicted output values.
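To see that structure concretely, sklearn can print a fitted tree; a small sketch on made-up data (export_text is available in sklearn 0.21+):

from sklearn.tree import DecisionTreeRegressor, export_text

# Made-up data: one continuous input and one already-encoded categorical input.
X = [[1.0, 0], [2.0, 0], [3.0, 1], [4.0, 1]]
y = [10.0, 10.0, 20.0, 20.0]

reg = DecisionTreeRegressor(max_depth=2).fit(X, y)

# Decision nodes print as tests on input values; leaves show predicted values.
print(export_text(reg, feature_names=['x_cont', 'x_cat']))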
(This is just a reformat of my comment above from 2016...it still holds true.)
The accepted answer for this question is misleading.
As it stands, sklearn decision trees do not handle categorical data - see issue #5442.
The recommended approach of using Label Encoding converts to integers, which the DecisionTreeClassifier() will treat as numeric. If your categorical data is not ordinal, this is not good - you'll end up with splits that do not make sense.
Using a OneHotEncoder is the only current valid way, allowing arbitrary splits not dependent on the label ordering, but it is computationally expensive.
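For instance, a sketch using sklearn's OneHotEncoder on the question's data (a reasonably recent sklearn/pandas is assumed):

import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({'A': ['a', 'a', 'b', 'a'],
                     'B': ['b', 'b', 'a', 'b'],
                     'C': [0, 0, 1, 0],
                     'Class': ['n', 'n', 'y', 'n']})

# Each category becomes its own 0/1 column, so splits no longer depend
# on an arbitrary integer ordering of the labels.
enc = OneHotEncoder(handle_unknown='ignore')
encoded = enc.fit_transform(data[['A', 'B']]).toarray()

# Stack the encoded columns next to the already-numeric column C.
X = np.hstack([encoded, data[['C']].to_numpy()])

tree = DecisionTreeClassifier().fit(X, data['Class'])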
(...)
Able to handle both numerical and categorical data.
This only means that you can use the DecisionTreeClassifier class for classification problems and the DecisionTreeRegressor class for regression.
In any case you need to one-hot encode categorical variables before you fit a tree with sklearn, like so:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame()
data['A'] = ['a', 'a', 'b', 'a']
data['B'] = ['b', 'b', 'a', 'b']
data['C'] = [0, 0, 1, 0]
data['Class'] = ['n', 'n', 'y', 'n']

tree = DecisionTreeClassifier()
one_hot_data = pd.get_dummies(data[['A', 'B', 'C']], drop_first=True)
tree.fit(one_hot_data, data['Class'])
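One practical caveat (my addition, not part of the original answer): at prediction time pd.get_dummies must yield the same columns the tree was trained on, which you can enforce by reindexing against the training columns:

# Hypothetical unseen sample; reindex so its dummy columns line up with
# the columns produced during training (missing ones are filled with 0).
new_data = pd.DataFrame({'A': ['b'], 'B': ['a'], 'C': [1]})
new_encoded = pd.get_dummies(new_data).reindex(columns=one_hot_data.columns,
                                               fill_value=0)
print(tree.predict(new_encoded))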