
Passing categorical data to Sklearn Decision Tree

There are several posts about how to encode categorical data for Sklearn decision trees, but the Sklearn documentation says:

Some advantages of decision trees are:

(...)

Able to handle both numerical and categorical data. Other techniques are usually specialized in analyzing datasets that have only one type of variable. See the algorithms for more information.

But running the following script

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    data = pd.DataFrame()
    data['A'] = ['a','a','b','a']
    data['B'] = ['b','b','a','b']
    data['C'] = [0, 0, 1, 0]
    data['Class'] = ['n','n','y','n']

    tree = DecisionTreeClassifier()
    tree.fit(data[['A','B','C']], data['Class'])

outputs the following error:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python2.7/site-packages/sklearn/tree/tree.py", line 154, in fit
        X = check_array(X, dtype=DTYPE, accept_sparse="csc")
      File "/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 377, in check_array
        array = np.array(array, dtype=dtype, order=order, copy=copy)
    ValueError: could not convert string to float: b

I know that in R it is possible to pass categorical data directly. Is that possible with Sklearn?

0xhfff asked Jun 29 '16 19:06



2 Answers

(This is just a reformat of my comment above from 2016...it still holds true.)

The accepted answer for this question is misleading.

As it stands, sklearn decision trees do not handle categorical data - see issue #5442.

The recommended approach of using Label Encoding converts the categories to integers, which DecisionTreeClassifier() will treat as numeric. If your categorical data is not ordinal, this is not good: the tree will split on thresholds over an arbitrary ordering, so you'll end up with splits that do not make sense.
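
To make the ordering problem concrete, here is a minimal sketch. The `colors` feature is a made-up example, not part of the question's data. It shows that LabelEncoder assigns integer codes in sorted order, so any threshold split a tree makes on those codes follows alphabetical order rather than anything meaningful.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical nominal feature: the colors have no natural order.
colors = ["red", "green", "blue", "green"]

encoder = LabelEncoder()
codes = encoder.fit_transform(colors)

# classes_ holds the categories in sorted order; each code is an index into it.
print(list(encoder.classes_))
print(codes.tolist())

# A tree can only split these codes with a threshold, e.g. `code <= 0.5`
# (blue vs. the rest) or `code <= 1.5` (blue and green vs. red).
# No single threshold isolates "green", the middle code, from the others.
```

Isolating a middle category would take two consecutive splits, which is why the tree shape ends up depending on how the labels happen to sort.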

Using a OneHotEncoder is the only current valid way. It allows arbitrary splits that do not depend on the label ordering, but it is computationally expensive.
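
A minimal sketch of that approach on the question's data might look like the following. The use of ColumnTransformer and make_pipeline, the `handle_unknown="ignore"` setting, and `random_state=0` are my own choices for illustration, not something the question or the linked issue prescribes.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Same toy data as in the question.
data = pd.DataFrame({
    "A": ["a", "a", "b", "a"],
    "B": ["b", "b", "a", "b"],
    "C": [0, 0, 1, 0],
    "Class": ["n", "n", "y", "n"],
})
features = data[["A", "B", "C"]]

# One-hot encode the string columns; pass the numeric column through unchanged.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["A", "B"])],
    remainder="passthrough",
)

# random_state is fixed only to make the sketch reproducible.
model = make_pipeline(preprocess, DecisionTreeClassifier(random_state=0))
model.fit(features, data["Class"])

predictions = model.predict(features)
print(list(predictions))
```

Keeping the encoder inside the pipeline means the same encoding is applied at prediction time, and `handle_unknown="ignore"` encodes categories unseen during fitting as all zeros instead of raising an error.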

James Owers answered Oct 07 '22 06:10


(..)

Able to handle both numerical and categorical data.

This only means that you can use

  • the DecisionTreeClassifier class for classification problems
  • the DecisionTreeRegressor class for regression.

In any case you need to one-hot encode categorical variables before you fit a tree with sklearn, like so:

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    data = pd.DataFrame()
    data['A'] = ['a','a','b','a']
    data['B'] = ['b','b','a','b']
    data['C'] = [0, 0, 1, 0]
    data['Class'] = ['n','n','y','n']

    tree = DecisionTreeClassifier()

    one_hot_data = pd.get_dummies(data[['A','B','C']], drop_first=True)
    tree.fit(one_hot_data, data['Class'])

Guillaume answered Oct 07 '22 08:10