 

Does scikit-learn's decision tree support unordered ('enum') multiclass features?

From the documentation, it appears that DecisionTreeClassifier supports multiclass classification:

DecisionTreeClassifier is capable of both binary (where the labels are [-1, 1]) classification and multiclass (where the labels are [0, ..., K-1]) classification.

But it appears that the decision rule in each node is based on "greater than".

I'm trying to build trees with enum features (where the absolute value of each feature has no meaning - only equal / not equal).

Is this supported in scikit-learn decision trees?

My current solution is to split each such feature into a set of binary features, one per possible value - but I'm looking for a cleaner and more efficient solution.

asked Sep 11 '13 by Ophir Yoktan


2 Answers

The term multiclass only affects the target variable: for the random forest in scikit-learn it is either categorical with an integer coding for multiclass classification or continuous for regression.

"Greater-than" rules apply to the input variables independently of the kind of target variable. If you have categorical input variables with a low dimensionality (e.g. less than a couple of tens of possible values) then it might be beneficial to use a one-hot-encoding for those. See:

  • OneHotEncoder if your categories are encoded as integers,
  • DictVectorizer if your categories are encoded as string labels in a list of Python dicts.
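
For example, a minimal sketch using DictVectorizer and a DecisionTreeClassifier (the feature names and toy data below are invented for illustration):

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.tree import DecisionTreeClassifier

    # Toy categorical inputs; each dict maps feature name -> string value.
    X_raw = [
        {"color": "red", "shape": "circle"},
        {"color": "green", "shape": "square"},
        {"color": "blue", "shape": "circle"},
    ]
    y = [0, 1, 2]  # multiclass target with integer coding

    # One binary column per (feature, value) pair, so tree splits
    # become equal / not-equal tests on a single category value.
    vec = DictVectorizer(sparse=False)
    X = vec.fit_transform(X_raw)

    clf = DecisionTreeClassifier().fit(X, y)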

If some of the categorical variables have a high cardinality (e.g. thousands of possible values or more) then it has been shown experimentally that DecisionTreeClassifiers, and better models based on them such as RandomForestClassifiers, can be trained directly on the raw integer coding without converting it to a one-hot encoding that would waste memory and inflate the model size.
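
A minimal sketch of that direct approach, assuming scikit-learn's OrdinalEncoder (the data is invented for illustration):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.preprocessing import OrdinalEncoder

    # A high-cardinality category kept as a single integer-coded column,
    # avoiding the memory blow-up of a one-hot encoding.
    X_raw = np.array([["user_4821"], ["user_17"], ["user_4821"], ["user_993"]])
    y = [1, 0, 1, 0]

    enc = OrdinalEncoder()
    X = enc.fit_transform(X_raw)

    clf = RandomForestClassifier(n_estimators=100).fit(X, y)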

answered Sep 23 '22 by ogrisel


DecisionTreeClassifier is certainly capable of multiclass classification. The "greater than" rule just happens to be illustrated in that link, but arriving at that decision rule is a consequence of the effect it has on the information gain or the Gini impurity (see later in that page). Decision tree nodes generally have binary rules, so they typically take the form of some value being greater than another. The trick is transforming your data so it has good predictive values to compare.
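
To see those threshold-style rules on a fitted tree, here is a quick sketch with export_text (toy data, XOR-like target):

    from sklearn.tree import DecisionTreeClassifier, export_text

    X = [[0, 0], [0, 1], [1, 0], [1, 1]]
    y = [0, 1, 1, 0]  # XOR-like toy target

    clf = DecisionTreeClassifier().fit(X, y)
    # Each internal node prints as "feature_i <= threshold".
    print(export_text(clf))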

To be clear, multiclass means your data (say a document) is to be classified as one of a set of possible classes. This is different from multilabel classification, where the document needs to be classified with several classes out of a set of possible classes. Most scikit-learn classifiers support multiclass, and the library has a few meta-wrappers to accomplish multilabeling. You can also use probabilities (models with the predict_proba method) or decision function distances (models with the decision_function method) for multilabeling.
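
For instance, a hedged sketch of thresholding predict_proba to assign several labels at once (the 0.3 cutoff is arbitrary):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    X = [[0, 1], [1, 0], [1, 1]]
    y = [0, 1, 2]  # ordinary multiclass target

    clf = RandomForestClassifier(n_estimators=50).fit(X, y)
    proba = clf.predict_proba([[1, 1]])[0]  # one probability per class
    labels = np.flatnonzero(proba >= 0.3)   # keep every class above the cutoff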

If you are saying you need to apply multiple labels to each datum (like ['red','sport','fast'] to cars), then you need to create a unique label for each possible combination to use trees/forests, which becomes your [0...K-1] set of classes. However, this implies that there is some predictive correlation in the data (for the combined color, type, and speed in the cars example). For cars, there may be for red or yellow fast sports cars, but it is unlikely for other three-way combinations. The data may be strongly predictive for those few combinations and very weak for the rest. You are better off using an SVM or LinearSVC and/or wrapping it with OneVsRestClassifier or similar.
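
A sketch of that last suggestion, wrapping LinearSVC in OneVsRestClassifier with an indicator label matrix (the features and labels are invented):

    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.preprocessing import MultiLabelBinarizer
    from sklearn.svm import LinearSVC

    X = [[220, 1], [90, 0], [200, 1], [60, 0]]  # e.g. top speed, is_sport
    labels = [["red", "sport", "fast"], ["red"], ["sport", "fast"], []]

    # One binary indicator column per label; OneVsRest fits one LinearSVC per column.
    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(labels)

    clf = OneVsRestClassifier(LinearSVC()).fit(X, Y)
    predicted = mlb.inverse_transform(clf.predict([[210, 1]]))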

answered Sep 20 '22 by wwwslinger