 

Does scikit-learn's decision tree support unordered ('enum') multiclass features?

From the documentation, it appears that DecisionTreeClassifier supports multiclass classification:

DecisionTreeClassifier is capable of both binary (where the labels are [-1, 1]) classification and multiclass (where the labels are [0, ..., K-1]) classification.

But it appears that the decision rule in each node is based on "greater than".

I'm trying to build trees with enum features (where the absolute value of each feature has no meaning - only equal / not equal).

Is this supported in scikit-learn decision trees?

My current solution is to split each such feature into a set of binary features, one per possible value - but I'm looking for a cleaner and more efficient solution.

asked Sep 11 '13 by Ophir Yoktan


2 Answers

The term multiclass only affects the target variable: for the random forest in scikit-learn it is either categorical with an integer coding for multiclass classification or continuous for regression.

"Greater-than" rules apply to the input variables independently of the kind of target variable. If you have categorical input variables with a low dimensionality (e.g. less than a couple of tens of possible values) then it might be beneficial to use a one-hot-encoding for those. See:

  • OneHotEncoder if your categories are encoded as integers,
  • DictVectorizer if your categories are encoded as string labels in a list of Python dicts.
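
For example, a minimal sketch using DictVectorizer and a DecisionTreeClassifier (the feature names and toy data below are invented for illustration):

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.tree import DecisionTreeClassifier

    # Toy categorical inputs; each dict maps feature name -> string value.
    X_raw = [
        {"color": "red", "shape": "circle"},
        {"color": "green", "shape": "square"},
        {"color": "blue", "shape": "circle"},
    ]
    y = [0, 1, 2]  # multiclass target with integer coding

    # One binary column per (feature, value) pair, so tree splits
    # become equal / not-equal tests on a single category value.
    vec = DictVectorizer(sparse=False)
    X = vec.fit_transform(X_raw)

    clf = DecisionTreeClassifier().fit(X, y)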

If some of the categorical variables have a high cardinality (e.g. thousands of possible values or more) then it has been shown experimentally that DecisionTreeClassifiers, and better models based on them such as RandomForestClassifiers, can be trained directly on the raw integer coding without converting it to a one-hot encoding that would waste memory and inflate the model size.
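
A minimal sketch of that direct approach, assuming scikit-learn's OrdinalEncoder (the data is invented for illustration):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.preprocessing import OrdinalEncoder

    # A high-cardinality category kept as a single integer-coded column,
    # avoiding the memory blow-up of a one-hot encoding.
    X_raw = np.array([["user_4821"], ["user_17"], ["user_4821"], ["user_993"]])
    y = [1, 0, 1, 0]

    enc = OrdinalEncoder()
    X = enc.fit_transform(X_raw)

    clf = RandomForestClassifier(n_estimators=100).fit(X, y)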

answered Sep 23 '22 by ogrisel


DecisionTreeClassifier is certainly capable of multiclass classification. The "greater than" rule just happens to be illustrated in that link, but arriving at that decision rule is a consequence of the effect it has on the information gain or the Gini impurity (see later in that page). Decision tree nodes generally have binary rules, so they typically take the form of some value being greater than another. The trick is transforming your data so it has good predictive values to compare.
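
To see those threshold-style rules on a fitted tree, here is a quick sketch with export_text (toy data, XOR-like target):

    from sklearn.tree import DecisionTreeClassifier, export_text

    X = [[0, 0], [0, 1], [1, 0], [1, 1]]
    y = [0, 1, 1, 0]  # XOR-like toy target

    clf = DecisionTreeClassifier().fit(X, y)
    # Each internal node prints as "feature_i <= threshold".
    print(export_text(clf))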

To be clear, multiclass means your data (say a document) is to be classified as one of a set of possible classes. This is different from multilabel classification, where the document needs to be classified with several classes out of a set of possible classes. Most scikit-learn classifiers support multiclass, and the library has a few meta-wrappers to accomplish multilabeling. You can also use probabilities (models with the predict_proba method) or decision function distances (models with the decision_function method) for multilabeling.
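
For instance, a hedged sketch of thresholding predict_proba to assign several labels at once (the 0.3 cutoff is arbitrary):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    X = [[0, 1], [1, 0], [1, 1]]
    y = [0, 1, 2]  # ordinary multiclass target

    clf = RandomForestClassifier(n_estimators=50).fit(X, y)
    proba = clf.predict_proba([[1, 1]])[0]  # one probability per class
    labels = np.flatnonzero(proba >= 0.3)   # keep every class above the cutoff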

If you are saying you need to apply multiple labels to each datum (like ['red','sport','fast'] to cars), then you need to create a unique label for each possible combination to use trees/forests, which becomes your [0...K-1] set of classes. However, this implies that there is some predictive correlation in the data (for the combined color, type, and speed in the cars example). For cars, there may be for red or yellow fast sports cars, but it is unlikely for other three-way combinations. The data may be strongly predictive for those few combinations and very weak for the rest. You are better off using an SVM or LinearSVC and/or wrapping it with OneVsRestClassifier or similar.
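
A sketch of that last suggestion, wrapping LinearSVC in OneVsRestClassifier with an indicator label matrix (the features and labels are invented):

    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.preprocessing import MultiLabelBinarizer
    from sklearn.svm import LinearSVC

    X = [[220, 1], [90, 0], [200, 1], [60, 0]]  # e.g. top speed, is_sport
    labels = [["red", "sport", "fast"], ["red"], ["sport", "fast"], []]

    # One binary indicator column per label; OneVsRest fits one LinearSVC per column.
    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(labels)

    clf = OneVsRestClassifier(LinearSVC()).fit(X, Y)
    predicted = mlb.inverse_transform(clf.predict([[210, 1]]))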

answered Sep 20 '22 by wwwslinger