I found this thread from 2014, and the answer states that no, sklearn's random forest classifier cannot handle categorical variables (or at least not directly). Has the answer changed in 2020?
I want to feed gender as a feature for my model. However, gender can take on three values: M, F, or np.nan. If I encode this column into three dichotomous columns, how can the random forest classifier know that these three columns represent a single feature?
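For reference, a minimal sketch of the encoding I have in mind (the DataFrame here is made up for illustration):

```python
import numpy as np
import pandas as pd

# Made-up data for illustration
df = pd.DataFrame({"gender": ["M", "F", np.nan, "M"]})

# One-hot encode, with a separate indicator column for missing values
dummies = pd.get_dummies(df["gender"], prefix="gender", dummy_na=True)
print(dummies.columns.tolist())  # ['gender_F', 'gender_M', 'gender_nan']
```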
Imagine max_features = 7. When training a given tree, it will randomly pick seven features. Suppose gender was chosen. If gender is split into three columns (gender_M, gender_F, gender_NA), will the random forest classifier always pick all three columns and count them as one feature, or is there a chance that it will only pick one or two?
If max_features is set to a value lower than the actual number of columns (which is the advisable approach; see the recommended values for max_features in the docs), then yes: sklearn draws max_features candidate features at each split, so there is a chance that at a given node only a subset of the dummy columns is considered.
But that is not necessarily a bad thing. In a decision tree, the feature used at each split is selected to optimize some metric (e.g., Gini impurity) independently of the other features; only the candidate feature and the target are considered. So in that sense the model never treats these dummy columns as belonging to the same underlying feature anyway.
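To make this concrete, here is a small sketch with synthetic data (everything below is made up for illustration) showing that once gender is one-hot encoded, each dummy column is sampled as an independent split candidate and gets its own entry in feature_importances_:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical setup: 10 numeric features plus 3 dummy columns for gender
X_num, y = make_classification(n_samples=200, n_features=10, random_state=0)
rng = np.random.default_rng(0)
gender = rng.integers(0, 3, size=200)  # 0=M, 1=F, 2=NA
dummies = np.eye(3)[gender]            # one-hot: gender_M, gender_F, gender_NA
X = np.hstack([X_num, dummies])        # 13 columns total

# max_features=7 < 13, so each split draws 7 candidate columns at random;
# the three dummy columns are drawn independently of one another.
clf = RandomForestClassifier(max_features=7, random_state=0).fit(X, y)

# One importance value per column: each dummy is scored as its own feature.
print(clf.feature_importances_[-3:])
```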
In general, though, the best approach for a binary feature like this is to come up with an appropriate method for filling the missing values and then convert it into a single column encoded as 0s and 1s.
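A minimal sketch of that approach, using mode imputation (just one of several reasonable fill strategies) on made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"gender": ["M", "F", np.nan, "M", "F", "M"]})

# Fill missing values with the most frequent category, then map to 0/1
df["gender"] = df["gender"].fillna(df["gender"].mode()[0])
df["gender"] = df["gender"].map({"M": 0, "F": 1})
print(df["gender"].tolist())  # [0, 1, 0, 0, 1, 0]
```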