I know decision tree has feature_importance attribute calculated by Gini and it could be used to check which features are more important.
However, for application in scikit-learn or Spark, it only accepts numeric attribute, so I have to transfer string attribute to numeric attribute and then do one-hot encoder on that. When features are put into decision tree model, it's 0-1 encoded other than original format, my question is, how to explain feature importance for original attributes? should I avoid one-hot encoder when try to explain feature importance?
Thanks.
Conceptually, you may want to use something along the lines of permutation importance. The basic idea, is that you take your original dataset, and randomly shuffle the values of each column 1 at a time. Then, you score your perturbed data with the model and compare the performance to the original performance. If done 1 column at a time, you can assess the performance hit you take by destroying each variable, indexing it to the variable that had the most loss (which would become 1, or 100%). If you can do this to your original dataset, prior to the 1 hot encoding, then you'll be getting an importance measure that groups them together overall.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With