 

How to explain feature importance after one-hot encode used for decision tree

I know a decision tree has a feature_importances_ attribute, calculated from Gini impurity, and it can be used to check which features are more important.

However, implementations in scikit-learn or Spark only accept numeric attributes, so I have to map string attributes to numeric ones and then apply a one-hot encoder. When the features are put into the decision tree model, they are in 0-1 encoded form rather than the original format. My question is: how do I explain feature importance in terms of the original attributes? Should I avoid one-hot encoding when I want to explain feature importance?

Thanks.

linpingta asked Oct 14 '16 15:10

1 Answer

Conceptually, you may want to use something along the lines of permutation importance. The basic idea is that you take your original dataset and randomly shuffle the values of each column, one at a time. Then you score the perturbed data with the model and compare the performance to the original performance. Done one column at a time, this lets you assess the performance hit you take by destroying each variable; you can then index each drop against the variable with the largest loss (which becomes 1, or 100%). If you do this on your original dataset, prior to the one-hot encoding, you get an importance measure that groups each set of encoded dummy columns back into its original feature.
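A minimal sketch of that idea, using a hypothetical toy dataset with one string column ("color") and one numeric column ("size"): the one-hot encoding lives inside a Pipeline, so the model still accepts the original columns, and we can shuffle each original column directly and measure the accuracy drop.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy data: 'color' is a string column that will be one-hot encoded.
rng = np.random.RandomState(0)
X = pd.DataFrame({
    "color": rng.choice(["red", "green", "blue"], size=500),
    "size": rng.uniform(0, 10, size=500),
})
# Make the target depend on both original columns.
y = ((X["color"] == "red") & (X["size"] > 5)).astype(int)

# Encoding happens inside the pipeline, so predict/score still take
# the ORIGINAL (pre-encoding) columns.
model = Pipeline([
    ("encode", ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), ["color"])],
        remainder="passthrough")),
    ("tree", DecisionTreeClassifier(random_state=0)),
]).fit(X, y)

baseline = model.score(X, y)

# Permutation importance on the original columns: shuffle one column
# at a time and record the drop in accuracy. The 'color' dummies are
# automatically grouped, because we shuffle the column before encoding.
importances = {}
for col in X.columns:
    X_perm = X.copy()
    X_perm[col] = rng.permutation(X_perm[col].values)
    importances[col] = baseline - model.score(X_perm, y)

print(importances)
```

For comparison, newer scikit-learn versions (0.22+) ship `sklearn.inspection.permutation_importance`, which does the shuffling and scoring for you; the manual loop above just makes the grouping of one-hot columns explicit.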

Josh answered Sep 20 '22 00:09