I am training a neural network which has 10 or so categorical inputs. After one-hot encoding these categorical inputs I end up feeding around 500 inputs into the network.
I would love to be able to ascertain the importance of each of my categorical inputs. Scikit-learn has numerous feature importance algorithms; however, can any of these be applied to categorical inputs? All of the examples use numerical inputs.
I could apply these methods to the one-hot encoded inputs, but how would I extract the meaning after applying them to the binarised inputs? How does one go about judging feature importance for categorical inputs?
Common classification algorithms used in this setting include Logistic Regression, K-Nearest Neighbors (KNN), Support Vector Machines (SVM), and Decision Trees.
ANOVA F-test feature selection: importantly, ANOVA is used when one variable is numerical and one is categorical, such as numerical input variables and a categorical target variable in a classification task.
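As a rough illustration, scikit-learn's SelectKBest can be combined with f_classif to run ANOVA F-test selection; the toy data and the choice of k below are made up purely for the example:

```python
# A minimal sketch of ANOVA F-test feature selection with scikit-learn.
# The data and k=10 are illustrative, not taken from the original post.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Toy data standing in for numerical inputs and a classification target.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)

print("F-scores:", np.round(selector.scores_, 2))
print("Selected feature indices:", selector.get_support(indices=True))
```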
XGBoost does not support categorical variables natively, so it is necessary to encode them prior to training.
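A minimal sketch of that workflow, assuming xgboost is installed and using made-up column names, could look like this:

```python
# A minimal sketch of encoding categorical columns before training XGBoost.
# Column names, data, and the use of OneHotEncoder are illustrative assumptions.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBClassifier  # assumes the xgboost package is installed

df = pd.DataFrame({
    "colour": ["red", "blue", "green", "blue"],
    "size":   ["S", "M", "L", "M"],
})
y = [0, 1, 1, 0]

# One-hot encode the categorical columns, then feed the result to XGBoost.
encode = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["colour", "size"])]
)
model = Pipeline([("encode", encode), ("clf", XGBClassifier(n_estimators=50))])
model.fit(df, y)
```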
Logistic Regression is a classification algorithm, so it is suited to problems with a categorical target variable.
Using feature selection algorithms on one-hot encoded inputs can be misleading because of the relations between the encoded features. For example, if you encode a feature with n values into n binary features and n-1 of them are selected, the last feature is redundant, since it is fully determined by the others.
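If you still want per-feature importances from a one-hot representation, one option (sketched below with made-up data and a random forest rather than your network) is to sum the importances of all encoded columns that came from the same original feature:

```python
# A minimal sketch of mapping one-hot-encoded columns back to their original
# categorical feature and summing importances per feature. The data, column
# names, and RandomForestClassifier are illustrative assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "colour": ["red", "blue", "green", "blue", "red", "green"],
    "size":   ["S", "M", "L", "M", "S", "L"],
})
y = [0, 1, 1, 0, 0, 1]

enc = OneHotEncoder()
X = enc.fit_transform(df)
names = enc.get_feature_names_out(df.columns)   # e.g. "colour_red", "size_M"

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Sum the per-column importances for all columns derived from the same feature.
importance = pd.Series(clf.feature_importances_, index=names)
per_feature = importance.groupby(lambda n: n.rsplit("_", 1)[0]).sum()
print(per_feature)
```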
Since the number of your features is quite low (~10), feature selection will not help you much, since you will probably only be able to drop a few of them without losing too much information.
You wrote that one-hot encoding turns the 10 features into about 500, meaning that each feature has roughly 50 values. In this case you might be more interested in discretisation algorithms that operate on the values themselves. If there is an implied order on the values, you can use algorithms for continuous features. Another option is simply to omit rare values, or values without a strong correlation to the target.
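For example, merging rare values into a single bucket takes only a few lines of pandas; the threshold and data below are made up for illustration:

```python
# A minimal sketch of collapsing rare category values into a single "other"
# bucket before encoding. The count threshold and values are illustrative.
import pandas as pd

s = pd.Series(["a", "a", "b", "c", "c", "c", "d", "e", "a", "c"])

counts = s.value_counts()
rare = counts[counts < 2].index            # values seen fewer than 2 times
s_clean = s.where(~s.isin(rare), "other")  # replace rare values with "other"

print(s_clean.value_counts())
```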
If you do use feature selection, most algorithms will work on categorical data, but you should beware of corner cases. For example, mutual information, suggested by @Igor Raush, is an excellent measure. However, features with many values tend to have higher entropy than features with fewer values. That in turn can lead to higher mutual information and a bias towards features with many values. A way to cope with this problem is to normalise by dividing the mutual information by the feature's entropy.
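A small sketch of that normalisation, using scikit-learn's mutual_info_score and scipy's entropy on made-up data:

```python
# A minimal sketch of mutual information normalised by the feature's entropy,
# to reduce the bias towards high-cardinality features. The data is illustrative.
import numpy as np
from scipy.stats import entropy
from sklearn.metrics import mutual_info_score

feature = np.array(["red", "blue", "green", "blue", "red", "green", "red"])
target  = np.array([0, 1, 1, 1, 0, 1, 0])

mi = mutual_info_score(feature, target)        # mutual information, in nats
_, counts = np.unique(feature, return_counts=True)
h_feature = entropy(counts)                    # feature entropy, in nats

normalised_mi = mi / h_feature if h_feature > 0 else 0.0
print(f"MI = {mi:.3f}, H(feature) = {h_feature:.3f}, MI/H = {normalised_mi:.3f}")
```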
Another set of feature selection algorithms that might help you are the wrappers. They delegate the learning to the classification algorithm itself, and are therefore indifferent to the representation as long as the classification algorithm can cope with it.
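scikit-learn's SequentialFeatureSelector is one such wrapper; the sketch below uses a KNN classifier and arbitrary parameters purely for illustration:

```python
# A minimal sketch of a wrapper approach using SequentialFeatureSelector,
# which scores candidate feature subsets with the classifier itself.
# The classifier, data, and n_features_to_select are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=15, random_state=0)

selector = SequentialFeatureSelector(
    KNeighborsClassifier(), n_features_to_select=5, direction="forward", cv=3
)
selector.fit(X, y)
print("Selected feature indices:", selector.get_support(indices=True))
```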