 

Are there feature selection algorithms that can be applied to categorical data inputs?

I am training a neural network which has 10 or so categorical inputs. After one-hot encoding these categorical inputs I end up feeding around 500 inputs into the network.

I would love to be able to ascertain the importance of each of my categorical inputs. Scikit-learn has numerous feature importance algorithms; however, can any of these be applied to categorical data inputs? All of the examples use numerical inputs.

I could apply these methods to the one-hot encoded inputs, but how would I extract the meaning after applying to binarised inputs? How does one go about judging feature importance on categorical inputs?

asked Feb 17 '17 by A555h55

People also ask

Which algorithms can handle categorical data?

Logistic Regression, K Nearest Neighbors (KNN), Support Vector Machine (SVM), Decision Tree.

Which feature selection method will be used if we have numerical input variable and categorical output variable?

ANOVA F-test feature selection. Importantly, ANOVA is used when one variable is numeric and one is categorical, such as numerical input variables and a classification target variable in a classification task.

Can XGBoost take categorical features in input?

XGBoost does not support categorical variables natively, so it is necessary to encode them prior to training.

Which of the following algorithms works best with categorical features?

Logistic Regression is a classification algorithm so it is best applied to categorical data.


1 Answer

Using feature selection algorithms on one-hot encoded inputs can be misleading because of the relations between the encoded features. For example, if you encode a feature with n values into n binary features and n-1 of them are selected, the last one is not needed: its value is fully determined by the others.
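To illustrate that redundancy, scikit-learn's OneHotEncoder can drop one dummy column per encoded feature, since the dropped column is implied by the rest. A minimal sketch (the colour data is made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A single categorical feature with three values.
X = np.array([["red"], ["green"], ["blue"], ["green"]])

# drop="first" keeps only n-1 dummy columns per feature, removing the
# column whose value is implied by the others.
enc = OneHotEncoder(drop="first")
X_enc = enc.fit_transform(X).toarray()

print(X_enc.shape)  # (4, 2) instead of (4, 3)
```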

Since the number of your original features is quite low (~10), feature selection will not help you much: you will probably be able to drop only a few of them without losing too much information.

You wrote that one-hot encoding turns the 10 features into about 500, meaning that each feature has roughly 50 values. In this case you might be more interested in discretisation algorithms that operate on the values themselves. If there is an implied order on the values, you can use algorithms designed for continuous features. Another option is simply to merge or omit rare values, or values without a strong correlation to the target concept.
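For instance, rare values can be collapsed into a single bucket before encoding. A small sketch with pandas (the column name, the toy data and the threshold are made up for illustration):

```python
import pandas as pd

# Toy data standing in for one high-cardinality categorical column.
df = pd.DataFrame({"city": ["london"] * 50 + ["paris"] * 40 + ["oslo"] * 2 + ["riga"]})

# Collapse values that occur fewer than `min_count` times into "other".
min_count = 5  # arbitrary threshold for illustration
counts = df["city"].value_counts()
rare = counts[counts < min_count].index
df["city"] = df["city"].where(~df["city"].isin(rare), "other")

print(df["city"].value_counts())
```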

If you do use feature selection, most algorithms will work on categorical data, but you should beware of corner cases. For example, mutual information, suggested by @Igor Raush, is an excellent measure. However, features with many values tend to have higher entropy than features with fewer values. That in turn can lead to higher mutual information and a bias towards features with many values. One way to cope with this is to normalise by dividing the mutual information by the feature's entropy.
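A minimal sketch of that normalisation, assuming scikit-learn's mutual_info_score and scipy's entropy (the toy data is made up):

```python
import numpy as np
from scipy.stats import entropy
from sklearn.metrics import mutual_info_score

def normalized_mi(feature, target):
    # mutual_info_score accepts label arrays (including strings) directly.
    mi = mutual_info_score(feature, target)
    _, counts = np.unique(feature, return_counts=True)
    h = entropy(counts)  # feature entropy, same (natural-log) base as mi
    return mi / h if h > 0 else 0.0

# Toy example: a categorical feature against a binary target.
colour = ["red", "red", "blue", "blue", "green", "green"]
label = [1, 1, 0, 0, 1, 0]
print(normalized_mi(colour, label))
```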

Another set of feature selection algorithms that might help you are the wrappers. They delegate the learning to the classification algorithm itself, and are therefore indifferent to the representation as long as the classification algorithm can cope with it.
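A hedged sketch of the wrapper idea: greedy forward selection over the original categorical columns, where each candidate subset is scored by cross-validating the actual classifier (here a logistic regression on a one-hot encoding; the column names and data are invented for illustration):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy categorical data and labels.
df = pd.DataFrame({
    "colour": ["red", "blue", "red", "green"] * 10,
    "size":   ["s", "m", "l", "m"] * 10,
    "shape":  ["round", "square", "round", "round"] * 10,
})
y = [0, 1, 0, 1] * 10

# The encoding lives inside the pipeline, so selection happens on the
# original categorical columns rather than on individual dummy columns.
model = make_pipeline(OneHotEncoder(handle_unknown="ignore"), LogisticRegression())

selected, remaining = [], list(df.columns)
while remaining:
    scores = {c: cross_val_score(model, df[selected + [c]], y, cv=5).mean()
              for c in remaining}
    best = max(scores, key=scores.get)
    # Stop once adding the best remaining column no longer improves the score.
    if selected and scores[best] <= cross_val_score(model, df[selected], y, cv=5).mean():
        break
    selected.append(best)
    remaining.remove(best)

print("selected columns:", selected)
```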

answered Sep 21 '22 by DaL