How do we generally determine whether a given variable (feature) in a dataset is important for accurately performing a prediction task?
Which tests should be conducted to determine whether a variable is suitable for prediction?
Suppose I have 32 features and one of them is 'income'. How should I start analysing its importance? Is there any use in comparing this feature with other features? After all, it is the collection of variables together that drives the prediction, not just the two variables being compared ...
Start here (especially the section Feature Selection Tutorials and Recipes):
http://machinelearningmastery.com/an-introduction-to-feature-selection/
And here (it lists the available methods for further reading):
https://en.wikipedia.org/wiki/Feature_selection
There is also a good article with a more general discussion of the issue:
http://www.jmlr.org/papers/volume3/guyon03a/guyon03a.pdf
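As a concrete illustration of the univariate "filter" methods those links describe, here is a minimal sketch using scikit-learn's SelectKBest on a synthetic regression dataset (the dataset and parameter choices are assumptions for demonstration, not part of your data):

```python
# Univariate filter selection sketch: score each feature independently
# against the target, then keep the k best. Synthetic data is used here.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# 200 samples, 32 features, only 5 of which actually drive the target
X, y = make_regression(n_samples=200, n_features=32, n_informative=5,
                       noise=0.1, random_state=0)

# Score each feature with a univariate F-test and keep the top 10
selector = SelectKBest(score_func=f_regression, k=10)
X_reduced = selector.fit_transform(X, y)

print(X_reduced.shape)                      # (200, 10)
print(selector.get_support(indices=True))   # indices of the kept features
```

Note that such univariate scores judge each feature in isolation, which is exactly the limitation you raise: a feature that looks weak on its own can still matter in combination with others, which is why wrapper and embedded methods from the links above are also worth trying.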
The simplest approach is to fit a Random Forest or a Gradient Boosting Machine on your dataset. These algorithms evaluate the importance of each feature automatically during fitting; once the classifier or regressor is fit, you can access its feature_importances_ attribute (in scikit-learn) - http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html
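For example, a minimal sketch of this, again on an assumed synthetic dataset with 32 features:

```python
# Fit a gradient boosting model and rank features by their learned
# importances. The dataset here is synthetic, standing in for your own.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=300, n_features=32, n_informative=5,
                       noise=0.1, random_state=0)

model = GradientBoostingRegressor(random_state=0).fit(X, y)

# One non-negative score per feature; the scores sum to 1
importances = model.feature_importances_
ranking = np.argsort(importances)[::-1]  # most important first

for idx in ranking[:5]:
    print(f"feature {idx}: importance {importances[idx]:.3f}")
```

If your 'income' column ranks near the top, the model found it useful in combination with the other 31 features; if it ranks near zero, the model largely ignored it.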