
Determine importance of a variable in data analysis

How do we generally determine whether a given variable (feature) in a dataset is important for making accurate predictions?

Which tests should be run to determine whether a variable is suitable for prediction?

Suppose I have 32 features and one of them is 'income'. How should I start analysing its importance? Is there any use in comparing this feature with other features, given that in the end it's the whole collection of variables that drives the prediction, not just the two variables being compared?

mach asked Jan 25 '26 13:01



1 Answer

Start here (especially the section "Feature Selection Tutorials and Recipes"):

http://machinelearningmastery.com/an-introduction-to-feature-selection/

Then here (lists a number of available methods for further googling):

https://en.wikipedia.org/wiki/Feature_selection

There is also a good article with a more general discussion of the issue:

http://www.jmlr.org/papers/volume3/guyon03a/guyon03a.pdf

The simplest method is to fit a Random Forest or a Gradient Boosting Machine on your dataset. These algorithms compute an importance score for each feature as a by-product of fitting. Once the classifier or regressor is fit, you can read the scores (in scikit-learn) from its feature_importances_ attribute: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html
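As a rough illustration of the approach above, here is a minimal sketch that fits both model types and ranks features by feature_importances_. The data is synthetic (generated with make_regression), and the column names, including "income", are placeholder labels I made up for display; the helper ranked_importances is hypothetical as well, not part of scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

# Synthetic stand-in for a real dataset: 6 features, of which only the
# first 2 actually drive the target (shuffle=False keeps them in front).
X, y = make_regression(
    n_samples=500,
    n_features=6,
    n_informative=2,
    noise=5.0,
    shuffle=False,
    random_state=0,
)

# Placeholder column names (hypothetical, for display only).
feature_names = ["income", "age", "noise_1", "noise_2", "noise_3", "noise_4"]


def ranked_importances(model, X, y, names):
    """Fit the model and return (name, importance) pairs, largest first."""
    model.fit(X, y)
    importances = model.feature_importances_
    order = np.argsort(importances)[::-1]
    return [(names[i], float(importances[i])) for i in order]


rf_ranking = ranked_importances(
    RandomForestRegressor(n_estimators=200, random_state=0), X, y, feature_names
)
gbm_ranking = ranked_importances(
    GradientBoostingRegressor(random_state=0), X, y, feature_names
)

for label, ranking in (("RandomForest", rf_ranking), ("GradientBoosting", gbm_ranking)):
    print(label)
    for name, score in ranking:
        print(f"  {name:10s} {score:.3f}")
```

The scores are relative (they sum to 1 for each model), so they rank features against each other rather than measure importance on an absolute scale. These impurity-based scores can also be inflated for features with many distinct values, so it is worth cross-checking them, for example with sklearn.inspection.permutation_importance on held-out data.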

Maksim Khaitovich answered Jan 27 '26 12:01



