Attribute's predictive capacity for a particular target in Python, using feature selection in Sklearn

Are there any feature selection methods in Scikit-Learn (or algorithms in general) that give weights for an attribute's ability/predictive-capacity/importance to predict a specific target? For example, using the iris dataset (from sklearn.datasets import load_iris): ranking the weights of each of the 4 attributes for predicting each of the 3 iris species separately, but for much more complex datasets with ~1k-10k attributes.

I'm looking for something analogous to feature_importances_ from RandomForestClassifier. However, RandomForestClassifier gives weights to each attribute for the entire prediction process. The weights do not need to add up to one, but I want to find a way to correlate a specific subset of attributes with a specific target.
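For reference, this is the kind of global ranking I already have (a minimal sketch on iris; the forest's hyperparameters are arbitrary):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    iris = load_iris()
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(iris.data, iris.target)

    # One importance per attribute, for the whole 3-class problem at once --
    # not broken down by species, which is what I'm after.
    for name, importance in zip(iris.feature_names, forest.feature_importances_):
        print(name, round(importance, 3))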

First, I tried "overfitting" the models to enrich for a specific target, but the results didn't seem to change much between targets. Second, I tried going the ordination route by finding which attributes have the greatest variation, but that doesn't directly translate to predictive capacity. Third, I tried sparse models, but I encountered the same problem as with feature_importances_.

A link to an example or tutorial that does exactly this would be sufficient. Possibly a tutorial on how to traverse the decision trees in a random forest and store the nodes that are predictive of specific targets.

asked Nov 23 '16 by O.rka


2 Answers

Single targets

Most models are hardly black boxes, so if you are interested in a specific target, you could simply look at the coefficients of the model and do the model calculation by hand to understand how the model came to its output. E.g.:

  • For a linear model, you simply multiply the inputs by the coefficients and add the bias
  • For a neural network, you need to know all the weights and activation functions and do a few calculations to see how the inputs are transformed into new 'features' in the hidden layers and finally into outputs
  • For a random forest you need to look at the decision boundaries of all the trees in the forest
  • Etc.

Based on such analysis, you could decide what inputs you consider most important.
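As a minimal sketch of the linear case (assuming a logistic regression on iris, not necessarily the model you would use), coef_ already gives one row of weights per target class:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    iris = load_iris()
    clf = LogisticRegression(max_iter=1000).fit(iris.data, iris.target)

    # coef_ has shape (n_classes, n_features): each row holds the weights
    # pushing the model toward one specific target class.
    for target, coefs in zip(iris.target_names, clf.coef_):
        print(target, dict(zip(iris.feature_names, coefs.round(2))))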

Sensitivity analysis

More useful, perhaps, would be to look at how the model output changes when your input values change. This will give you a higher-level insight into how important and sensitive the inputs are. This concept is called sensitivity analysis. For most methods, you could simply do some random sampling on the inputs and analyze the outputs.

This can be useful for feature selection, as insensitive inputs are candidates for pruning.
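A crude sketch of the idea (the model choice and noise scale are illustrative assumptions): perturb one input at a time and measure how far the predicted class probabilities move, which even yields one sensitivity score per target:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    iris = load_iris()
    X, y = iris.data, iris.target
    model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

    rng = np.random.RandomState(0)
    baseline = model.predict_proba(X)
    for j, name in enumerate(iris.feature_names):
        X_noisy = X.copy()
        X_noisy[:, j] += rng.normal(scale=X[:, j].std(), size=len(X))
        # Mean absolute shift in predicted probability, per target class
        shift = np.abs(model.predict_proba(X_noisy) - baseline).mean(axis=0)
        print(name, dict(zip(iris.target_names, shift.round(3))))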

Looking back into the model

Sensitivity analysis is based on the idea of perturbing the input to the model to learn something about how the model comes up with its output. The other way of looking at things would be to take the output and reason backwards into the model and finally the inputs. Such an approach is:

  1. Highly specific to the model technique in question
  2. Complex, since the more non-linear a model is, and the more feature interactions the model has, the harder it is to 'untangle things'.

For a discussion specific to Random Forests, have a look at this Q&A.
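As a rough sketch of what traversing the forest could look like (the heuristic here, counting splits at nodes whose training-sample majority is the target class, is my own assumption, not an established importance measure):

    from collections import Counter
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    iris = load_iris()
    forest = RandomForestClassifier(n_estimators=50, random_state=0)
    forest.fit(iris.data, iris.target)

    target_class = 2  # e.g. virginica
    split_counts = Counter()
    for est in forest.estimators_:
        tree = est.tree_
        for node in range(tree.node_count):
            # children_left == -1 marks a leaf; skip those (no split feature)
            if tree.children_left[node] == -1:
                continue
            # value[node, 0] holds the class distribution of the training
            # samples reaching this node; count the split feature if the
            # majority class is the target we care about
            if tree.value[node, 0].argmax() == target_class:
                split_counts[iris.feature_names[tree.feature[node]]] += 1

    print(split_counts.most_common())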

Visualization techniques can help. Example from a neural network tool that could give insight: http://playground.tensorflow.org/

General feature importance

For general feature importance, i.e. over all targets, you can look at this part of the scikit-learn documentation.

The example here shows how you can do univariate feature selection with the F-test for feature scoring.
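For instance, along the lines of that example (a minimal sketch on iris; k is arbitrary):

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, f_classif

    iris = load_iris()
    selector = SelectKBest(score_func=f_classif, k=2)
    selector.fit(iris.data, iris.target)

    # Higher F-scores indicate stronger univariate association with the target
    for name, score in zip(iris.feature_names, selector.scores_):
        print(name, round(score, 1))
    print("selected:", [iris.feature_names[i]
                        for i in selector.get_support(indices=True)])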

answered Oct 30 '22 by Def_Os


I would manually construct separate binary classification models for each of your different possible target values and compare the models. You could possibly normalize the values, though the numerical values themselves are less informative than the ordering of the variables.

Also, you might want to look at using a logistic regression model as a different way of calculating your feature importances.
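A minimal sketch of both suggestions combined (one binary logistic regression per target on iris; the model choice and settings are just for illustration):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    iris = load_iris()
    X, y = iris.data, iris.target

    for cls, cls_name in enumerate(iris.target_names):
        y_binary = (y == cls).astype(int)  # this target vs. the rest
        clf = LogisticRegression(max_iter=1000).fit(X, y_binary)
        # Rank attributes by absolute coefficient; as noted above, the
        # ordering is more informative than the raw magnitudes
        order = np.argsort(-np.abs(clf.coef_[0]))
        print(cls_name, [iris.feature_names[i] for i in order])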

answered Oct 30 '22 by maxymoo