Are there any feature selection methods in Scikit-Learn (or algorithms in general) that give weights for an attribute's ability (its predictive capacity or importance) to predict a specific target? For example, with the iris dataset (from sklearn.datasets import load_iris), I would like to rank each of the 4 attributes by its weight for predicting each of the 3 iris species separately, but for much more complex datasets with ~1k-10k attributes.
I'm looking for something analogous to the feature_importances_ attribute of RandomForestClassifier. However, RandomForestClassifier gives weights to each attribute for the entire prediction process. The weights do not need to add up to one, but I want to find a way to correlate a specific subset of attributes with a specific target.
First, I tried "overfitting" the models to enrich for a specific target, but the results didn't seem to change much between targets. Second, I tried going the ordination route by finding which attributes have the greatest variation, but that doesn't directly translate to predictive capacity. Third, I tried sparse models, but I encountered the same problem as with feature_importances_.
A link to an example or tutorial that does exactly this is sufficient. Possibly a tutorial on how to traverse decision trees in a random forest and store the nodes that are predictive of specific targets.
Single targets
Most models are hardly black boxes, so if you are interested in a specific target, you could simply look at the coefficients of the model and do the model calculation by hand to understand how the model came to its output. E.g.:
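A minimal sketch of this kind of inspection, assuming a multinomial logistic regression fitted on the iris data from the question. The choice of model, the standardization step, and all variable names are illustrative assumptions rather than something the answer prescribes:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

iris = load_iris()
# Standardize so that coefficient magnitudes are comparable across features
X = StandardScaler().fit_transform(iris.data)
y = iris.target

model = LogisticRegression(max_iter=1000).fit(X, y)

# coef_ has shape (n_classes, n_features): one row of weights per target
for class_index, class_name in enumerate(iris.target_names):
    weights = model.coef_[class_index]
    order = np.argsort(-np.abs(weights))  # largest absolute weight first
    ranked = [(iris.feature_names[i], round(float(weights[i]), 3)) for i in order]
    print(class_name, ranked)

# "By hand": each class score is a dot product of weights and inputs plus an intercept
sample = X[0]
manual_scores = model.coef_ @ sample + model.intercept_
library_scores = model.decision_function(sample.reshape(1, -1))[0]
print(np.allclose(manual_scores, library_scores))
```

Each row of coef_ can be read as a per-target ranking, although correlated features can share or trade weight, so the ordering is a rough guide rather than a definitive importance.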
Based on such analysis, you could decide what inputs you consider most important.
Sensitivity analysis
More useful, perhaps, would be to look at how the model output changes when your input values change. This will give you a higher-level insight into how important and sensitive the inputs are. This concept is called sensitivity analysis. For most methods, you could simply do some random sampling on the inputs and analyze the outputs.
This can be useful for feature selection, as insensitive inputs are candidates for pruning.
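One way to sketch this per target is to shuffle one input column at a time and record how much the predicted probability of each class moves. The random forest, the number of repeats, and the name sensitivity below are assumptions chosen for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
X, y = iris.data, iris.target
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

rng = np.random.default_rng(0)
baseline = model.predict_proba(X)  # shape (n_samples, n_classes)

n_repeats = 10  # hypothetical setting; more repeats give smoother estimates
# sensitivity[c, j]: mean absolute change in P(class c) when feature j is shuffled
sensitivity = np.zeros((len(iris.target_names), X.shape[1]))
for j in range(X.shape[1]):
    for _ in range(n_repeats):
        X_perturbed = X.copy()
        X_perturbed[:, j] = rng.permutation(X[:, j])  # break the link between feature j and the target
        delta = np.abs(model.predict_proba(X_perturbed) - baseline)
        sensitivity[:, j] += delta.mean(axis=0)
sensitivity /= n_repeats

for c, class_name in enumerate(iris.target_names):
    order = np.argsort(-sensitivity[c])
    print(class_name, [iris.feature_names[j] for j in order])
```

Columns of sensitivity that stay near zero for every class are the insensitive inputs mentioned above. scikit-learn also ships sklearn.inspection.permutation_importance, which measures the drop in an overall score rather than a per-class probability shift.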
Looking back into the model
Sensitivity analysis is based on the idea of perturbing the input to the model to learn something about how the model comes up with its output. The other way of looking at things would be to take the output and reason backwards into the model and finally the inputs. Such an approach is:
For a discussion specific to Random Forests, have a look at this Q&A.
Visualization techniques can help. Example from a neural network tool that could give insight: http://playground.tensorflow.org/
General feature importance
For general feature importance, i.e. over all targets, you can look at this part of the scikit-learn documentation.
The example here shows how you can do univariate feature selection with the F-test for feature scoring.
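A short sketch of univariate F-test scoring with SelectKBest and f_classif on the iris data. The per-target loop, which scores each species against the rest, is an assumption added here to connect the technique back to the question; it is not taken from the linked example:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()
X, y = iris.data, iris.target

# General importance: keep the k features with the highest ANOVA F-score over all targets
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
selected = [iris.feature_names[i] for i in selector.get_support(indices=True)]
print(selected)

# Per-target variant (assumption): score each species against all others
per_target_f = {}
for c, class_name in enumerate(iris.target_names):
    y_binary = (y == c).astype(int)
    f_scores, p_values = f_classif(X, y_binary)
    per_target_f[class_name] = f_scores
    order = np.argsort(-f_scores)
    print(class_name, [iris.feature_names[i] for i in order])
```

Because each feature is scored independently, this scales to thousands of attributes, but it ignores interactions between features.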
I would manually construct separate binary classification models for each of your possible target values and compare the models. You could possibly normalize the values; however, the numerical values themselves are less informative than the ordering of the variables.
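The one-model-per-target idea could look roughly like this, assuming a random forest for each binary problem. The classifier choice, its settings, and the rankings dictionary are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
X, y = iris.data, iris.target

rankings = {}
for c, class_name in enumerate(iris.target_names):
    # Binary problem: this target versus everything else
    y_binary = (y == c).astype(int)
    forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y_binary)
    # Store the ordering of the features, not the raw importance values
    order = np.argsort(-forest.feature_importances_)
    rankings[class_name] = [iris.feature_names[i] for i in order]

for class_name, ranked_features in rankings.items():
    print(class_name, ranked_features)
```

Comparing the orderings across targets shows which attributes matter for one class but not for another.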
Also you might want to look at using a logistic regression model for a different way of calculating your feature importances.