Attribute's predictive capacity for a particular target in Python, using feature selection in Sklearn

Are there any feature selection methods in Scikit-Learn (or algorithms in general) that give weights for an attribute's ability/predictive-capacity/importance to predict a specific target? For example, using the iris dataset (from sklearn.datasets import load_iris): ranking the weights of each of the 4 attributes for predicting each of the 3 iris species separately, but for much more complex datasets with ~1k-10k attributes.

I'm looking for something analogous to feature_importances_ from RandomForestClassifier. However, RandomForestClassifier gives weights to each attribute for the entire prediction process. The weights do not need to add up to one, but I want to find a way to correlate a specific subset of attributes with a specific target.
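For reference, this is the kind of global ranking I already have (a minimal sketch on iris; the forest's hyperparameters are arbitrary):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    iris = load_iris()
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(iris.data, iris.target)

    # One importance per attribute, for the whole 3-class problem at once --
    # not broken down by species, which is what I'm after.
    for name, importance in zip(iris.feature_names, forest.feature_importances_):
        print(name, round(importance, 3))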

First, I tried "overfitting" the models to enrich for a specific target, but the results didn't seem to change much between targets. Second, I tried going the ordination route by finding which attributes have the greatest variation, but that doesn't directly translate to predictive capacity. Third, I tried sparse models, but I encountered the same problem as with feature_importances_.

A link to an example or tutorial that does exactly this would be sufficient. Possibly a tutorial on how to traverse the decision trees in a random forest and store the nodes that are predictive of specific targets.

asked Nov 23 '16 by O.rka


2 Answers

Single targets

Most models are hardly black boxes, so if you are interested in a specific target, you could simply look at the coefficients of the model and do the model calculation by hand to understand how the model came to its output. E.g.:

  • For a linear model, you simply multiply the inputs by the coefficients and add the bias
  • For a neural network, you need to know all the weights and activation functions and do a few calculations to see how the inputs are transformed into new 'features' in the hidden layers and finally into outputs
  • For a random forest you need to look at the decision boundaries of all the trees in the forest
  • Etc.

Based on such analysis, you could decide what inputs you consider most important.
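As a minimal sketch of the linear case (assuming a logistic regression on iris, not necessarily the model you would use), coef_ already gives one row of weights per target class:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    iris = load_iris()
    clf = LogisticRegression(max_iter=1000).fit(iris.data, iris.target)

    # coef_ has shape (n_classes, n_features): each row holds the weights
    # pushing the model toward one specific target class.
    for target, coefs in zip(iris.target_names, clf.coef_):
        print(target, dict(zip(iris.feature_names, coefs.round(2))))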

Sensitivity analysis

More useful, perhaps, would be to look at how the model output changes when your input values change. This will give you a higher-level insight into how important and sensitive the inputs are. This concept is called sensitivity analysis. For most methods, you could simply do some random sampling on the inputs and analyze the outputs.

This can be useful for feature selection, as insensitive inputs are candidates for pruning.
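A crude sketch of the idea (the model choice and noise scale are illustrative assumptions): perturb one input at a time and measure how far the predicted class probabilities move, which even yields one sensitivity score per target:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    iris = load_iris()
    X, y = iris.data, iris.target
    model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

    rng = np.random.RandomState(0)
    baseline = model.predict_proba(X)
    for j, name in enumerate(iris.feature_names):
        X_noisy = X.copy()
        X_noisy[:, j] += rng.normal(scale=X[:, j].std(), size=len(X))
        # Mean absolute shift in predicted probability, per target class
        shift = np.abs(model.predict_proba(X_noisy) - baseline).mean(axis=0)
        print(name, dict(zip(iris.target_names, shift.round(3))))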

Looking back into the model

Sensitivity analysis is based on the idea of perturbing the input to the model to learn something about how the model comes up with its output. The other way of looking at things would be to take the output and reason backwards into the model and finally the inputs. Such an approach is:

  1. Highly specific to the model technique in question
  2. Complex, since the more non-linear a model is, and the more feature interactions the model has, the harder it is to 'untangle things'.

For a discussion specific to Random Forests, have a look at this Q&A.
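As a rough sketch of what traversing the forest could look like (the heuristic here, counting splits at nodes whose training-sample majority is the target class, is my own assumption, not an established importance measure):

    from collections import Counter
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    iris = load_iris()
    forest = RandomForestClassifier(n_estimators=50, random_state=0)
    forest.fit(iris.data, iris.target)

    target_class = 2  # e.g. virginica
    split_counts = Counter()
    for est in forest.estimators_:
        tree = est.tree_
        for node in range(tree.node_count):
            # children_left == -1 marks a leaf; skip those (no split feature)
            if tree.children_left[node] == -1:
                continue
            # value[node, 0] holds the class distribution of the training
            # samples reaching this node; count the split feature if the
            # majority class is the target we care about
            if tree.value[node, 0].argmax() == target_class:
                split_counts[iris.feature_names[tree.feature[node]]] += 1

    print(split_counts.most_common())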

Visualization techniques can help. Example from a neural network tool that could give insight: http://playground.tensorflow.org/

General feature importance

For general feature importance, i.e. over all targets, you can look at this part of the scikit-learn documentation.

The example here shows how you can do univariate feature selection with the F-test for feature scoring.
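For instance, along the lines of that example (a minimal sketch on iris; k is arbitrary):

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, f_classif

    iris = load_iris()
    selector = SelectKBest(score_func=f_classif, k=2)
    selector.fit(iris.data, iris.target)

    # Higher F-scores indicate stronger univariate association with the target
    for name, score in zip(iris.feature_names, selector.scores_):
        print(name, round(score, 1))
    print("selected:", [iris.feature_names[i]
                        for i in selector.get_support(indices=True)])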

answered Oct 30 '22 by Def_Os


I would manually construct separate binary classification models for each of your different possible target values and compare the models. You could possibly normalize the values, though the numerical values themselves are less informative than the ordering of the variables.

Also, you might want to look at using a logistic regression model as a different way of calculating your feature importances.
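A minimal sketch of both suggestions combined (one binary logistic regression per target on iris; the model choice and settings are just for illustration):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    iris = load_iris()
    X, y = iris.data, iris.target

    for cls, cls_name in enumerate(iris.target_names):
        y_binary = (y == cls).astype(int)  # this target vs. the rest
        clf = LogisticRegression(max_iter=1000).fit(X, y_binary)
        # Rank attributes by absolute coefficient; as noted above, the
        # ordering is more informative than the raw magnitudes
        order = np.argsort(-np.abs(clf.coef_[0]))
        print(cls_name, [iris.feature_names[i] for i in order])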

answered Oct 30 '22 by maxymoo