I am using Logistic Regression (in scikit-learn) for a binary classification problem and am interested in being able to explain each individual prediction. To be more precise, I am interested in predicting the probability of the positive class and in having a measure of the importance of each feature for that prediction.
Using the coefficients (betas) as a measure of importance is generally a bad idea, as answered here, but I have yet to find a good alternative.
So far the best I have found are the following three options: using the betas directly, a Monte Carlo approach, and a "Leave-one-out" approach. All three seem like poor solutions to me.
Actual question: What is the best way to interpret the importance of each feature, at the moment of a decision, with a linear classifier?
Quick note #1: for Random Forests this is trivial; we can simply use the prediction + bias decomposition, as explained beautifully in this blog post. The problem here is how to do something similar with linear classifiers such as Logistic Regression.
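For reference, a rough sketch of that prediction + bias decomposition for a random forest, assuming the third-party treeinterpreter package that the blog post describes (the dataset and exact calls here are my own illustration, not taken from the post):

    # Sketch: decompose a single random-forest prediction into bias + per-feature
    # contributions using the third-party `treeinterpreter` package
    # (pip install treeinterpreter). Dataset and setup are illustrative assumptions.
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from treeinterpreter import treeinterpreter as ti

    data = load_breast_cancer()
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(data.data, data.target)

    instance = data.data[:1]                      # one row to explain
    prediction, bias, contributions = ti.predict(rf, instance)

    # prediction == bias + sum of contributions (per class)
    print(prediction[0])                          # predicted class probabilities
    print(bias[0])                                # training-set baseline
    print(contributions[0].sum(axis=0))           # per-feature contributions summed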
Quick note #2: there are a number of related questions on stackoverflow (1 2 3 4 5). I have not been able to find an answer to this specific question.
Standardized coefficients and the change in R-squared when a variable is added to the model last can both help identify the more important independent variables in a regression model—from a purely statistical standpoint.
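For the standardized-coefficient part, a minimal sketch (the dataset and pipeline are my own illustration, not part of the answer): scale the features to unit variance before fitting, so the coefficient magnitudes become comparable across features. Note this gives a global ranking, not a per-prediction one.

    # Sketch: standardized coefficients for a logistic regression, obtained by
    # scaling features to unit variance before fitting so that coefficient
    # magnitudes are comparable across features (a global measure).
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    data = load_breast_cancer()
    pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    pipe.fit(data.data, data.target)

    # Coefficients on standardized features, largest magnitude first.
    coefs = pipe.named_steps["logisticregression"].coef_[0]
    for name, c in sorted(zip(data.feature_names, coefs),
                          key=lambda t: abs(t[1]), reverse=True)[:5]:
        print(f"{name}: {c:+.3f}")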
You can get the importance of each feature in your dataset by using the model's feature importance property. Feature importance gives you a score for each feature of your data; the higher the score, the more important or relevant that feature is to your output variable.
Feature importance is calculated as the decrease in node impurity weighted by the probability of reaching that node. The node probability can be calculated as the number of samples that reach the node divided by the total number of samples. The higher the value, the more important the feature.
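Note that this impurity-based score is what scikit-learn's tree-based estimators expose through feature_importances_; LogisticRegression has no such attribute. A minimal sketch, using a random forest purely for illustration:

    # Sketch: impurity-based feature importances from a tree ensemble via
    # scikit-learn's `feature_importances_` (tree-based models only).
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier

    data = load_breast_cancer()
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(data.data, data.target)

    # Scores sum to 1; higher means the feature reduced impurity more on average.
    for name, score in sorted(zip(data.feature_names, rf.feature_importances_),
                              key=lambda t: t[1], reverse=True)[:5]:
        print(f"{name}: {score:.3f}")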
If you want the importance of the features for a particular decision, why not simulate the decision_function (which is provided by scikit-learn, so you can test whether you get the same value) step by step? The decision function for linear classifiers is simply:

    intercept_ + coef_[0]*feature[0] + coef_[1]*feature[1] + ...

The importance of feature i is then just coef_[i]*feature[i]. Of course, this is similar to looking at the magnitude of the coefficients, but since each coefficient is multiplied by the actual feature value, and since this is also what happens under the hood, it might be your best bet.
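A minimal sketch of that decomposition (the dataset is my own illustration): compute coef_[i]*feature[i] for each feature, add intercept_, and check that the sum reproduces decision_function and predict_proba for that instance.

    # Sketch: per-prediction contributions for a linear classifier, reconstructing
    # scikit-learn's decision_function term by term.
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression

    X, y = load_breast_cancer(return_X_y=True)
    clf = LogisticRegression(max_iter=5000).fit(X, y)

    x = X[0]                                    # the instance to explain
    contributions = clf.coef_[0] * x            # coef_[i] * feature[i] for each i
    score = clf.intercept_[0] + contributions.sum()

    # The reconstructed score matches decision_function, and the positive-class
    # probability is its logistic transform.
    assert np.isclose(score, clf.decision_function(X[:1])[0])
    prob = 1.0 / (1.0 + np.exp(-score))
    assert np.isclose(prob, clf.predict_proba(X[:1])[0, 1])

    # Largest contributions (by absolute value) to this particular decision:
    for i in np.argsort(-np.abs(contributions))[:5]:
        print(f"feature {i}: {contributions[i]:+.3f}")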