
How to find the importance of the features for a logistic regression model?

I have a binary prediction model trained with logistic regression. I want to know which features (predictors) are more important for deciding between the positive and the negative class. I know scikit-learn exposes the coef_ attribute, but I don't know whether it is enough to judge importance. I would also like to know how to interpret the coef_ values in terms of importance for the negative and positive classes. I have also read about standardized regression coefficients, but I don't know what they are.

Let's say there are features such as tumor size, tumor weight, etc., used to decide whether a test case is malignant or not malignant. I want to know which of the features are more important for the malignant and the not-malignant prediction. Does that make sense?

mgokhanbakal asked Dec 02 '15 20:12




1 Answer

One of the simplest ways to get a feel for the "influence" of a given feature in a linear classification model (logistic regression being one of those) is to consider the magnitude of its coefficient times the standard deviation of the corresponding feature in the data.

Consider this example:

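```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Three features with very different spreads (std of about 1, 4 and 0.5)
x1 = np.random.randn(100)
x2 = 4*np.random.randn(100)
x3 = 0.5*np.random.randn(100)
# Every feature enters with weight 1; a little per-sample noise is added
y = (3 + x1 + x2 + x3 + 0.2*np.random.randn(100)) > 0
X = np.column_stack([x1, x2, x3])

m = LogisticRegression()
m.fit(X, y)

# Raw coefficients (the data were generated with weight 1 on every feature):
print(m.coef_)

# Scaled by each feature's standard deviation, the second feature
# (the one with the largest spread) comes out as the most influential:
print(np.std(X, 0)*m.coef_)
```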

An alternative way to get a similar result is to examine the coefficients of the model fit on standardized features (these are the standardized regression coefficients):

```python
# Refit on features divided by their standard deviation
m.fit(X / np.std(X, 0), y)
print(m.coef_)
```
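As a rough sketch of how the standardized coefficients might be read for the positive and the negative class, the snippet below fits on features scaled with StandardScaler and ranks them by absolute coefficient. The feature names and the synthetic data are made up for illustration and are not from the question. A positive coefficient means larger values push the prediction towards the class stored in `m.classes_[1]`; a negative one pushes towards `m.classes_[0]`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 2000

# Hypothetical feature names and synthetic data, for illustration only
names = ["size", "weight", "age"]
X = np.column_stack([
    rng.normal(size=n),        # std of about 1
    4 * rng.normal(size=n),    # std of about 4
    0.5 * rng.normal(size=n),  # std of about 0.5
])
# "size" and "weight" raise the odds of the positive class, "age" lowers them
logits = X[:, 0] + X[:, 1] - X[:, 2]
y = (logits + rng.logistic(size=n)) > 0

X_std = StandardScaler().fit_transform(X)  # zero mean, unit variance per column
m = LogisticRegression().fit(X_std, y)

coefs = m.coef_[0]                  # shape (n_features,) for a binary problem
order = np.argsort(-np.abs(coefs))  # largest magnitude first
ranking = [(names[i], float(coefs[i])) for i in order]
for name, c in ranking:
    direction = "towards" if c > 0 else "away from"
    print(f"{name}: {c:+.2f} (pushes {direction} class {m.classes_[1]})")
```

The magnitude gives the ranking and the sign gives the direction, so one fitted model describes both classes at once.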

Note that this is the most basic approach; a number of other techniques for finding feature importance or parameter influence exist (using p-values, bootstrap scores, various "discriminative indices", etc.).
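Of the techniques just mentioned, bootstrap scores are perhaps the easiest to sketch. The snippet below is only one possible reading of that idea, on synthetic data: refit the model on rows resampled with replacement and look at how much "coefficient times standard deviation" varies across refits. The number of resamples (200) and the 95% percentile interval are arbitrary choices made for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 1000

# Synthetic data: three features with different spreads, equal true weights
X = np.column_stack([
    rng.normal(size=n),
    4 * rng.normal(size=n),
    0.5 * rng.normal(size=n),
])
y = (X.sum(axis=1) + rng.logistic(size=n)) > 0
scale = X.std(axis=0)  # per-feature standard deviation of the full data

n_boot = 200  # arbitrary number of bootstrap resamples
boot = np.empty((n_boot, X.shape[1]))
for b in range(n_boot):
    idx = rng.integers(0, n, size=n)  # draw row indices with replacement
    m = LogisticRegression().fit(X[idx], y[idx])
    boot[b] = m.coef_[0] * scale      # coefficient times feature std

# Percentile interval of the scaled coefficient for each feature
lo, hi = np.percentile(boot, [2.5, 97.5], axis=0)
for j, (l, h) in enumerate(zip(lo, hi)):
    print(f"feature {j}: 95% interval [{l:.2f}, {h:.2f}]")
```

A feature whose interval is narrow and far from zero has a stable influence; an interval that straddles zero suggests the ranking for that feature should not be trusted much.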

I am pretty sure you would get more interesting answers at https://stats.stackexchange.com/.

KT. answered Sep 22 '22 08:09