I have a binary prediction model trained with the logistic regression algorithm, and I want to know which features (predictors) are most important for the decision between the positive and negative class. I know scikit-learn exposes a coef_ attribute, but I don't know whether it is enough on its own to measure importance. Another question is how to interpret the coef_ values with respect to the negative and positive classes. I have also read about standardized regression coefficients, but I don't know what they are.
Say there are features like tumor size, tumor weight, etc., used to decide whether a test case is malignant or not. I want to know which of the features are more important for the malignant and not-malignant predictions. Does that make sense?
Logistic Regression Feature Importance
We can fit a LogisticRegression model on a classification dataset and retrieve the coef_ property, which contains the coefficients found for each input variable. These coefficients can provide the basis for a crude feature importance score.
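For instance, a minimal sketch (using a synthetic dataset from make_classification as a stand-in, since the exact dataset is not specified here):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Synthetic binary classification data as a placeholder.
    X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
    model = LogisticRegression().fit(X, y)

    # coef_ has shape (1, n_features) for a binary problem; positive values
    # push the prediction toward the positive class, negative values away.
    for i, c in enumerate(model.coef_[0]):
        print(f"feature {i}: {c:+.3f}")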
The concept is really straightforward: we measure the importance of a feature by calculating the increase in the model's prediction error after permuting the feature. A feature is "important" if shuffling its values increases the model error, because in that case the model relied on the feature for its predictions.
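scikit-learn ships a ready-made implementation of this idea, sklearn.inspection.permutation_importance (available from scikit-learn 0.22 on); the synthetic data below is only a placeholder:

    from sklearn.datasets import make_classification
    from sklearn.inspection import permutation_importance
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = LogisticRegression().fit(X_train, y_train)

    # Shuffle each feature in turn and record how much the held-out
    # score drops; a larger drop means a more important feature.
    result = permutation_importance(model, X_test, y_test, n_repeats=30,
                                    random_state=0)
    for i, imp in enumerate(result.importances_mean):
        print(f"feature {i}: mean score drop = {imp:.3f}")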
Consider the iris dataset: its features are sepal length, sepal width, petal length, and petal width, and its target classes are setosa, versicolor, and virginica. Because the target has 3 classes, a one-vs-rest logistic regression builds 3 different binary classification models.
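To illustrate (note that recent scikit-learn versions fit a single multinomial model by default rather than three one-vs-rest models, but either way coef_ comes back with one row of coefficients per class, so you can read off per-class importances):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    iris = load_iris()
    model = LogisticRegression(max_iter=1000).fit(iris.data, iris.target)

    # coef_ has shape (3, 4): one row per class, one column per feature.
    print(model.coef_.shape)
    for name, row in zip(iris.target_names, model.coef_):
        print(name, row)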
One of the simplest options to get a feeling for the "influence" of a given parameter in a linear classification model (logistic regression being one of those) is to consider the magnitude of its coefficient times the standard deviation of the corresponding parameter in the data.
Consider this example:
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Three features with very different scales but equal "true" weights.
    x1 = np.random.randn(100)
    x2 = 4 * np.random.randn(100)
    x3 = 0.5 * np.random.randn(100)
    y = (3 + x1 + x2 + x3 + 0.2 * np.random.randn(100)) > 0
    X = np.column_stack([x1, x2, x3])

    m = LogisticRegression()
    m.fit(X, y)

    # The estimated coefficients will all be around 1:
    print(m.coef_)

    # Those values, however, will show that the second parameter
    # is more influential:
    print(np.std(X, 0) * m.coef_)
An alternative way to get a similar result is to examine the coefficients of the model fit on standardized parameters:
    m.fit(X / np.std(X, 0), y)
    print(m.coef_)
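If you prefer, the same standardization can be folded into a pipeline; the sketch below continues from the example above and uses StandardScaler, which also centers the features (unlike the plain division):

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Reusing X and y from the example above.
    pipe = make_pipeline(StandardScaler(), LogisticRegression())
    pipe.fit(X, y)
    print(pipe.named_steps["logisticregression"].coef_)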
Note that this is the most basic approach, and a number of other techniques for finding feature importance or parameter influence exist (using p-values, bootstrap scores, various "discriminative indices", etc.).
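As a rough sketch of the bootstrap option mentioned above, you can refit the model on resampled data and look at the spread of the scaled coefficients (again continuing from the example above):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.utils import resample

    rng = np.random.RandomState(0)
    scaled_coefs = []
    for _ in range(200):
        # Draw a bootstrap sample (with replacement) and refit.
        Xb, yb = resample(X, y, random_state=rng)
        mb = LogisticRegression().fit(Xb, yb)
        scaled_coefs.append((np.std(Xb, 0) * mb.coef_)[0])

    scaled_coefs = np.array(scaled_coefs)
    # Features whose scaled coefficient is large and stable across
    # resamples are the ones the model reliably leans on.
    print("mean:", scaled_coefs.mean(axis=0))
    print("std: ", scaled_coefs.std(axis=0))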
I am pretty sure you would get more interesting answers at https://stats.stackexchange.com/.