I'm pretty sure this has been asked before, but I'm unable to find an answer.
Running logistic regression with sklearn in Python, I'm able to transform my dataset to its most important features using the transform method:
from sklearn import linear_model

classf = linear_model.LogisticRegression()
func = classf.fit(Xtrain, ytrain)
# transform() selects features based on coefficient magnitude
# (this method was removed in later scikit-learn versions; SelectFromModel is the replacement)
reduced_train = func.transform(Xtrain)
How can I tell which features were selected as most important? More generally, how can I calculate the p-value of each feature in the dataset?
Take the iris dataset as an example: its features are sepal length, sepal width, petal length and petal width, and its target classes are setosa, versicolor and virginica. Because the target has 3 classes, logistic regression in the one-vs-rest setting builds 3 separate binary classification models, one per class.
Logistic regression feature importance: we can fit a LogisticRegression model on the dataset and retrieve the coef_ property, which contains the coefficients found for each input feature. These coefficients can provide the basis for a crude feature importance score, as in the sketch below.
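To illustrate both points, here is a minimal sketch on the iris dataset (loading it through scikit-learn's load_iris is an assumption; any array with the four features would do):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
clf = LogisticRegression(max_iter=1000).fit(iris.data, iris.target)

print(clf.coef_.shape)  # (3, 4): one row of coefficients per class
# Absolute coefficient size as a crude per-class importance score
for cls, coefs in zip(iris.target_names, clf.coef_):
    print(cls, dict(zip(iris.feature_names, abs(coefs).round(2))))

Coefficient magnitudes are only directly comparable when the features are on similar scales, which is the point of the scaling answer further below.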
The permutation feature importance is defined to be the decrease in a model score when a single feature value is randomly shuffled [1]. This procedure breaks the relationship between the feature and the target, thus the drop in the model score is indicative of how much the model depends on the feature.
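scikit-learn implements this as permutation_importance in sklearn.inspection. A minimal sketch on the same iris model (n_repeats=10 and random_state=0 are arbitrary choices for illustration, and ideally you would score on a held-out set rather than the training data):

from sklearn.datasets import load_iris
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

iris = load_iris()
clf = LogisticRegression(max_iter=1000).fit(iris.data, iris.target)

# Shuffle each feature column n_repeats times and record the drop in accuracy
result = permutation_importance(clf, iris.data, iris.target,
                                n_repeats=10, random_state=0)
for name, mean, std in zip(iris.feature_names,
                           result.importances_mean,
                           result.importances_std):
    print(f"{name}: mean drop in accuracy {mean:.3f} (+/- {std:.3f})")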
The Rule of 10 is not a rule that specifies how many features you are permitted to use. It is descriptive, not prescriptive, and only an approximate guideline: if the number of instances is much smaller than 10 times the number of features (for example, fewer than about 40 instances for 4 features), you're at especially high risk of overfitting and may get poor results.
As suggested in the comments above, you can (and should) scale your data before fitting, which makes the coefficients comparable. Below is a little code showing how this would work, in a format that makes the comparison easy to read.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import pandas as pd
import matplotlib.pyplot as plt
# Simulate three independent standard-normal features
x1 = np.random.randn(100)
x2 = np.random.randn(100)
x3 = np.random.randn(100)
# Make the target depend on the features with different strengths
y = (3 + x1 + 2*x2 + 5*x3 + 0.2*np.random.randn(100)) > 0
X = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3})
# Scale your data
scaler = StandardScaler()
scaler.fit(X)
X_scaled = pd.DataFrame(scaler.transform(X), columns=X.columns)
clf = LogisticRegression(random_state=0)
clf.fit(X_scaled, y)

# Use the absolute value of each coefficient as an importance score,
# scaled so the most important feature is 100
feature_importance = abs(clf.coef_[0])
feature_importance = 100.0 * (feature_importance / feature_importance.max())

# Order the features from least to most important for plotting
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5
# Horizontal bar chart of the relative importances
featfig = plt.figure()
featax = featfig.add_subplot(1, 1, 1)
featax.barh(pos, feature_importance[sorted_idx], align='center')
featax.set_yticks(pos)
featax.set_yticklabels(np.array(X.columns)[sorted_idx], fontsize=8)
featax.set_xlabel('Relative Feature Importance')
plt.tight_layout()
plt.show()