How to get feature importance in logistic regression using weights?

I have a dataset of reviews with a positive/negative class label, and I am applying logistic regression to it. First, I convert the reviews into a bag-of-words representation. Here sorted_data['Text'] contains the reviews and final_counts is the resulting sparse matrix:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler

count_vect = CountVectorizer()
final_counts = count_vect.fit_transform(sorted_data['Text'].values)
standardized_data = StandardScaler(with_mean=False).fit_transform(final_counts)

Then I split the data set into train and test sets:

from sklearn.model_selection import train_test_split

X_1, X_test, y_1, y_test = train_test_split(final_counts, labels, test_size=0.3, random_state=0)
X_tr, X_cv, y_tr, y_cv = train_test_split(X_1, y_1, test_size=0.3)

I apply the logistic regression algorithm as follows:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# note: scikit-learn's C is the inverse of the regularization strength lambda
optimal_lambda = 0.001000
log_reg_optimal = LogisticRegression(C=optimal_lambda)

# fitting the model
log_reg_optimal.fit(X_tr, y_tr)

# predict the response
pred = log_reg_optimal.predict(X_test)

# evaluate accuracy
acc = accuracy_score(y_test, pred) * 100
print('\nThe accuracy of the Logistic Regression for C = %f is %f%%' % (optimal_lambda, acc))

My weights are

weights = log_reg_optimal.coef_   # <class 'numpy.ndarray'>

array([[-0.23729528, -0.16050616, -0.1382504 , ...,  0.27291847,
         0.35857267,  0.41756443]])
(1, 38178) #shape of weights

I want to get the feature importance, i.e. the top 100 features which have the highest weights. Could anyone tell me how to get them?

merkle asked Jul 22 '18 at 07:07

People also ask

Is it possible to get feature importance from the weights of Hyperplane in logistic regression?

Logistic regression is an inherently binary classification algorithm: it tries to find the best hyperplane in k-dimensional space that separates the two classes while minimizing logistic loss. The k-dimensional weight vector can be used to get feature importance.

How is feature importance calculated in logistic regression?

We can fit a LogisticRegression model on the dataset and retrieve the coef_ attribute that contains the coefficient found for each input variable. These coefficients can provide the basis for a crude feature importance score.
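For instance, a minimal sketch of this idea on synthetic data (the dataset and parameters below are only illustrative):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# toy binary-classification data with 5 features
X, y = make_classification(n_samples=200, n_features=5, n_informative=3, random_state=0)

model = LogisticRegression().fit(X, y)

# one coefficient per input feature; its magnitude is a crude importance score
for i, coef in enumerate(model.coef_[0]):
    print('feature %d: %+.3f' % (i, coef))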

How do you interpret weights in logistic regression?

For example, if you have odds of 2, it means that the probability for y=1 is twice as high as for y=0. If you have a weight (= log odds ratio) of 0.7, then increasing the respective feature by one unit multiplies the odds by exp(0.7) (approximately 2) and the odds change to 4.
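A quick numeric check of that statement (assuming starting odds of 2 and a weight of 0.7):

import numpy as np

odds = 2.0      # current odds for y=1 versus y=0
weight = 0.7    # log odds ratio of the feature

# a one-unit increase in the feature multiplies the odds by exp(weight)
print(np.exp(weight))          # ~2.01
print(odds * np.exp(weight))   # ~4.03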

How do you calculate feature importance?

For tree-based models, feature importance is calculated as the decrease in node impurity weighted by the probability of reaching that node. The node probability can be calculated as the number of samples that reach the node divided by the total number of samples. The higher the value, the more important the feature.
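That impurity-based score is what tree ensembles in scikit-learn expose directly; a minimal sketch on synthetic data (the dataset and parameters are only illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=5, n_informative=3, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# impurity-based importances, one per feature, summing to 1
print(forest.feature_importances_)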


2 Answers

One way to investigate the "influence" or "importance" of a given feature / parameter in a linear classification model is to consider the magnitude of the coefficients.

This is the most basic approach. Other techniques for finding feature importance or parameter influence could provide more insight, such as using p-values, bootstrap scores, various "discriminative indices", etc.
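As one example of such an alternative, a rough bootstrap of the coefficients (refit on resampled training data and look at the spread of each weight) might look like the sketch below; X_tr, y_tr and the C value are taken from your question, and the number of resamples is arbitrary:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

boot_coefs = []
for seed in range(100):
    # resample the training set with replacement and refit the model
    X_b, y_b = resample(X_tr, y_tr, random_state=seed)
    boot_coefs.append(LogisticRegression(C=0.001).fit(X_b, y_b).coef_[0])

boot_coefs = np.array(boot_coefs)   # shape (n_resamples, n_features)
print(boot_coefs.std(axis=0))       # low spread = more stable coefficient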


Here, since you have standardized the data, you can use the coefficient magnitudes directly:

import numpy as np

weights = log_reg_optimal.coef_
abs_weights = np.abs(weights)

print(abs_weights)

If you look at the original (signed) weights, a negative coefficient means that a higher value of the corresponding feature pushes the classification more towards the negative class.


EDIT 1

Example showing how to obtain the feature names:

import numpy as np

# feature names
names_of_variables = np.array(['a', 'b', 'c', 'd'])

#create random weights and get the magnitude
weights = np.random.rand(4)
abs_weights = np.abs(weights)

#get the sorting indices
sorted_index = np.argsort(abs_weights)[::-1]

#check if the sorting indices are correct
print(abs_weights[sorted_index])

#get the index of the top-2 features
top_2 = sorted_index[:2]

#get the names of the top 2 most important features
print(names_of_variables[top_2])
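
Applied to your model, the same idea maps the coefficient magnitudes back to the CountVectorizer vocabulary; this is a sketch assuming count_vect and log_reg_optimal are the fitted objects from your question:

import numpy as np

# vocabulary in the same order as the columns of the sparse matrix
# (on older scikit-learn versions use count_vect.get_feature_names())
feature_names = np.array(count_vect.get_feature_names_out())

abs_weights = np.abs(log_reg_optimal.coef_[0])

# indices of the 100 largest magnitudes, most important first
top_100 = np.argsort(abs_weights)[::-1][:100]
print(feature_names[top_100])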
seralouk answered Oct 16 '22 at 10:10


If you are using a logistic regression model, you can use the Recursive Feature Elimination (RFE) method to select important features and filter out redundant features from the predictor list. It is available in the scikit-learn library. You can refer to the following link for detailed information: https://machinelearningmastery.com/feature-selection-machine-learning-python/

This method ranks the features by importance, and you can select the top n features required for your further analysis.
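A minimal sketch of RFE with a logistic regression estimator (X_tr, y_tr and the C value are from the question; n_features_to_select and step are arbitrary choices here):

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# recursively drop the weakest 10% of features until 100 remain
selector = RFE(LogisticRegression(C=0.001), n_features_to_select=100, step=0.1)
selector.fit(X_tr, y_tr)

print(selector.support_)   # boolean mask of the selected features
print(selector.ranking_)   # rank 1 = selected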

Gaurav Sitaula answered Oct 16 '22 at 10:10