How to get feature importance in logistic regression using weights?

I have a dataset of reviews with a positive/negative class label, and I am applying logistic regression to it. First, I convert the reviews into a bag-of-words representation. Here sorted_data['Text'] contains the reviews and final_counts is the resulting sparse matrix:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler

count_vect = CountVectorizer()
final_counts = count_vect.fit_transform(sorted_data['Text'].values)
standardized_data = StandardScaler(with_mean=False).fit_transform(final_counts)

Then I split the data set into train and test sets:

from sklearn.model_selection import train_test_split

X_1, X_test, y_1, y_test = train_test_split(final_counts, labels, test_size=0.3, random_state=0)
X_tr, X_cv, y_tr, y_cv = train_test_split(X_1, y_1, test_size=0.3)

I apply the logistic regression algorithm as follows:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# note: scikit-learn's C is the inverse of the regularization strength lambda
optimal_lambda = 0.001000
log_reg_optimal = LogisticRegression(C=optimal_lambda)

# fitting the model
log_reg_optimal.fit(X_tr, y_tr)

# predict the response
pred = log_reg_optimal.predict(X_test)

# evaluate accuracy
acc = accuracy_score(y_test, pred) * 100
print('\nThe accuracy of the Logistic Regression for C = %f is %f%%' % (optimal_lambda, acc))

My weights are

weights = log_reg_optimal.coef_   # <class 'numpy.ndarray'>

array([[-0.23729528, -0.16050616, -0.1382504 , ...,  0.27291847,
         0.35857267,  0.41756443]])
(1, 38178) #shape of weights

I want to get the feature importance, i.e. the top 100 features which have the highest weights. Could anyone tell me how to get them?

merkle asked Jul 22 '18 at 07:07

People also ask

Is it possible to get feature importance from the weights of Hyperplane in logistic regression?

Logistic regression is an inherently binary classification algorithm: it tries to find the best hyperplane in k-dimensional space that separates the two classes while minimizing logistic loss. The k-dimensional weight vector can be used to get feature importance.

How is feature importance calculated in logistic regression?

We can fit a LogisticRegression model on the dataset and retrieve the coef_ attribute that contains the coefficient found for each input variable. These coefficients can provide the basis for a crude feature importance score.
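For instance, a minimal sketch of this idea on synthetic data (the dataset and parameters below are only illustrative):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# toy binary-classification data with 5 features
X, y = make_classification(n_samples=200, n_features=5, n_informative=3, random_state=0)

model = LogisticRegression().fit(X, y)

# one coefficient per input feature; its magnitude is a crude importance score
for i, coef in enumerate(model.coef_[0]):
    print('feature %d: %+.3f' % (i, coef))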

How do you interpret weights in logistic regression?

For example, if you have odds of 2, it means that the probability for y=1 is twice as high as for y=0. If you have a weight (= log odds ratio) of 0.7, then increasing the respective feature by one unit multiplies the odds by exp(0.7) (approximately 2) and the odds change to 4.
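A quick numeric check of that statement (assuming starting odds of 2 and a weight of 0.7):

import numpy as np

odds = 2.0      # current odds for y=1 versus y=0
weight = 0.7    # log odds ratio of the feature

# a one-unit increase in the feature multiplies the odds by exp(weight)
print(np.exp(weight))          # ~2.01
print(odds * np.exp(weight))   # ~4.03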

How do you calculate feature importance?

For tree-based models, feature importance is calculated as the decrease in node impurity weighted by the probability of reaching that node. The node probability can be calculated as the number of samples that reach the node divided by the total number of samples. The higher the value, the more important the feature.
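That impurity-based score is what tree ensembles in scikit-learn expose directly; a minimal sketch on synthetic data (the dataset and parameters are only illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=5, n_informative=3, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# impurity-based importances, one per feature, summing to 1
print(forest.feature_importances_)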


2 Answers

One way to investigate the "influence" or "importance" of a given feature / parameter in a linear classification model is to consider the magnitude of the coefficients.

This is the most basic approach. Other techniques for finding feature importance or parameter influence could provide more insight, such as using p-values, bootstrap scores, various "discriminative indices", etc.
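As one example of such an alternative, a rough bootstrap of the coefficients (refit on resampled training data and look at the spread of each weight) might look like the sketch below; X_tr, y_tr and the C value are taken from your question, and the number of resamples is arbitrary:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

boot_coefs = []
for seed in range(100):
    # resample the training set with replacement and refit the model
    X_b, y_b = resample(X_tr, y_tr, random_state=seed)
    boot_coefs.append(LogisticRegression(C=0.001).fit(X_b, y_b).coef_[0])

boot_coefs = np.array(boot_coefs)   # shape (n_resamples, n_features)
print(boot_coefs.std(axis=0))       # low spread = more stable coefficient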


Here, since you have standardized the data, you can use the coefficient magnitudes directly:

import numpy as np

weights = log_reg_optimal.coef_
abs_weights = np.abs(weights)

print(abs_weights)

If you look at the original (signed) weights, a negative coefficient means that a higher value of the corresponding feature pushes the classification more towards the negative class.


EDIT 1

Example showing how to obtain the feature names:

import numpy as np

# feature names
names_of_variables = np.array(['a', 'b', 'c', 'd'])

#create random weights and get the magnitude
weights = np.random.rand(4)
abs_weights = np.abs(weights)

#get the sorting indices
sorted_index = np.argsort(abs_weights)[::-1]

#check if the sorting indices are correct
print(abs_weights[sorted_index])

#get the index of the top-2 features
top_2 = sorted_index[:2]

#get the names of the top 2 most important features
print(names_of_variables[top_2])
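
Applied to your model, the same idea maps the coefficient magnitudes back to the CountVectorizer vocabulary; this is a sketch assuming count_vect and log_reg_optimal are the fitted objects from your question:

import numpy as np

# vocabulary in the same order as the columns of the sparse matrix
# (on older scikit-learn versions use count_vect.get_feature_names())
feature_names = np.array(count_vect.get_feature_names_out())

abs_weights = np.abs(log_reg_optimal.coef_[0])

# indices of the 100 largest magnitudes, most important first
top_100 = np.argsort(abs_weights)[::-1][:100]
print(feature_names[top_100])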
seralouk answered Oct 16 '22 at 10:10


If you are using a logistic regression model, you can use the Recursive Feature Elimination (RFE) method to select important features and filter out redundant features from the predictor list. It is available in the scikit-learn library. You can refer to the following link for detailed information: https://machinelearningmastery.com/feature-selection-machine-learning-python/

This method ranks the features by importance, and you can select the top n features required for your further analysis.
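A minimal sketch of RFE with a logistic regression estimator (X_tr, y_tr and the C value are from the question; n_features_to_select and step are arbitrary choices here):

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# recursively drop the weakest 10% of features until 100 remain
selector = RFE(LogisticRegression(C=0.001), n_features_to_select=100, step=0.1)
selector.fit(X_tr, y_tr)

print(selector.support_)   # boolean mask of the selected features
print(selector.ranking_)   # rank 1 = selected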

Gaurav Sitaula answered Oct 16 '22 at 10:10