I am using Python's sklearn random forest (ensemble.RandomForestClassifier) to do classification, and I am using feature_importances_ to find significant features for the classifier. Now my code is:
from collections import Counter

import numpy as np
from sklearn import ensemble
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import LabelEncoder

# database, labels_raw, user_ids_raw and LeavePUsersOut are defined elsewhere in my project
venue_feature_start = []
for trip in database:
    venue_feature_start.append(Counter(trip['POI']))
    # Counter(trip['POI']) is like Counter({'school':1, 'hospital':1, 'bus station':2}), actually key is the feature

feat_loc_vectorizer = DictVectorizer()
feat_loc_vectorizer.fit(venue_feature_start)
feat_loc_orig_mat = feat_loc_vectorizer.transform(venue_feature_start)

orig_tfidf = TfidfTransformer()
orig_ven_feat = orig_tfidf.fit_transform(feat_loc_orig_mat.tocsr())
# so DictVectorizer() and TfidfTransformer() help me to parse the features; for each instance the feature dimension is 580, which means that there are 580 venue types

data = orig_ven_feat.tocsr()

le = LabelEncoder()
labels = le.fit_transform(labels_raw)
if "Unlabelled" in labels_raw:
    unlabelled_int = int(le.transform(["Unlabelled"]))
else:
    unlabelled_int = -1

valid_rows_idx = np.where(labels != unlabelled_int)[0]
labels = labels[valid_rows_idx]
user_ids = np.asarray(user_ids_raw)
# user_ids is for cross validation, labels is for classification

clf = ensemble.RandomForestClassifier(n_estimators=50)
cv_indices = LeavePUsersOut(user_ids[valid_rows_idx], n_folds=10)
data = data[valid_rows_idx, :].toarray()

for train_ind, test_ind in cv_indices:
    train_data = data[train_ind, :]
    test_data = data[test_ind, :]
    labels_train = labels[train_ind]
    labels_test = labels[test_ind]

    print("Training classifier...")
    clf.fit(train_data, labels_train)

importances = clf.feature_importances_
Now the problem is that I get an array of dimension 580 (the same as the feature dimension) when I use feature_importances_, and I want to know the top 20 important features (the top 20 important venues).
I think at least what I need are the indices of the 20 biggest numbers in importances, but I don't know:
How to get the indices of the top 20 values from importances
Since I used DictVectorizer and TfidfTransformer, how to match those indices with the real venue names ('school', 'home', ...)
Any idea to help me? Thank you very much!
Feature importance is calculated as the decrease in node impurity weighted by the probability of reaching that node. The node probability can be calculated as the number of samples that reach the node divided by the total number of samples. The higher the value, the more important the feature.
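For intuition, here is a minimal sketch of that calculation for a single fitted tree, using the tree_ attributes that scikit-learn exposes (the forest's feature_importances_ is essentially this, averaged over all trees in the ensemble):

import numpy as np

def impurity_importances(fitted_tree):
    # Recompute impurity-based importances for one fitted decision tree.
    t = fitted_tree.tree_
    total_weight = t.weighted_n_node_samples[0]   # weighted samples reaching the root
    importances = np.zeros_like(fitted_tree.feature_importances_)

    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:                            # leaf: no split, no impurity decrease
            continue
        # probability of reaching a node = weighted samples at the node / total samples
        p_node = t.weighted_n_node_samples[node] / total_weight
        p_left = t.weighted_n_node_samples[left] / total_weight
        p_right = t.weighted_n_node_samples[right] / total_weight
        # impurity decrease of this split, weighted by the probability of reaching the node
        importances[t.feature[node]] += (p_node * t.impurity[node]
                                         - p_left * t.impurity[left]
                                         - p_right * t.impurity[right])

    return importances / importances.sum() if importances.sum() > 0 else importances

For a tree taken from your forest, impurity_importances(clf.estimators_[0]) should match clf.estimators_[0].feature_importances_.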
Feature importance refers to techniques that calculate a score for all the input features of a given model; the scores simply represent the "importance" of each feature. A higher score means that the specific feature has a larger effect on the model being used to predict the target variable.
You can also calculate feature importance with measures such as permutation importance or SHAP values. For each model with computed feature importance, get the ranking of the features, compute the median rank of each feature by aggregating its ranks across all models, and sort the aggregated list by that median rank.
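As a rough sketch of that idea, the snippet below uses scikit-learn's permutation_importance (available in sklearn.inspection since version 0.22), computed per cross-validation fold and aggregated by median rank. It assumes data, labels, clf and cv_indices from your question, and that cv_indices can be iterated to give (train, test) index pairs:

import numpy as np
from scipy.stats import rankdata
from sklearn.inspection import permutation_importance

fold_ranks = []
for train_ind, test_ind in cv_indices:
    clf.fit(data[train_ind, :], labels[train_ind])
    # permutation importance: drop in test score when one feature's values are shuffled
    result = permutation_importance(clf, data[test_ind, :], labels[test_ind],
                                    n_repeats=5, random_state=0)
    # rank 1 = most important feature in this fold
    fold_ranks.append(rankdata(-result.importances_mean))

# median rank of each feature across all folds; a lower median rank means more important
median_rank = np.median(np.vstack(fold_ranks), axis=0)
top20_idx = np.argsort(median_rank)[:20]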
To get the importance for each feature name, just iterate through the column names and feature_importances_ together (they map to each other):
for feat, importance in zip(df.columns, clf.feature_importances_):
    print('feature: {f}, importance: {i}'.format(f=feat, i=importance))
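In your case there is no DataFrame, but the same mapping exists: DictVectorizer keeps the venue names in the same order as the columns of the matrix it produces. A minimal sketch under that assumption, using your fitted feat_loc_vectorizer and clf, with np.argsort to pick the 20 largest importances (get_feature_names_out requires scikit-learn >= 1.0; on older versions use get_feature_names() instead):

import numpy as np

# venue names in the same column order as the vectorized feature matrix
feature_names = np.asarray(feat_loc_vectorizer.get_feature_names_out())

importances = clf.feature_importances_

# indices of the 20 largest importances, biggest first
top20_idx = np.argsort(importances)[::-1][:20]

for idx in top20_idx:
    print('venue: {f}, importance: {i:.4f}'.format(f=feature_names[idx], i=importances[idx]))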