
python: How to get real feature name from feature_importances

I am using scikit-learn's random forest (ensemble.RandomForestClassifier) for classification, and I want to use feature_importances_ to find the most significant features for the classifier. My code so far:

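from collections import Counter

import numpy as np
from sklearn import ensemble
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import LabelEncoder

venue_feature_start = []
for trip in database:
    venue_feature_start.append(Counter(trip['POI']))
# Counter(trip['POI']) looks like Counter({'school': 1, 'hospital': 1, 'bus station': 2}); each key is a feature name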

feat_loc_vectorizer = DictVectorizer()
feat_loc_vectorizer.fit(venue_feature_start)
feat_loc_orig_mat = feat_loc_vectorizer.transform(venue_feature_start)

orig_tfidf = TfidfTransformer()
orig_ven_feat = orig_tfidf.fit_transform(feat_loc_orig_mat.tocsr())

# DictVectorizer() and TfidfTransformer() build the feature matrix; each instance has 580 features, one per venue type

data = orig_ven_feat.tocsr()

le = LabelEncoder() 
labels = le.fit_transform(labels_raw)
if "Unlabelled" in labels_raw:
    unlabelled_int = int(le.transform(["Unlabelled"]))
else:
    unlabelled_int = -1

valid_rows_idx = np.where(labels!=unlabelled_int)[0]  
labels = labels[valid_rows_idx]
user_ids = np.asarray(user_ids_raw)
# user_ids is for cross validation, labels is for classification 

clf = ensemble.RandomForestClassifier(n_estimators = 50)
cv_indices = LeavePUsersOut(user_ids[valid_rows_idx], n_folds = 10)                      
data = data[valid_rows_idx,:].toarray()
for train_ind, test_ind in cv_indices:
    train_data = data[train_ind,:]
    test_data = data[test_ind,:]
    labels_train = labels[train_ind]
    labels_test = labels[test_ind]

    print ("Training classifier...")
    clf.fit(train_data,labels_train)
    importances = clf.feature_importances_

The problem is that feature_importances_ gives me an array of length 580 (the same as the feature dimension), and I want to know the top 20 most important features (the top 20 venues).

I think I at least need the indices of the 20 largest values in importances, but I don't know:

  1. How to get the indices of the top 20 values in importances.

  2. How to map those indices back to the real venue names ('school', 'home', ...), given that I used DictVectorizer and TfidfTransformer.

Any ideas? Thank you very much!

asked May 20 '15 16:05 by gladys0313




1 Answer

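To get the importance for each feature name, iterate over the column names and feature_importances_ together (the two are in the same order). This assumes the training data is a pandas DataFrame named df. In the question, where the matrix is built by a DictVectorizer, the names would come from feat_loc_vectorizer.get_feature_names_out() instead (get_feature_names() on scikit-learn versions before 1.0):

for feat, importance in zip(df.columns, clf.feature_importances_):
    print('feature: {f}, importance: {i}'.format(f=feat, i=importance))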
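
For the two specific points in the question (indices of the top 20 values, and mapping them back to venue names), a minimal sketch might look like the following. The helper name top_k_features and the toy data are hypothetical and not part of the original post. The sketch relies on the fact that TfidfTransformer only reweights columns and row filtering does not touch them, so column i of the final matrix should still correspond to entry i of the DictVectorizer's feature names.

```python
import numpy as np


def top_k_features(importances, feature_names, k=20):
    """Return (name, importance) pairs for the k largest importances, largest first.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    importances = np.asarray(importances, dtype=float)
    feature_names = np.asarray(feature_names)
    if importances.shape[0] != feature_names.shape[0]:
        raise ValueError("importances and feature_names must have the same length")
    k = min(k, importances.shape[0])
    # np.argsort sorts ascending, so reverse it and keep the first k indices.
    top_idx = np.argsort(importances)[::-1][:k]
    return [(str(feature_names[i]), float(importances[i])) for i in top_idx]


# Toy data standing in for clf.feature_importances_ and the vectorizer's names.
toy_names = ["bus station", "home", "hospital", "school"]
toy_importances = [0.10, 0.40, 0.20, 0.30]

top2 = top_k_features(toy_importances, toy_names, k=2)
for name, score in top2:
    print(f"{name}: {score:.2f}")
```

With the question's variables, the call would presumably be top_k_features(clf.feature_importances_, feat_loc_vectorizer.get_feature_names_out(), k=20), using get_feature_names() on scikit-learn versions before 1.0. Note that clf is refit in every cross-validation fold, so feature_importances_ reflects only the most recent fit.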
answered Oct 06 '22 01:10 by Jared Wilber