I am using Python's sklearn random forest (ensemble.RandomForestClassifier) to do classification, and I am using feature_importances_ to find significant features for the classifier. Now my code is:
from collections import Counter

import numpy as np
from sklearn import ensemble
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import LabelEncoder

# database, labels_raw, user_ids_raw and LeavePUsersOut are defined elsewhere in my project
venue_feature_start = []
for trip in database:
    venue_feature_start.append(Counter(trip['POI']))
    # Counter(trip['POI']) is like Counter({'school':1, 'hospital':1, 'bus station':2}), actually key is the feature

feat_loc_vectorizer = DictVectorizer()
feat_loc_vectorizer.fit(venue_feature_start)
feat_loc_orig_mat = feat_loc_vectorizer.transform(venue_feature_start)

orig_tfidf = TfidfTransformer()
orig_ven_feat = orig_tfidf.fit_transform(feat_loc_orig_mat.tocsr())
# so DictVectorizer() and TfidfTransformer() help me to parse the features; for each instance the feature dimension is 580, which means that there are 580 venue types

data = orig_ven_feat.tocsr()

le = LabelEncoder()
labels = le.fit_transform(labels_raw)
if "Unlabelled" in labels_raw:
    unlabelled_int = int(le.transform(["Unlabelled"]))
else:
    unlabelled_int = -1

valid_rows_idx = np.where(labels != unlabelled_int)[0]
labels = labels[valid_rows_idx]
user_ids = np.asarray(user_ids_raw)
# user_ids is for cross validation, labels is for classification

clf = ensemble.RandomForestClassifier(n_estimators=50)
cv_indices = LeavePUsersOut(user_ids[valid_rows_idx], n_folds=10)
data = data[valid_rows_idx, :].toarray()

for train_ind, test_ind in cv_indices:
    train_data = data[train_ind, :]
    test_data = data[test_ind, :]
    labels_train = labels[train_ind]
    labels_test = labels[test_ind]

    print("Training classifier...")
    clf.fit(train_data, labels_train)

importances = clf.feature_importances_
Now the problem is that I get an array of dimension 580 (the same as the feature dimension) when I use feature_importances_, and I want to know the top 20 important features (the top 20 important venues).
I think at least what I need are the indices of the 20 biggest numbers in importances, but I don't know:
How to get the indices of the top 20 values from importances
Since I used DictVectorizer and TfidfTransformer, how to match those indices with the real venue names ('school', 'home', ...)
Any idea to help me? Thank you very much!
Feature importance is calculated as the decrease in node impurity weighted by the probability of reaching that node. The node probability can be calculated as the number of samples that reach the node divided by the total number of samples. The higher the value, the more important the feature.
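For intuition, here is a minimal sketch of that calculation for a single fitted tree, using the tree_ attributes that scikit-learn exposes (the forest's feature_importances_ is essentially this, averaged over all trees in the ensemble):

import numpy as np

def impurity_importances(fitted_tree):
    # Recompute impurity-based importances for one fitted decision tree.
    t = fitted_tree.tree_
    total_weight = t.weighted_n_node_samples[0]   # weighted samples reaching the root
    importances = np.zeros_like(fitted_tree.feature_importances_)

    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:                            # leaf: no split, no impurity decrease
            continue
        # probability of reaching a node = weighted samples at the node / total samples
        p_node = t.weighted_n_node_samples[node] / total_weight
        p_left = t.weighted_n_node_samples[left] / total_weight
        p_right = t.weighted_n_node_samples[right] / total_weight
        # impurity decrease of this split, weighted by the probability of reaching the node
        importances[t.feature[node]] += (p_node * t.impurity[node]
                                         - p_left * t.impurity[left]
                                         - p_right * t.impurity[right])

    return importances / importances.sum() if importances.sum() > 0 else importances

For a tree taken from your forest, impurity_importances(clf.estimators_[0]) should match clf.estimators_[0].feature_importances_.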
Feature importance refers to techniques that calculate a score for all the input features of a given model; the scores simply represent the "importance" of each feature. A higher score means that the specific feature has a larger effect on the model being used to predict the target variable.
You can also calculate feature importance with measures such as permutation importance or SHAP values. For each model with computed feature importance, get the ranking of the features, compute the median rank of each feature by aggregating its ranks across all models, and sort the aggregated list by that median rank.
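As a rough sketch of that idea, the snippet below uses scikit-learn's permutation_importance (available in sklearn.inspection since version 0.22), computed per cross-validation fold and aggregated by median rank. It assumes data, labels, clf and cv_indices from your question, and that cv_indices can be iterated to give (train, test) index pairs:

import numpy as np
from scipy.stats import rankdata
from sklearn.inspection import permutation_importance

fold_ranks = []
for train_ind, test_ind in cv_indices:
    clf.fit(data[train_ind, :], labels[train_ind])
    # permutation importance: drop in test score when one feature's values are shuffled
    result = permutation_importance(clf, data[test_ind, :], labels[test_ind],
                                    n_repeats=5, random_state=0)
    # rank 1 = most important feature in this fold
    fold_ranks.append(rankdata(-result.importances_mean))

# median rank of each feature across all folds; a lower median rank means more important
median_rank = np.median(np.vstack(fold_ranks), axis=0)
top20_idx = np.argsort(median_rank)[:20]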
To get the importance for each feature name, just iterate through the column names and feature_importances_ together (they map to each other):
for feat, importance in zip(df.columns, clf.feature_importances_):
    print('feature: {f}, importance: {i}'.format(f=feat, i=importance))
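In your case there is no DataFrame, but the same mapping exists: DictVectorizer keeps the venue names in the same order as the columns of the matrix it produces. A minimal sketch under that assumption, using your fitted feat_loc_vectorizer and clf, with np.argsort to pick the 20 largest importances (get_feature_names_out requires scikit-learn >= 1.0; on older versions use get_feature_names() instead):

import numpy as np

# venue names in the same column order as the vectorized feature matrix
feature_names = np.asarray(feat_loc_vectorizer.get_feature_names_out())

importances = clf.feature_importances_

# indices of the 20 largest importances, biggest first
top20_idx = np.argsort(importances)[::-1][:20]

for idx in top20_idx:
    print('venue: {f}, importance: {i:.4f}'.format(f=feature_names[idx], i=importances[idx]))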