show feature names after feature selection

I need to build a text classifier, and I'm currently using TfidfVectorizer and SelectKBest to select the features, as follows:

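from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# charset_error was renamed to decode_error in later scikit-learn releases
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english', decode_error='strict')

X_train_features = vectorizer.fit_transform(data_train.data)
y_train_labels = data_train.target

# Keep the 1000 features with the highest chi-squared scores
ch2 = SelectKBest(chi2, k=1000)
X_train_features = ch2.fit_transform(X_train_features, y_train_labels)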

After selecting the k best features, I want to print out the names (the text) of the selected features. Is there any way to do that? I only need the selected feature names. Should I use CountVectorizer instead?

user1687717 asked Jan 03 '13 05:01



2 Answers

The following should work:

# On scikit-learn >= 1.0, use get_feature_names_out() instead of get_feature_names()
np.asarray(vectorizer.get_feature_names())[ch2.get_support()]
ogrisel answered Oct 18 '22 15:10


To expand on @ogrisel's answer, the returned list of features is in the same order as the vectorizer's columns, not ranked by score. The code below gives you the top-ranked features sorted by their chi-squared scores in descending order, along with the corresponding p-values:

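# Pair each column index with its chi-squared score, sort by score, keep the top 1000
top_ranked_features = sorted(enumerate(ch2.scores_), key=lambda x: x[1], reverse=True)[:1000]
top_ranked_features_indices = [index for index, score in top_ranked_features]
# On scikit-learn >= 1.0, use get_feature_names_out() instead of get_feature_names()
feature_names = np.asarray(vectorizer.get_feature_names())
for feature_pvalue in zip(feature_names[top_ranked_features_indices], ch2.pvalues_[top_ranked_features_indices]):
    print(feature_pvalue)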
Moses Xu answered Oct 18 '22 13:10