Sorting TfidfVectorizer output by tf-idf (lowest to highest and vice versa)

I'm using TfidfVectorizer() from sklearn on part of my text data to get a sense of the tf-idf weight of each feature (word). My current code is the following:

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(analyzer='word', stop_words = 'english')

# fit_transform on training data
X_traintfidf = tfidf.fit_transform(X_train)

I want to sort the terms in 'X_traintfidf' by their tf-idf values, both from lowest to highest and from highest to lowest, keep (say) the top 10 of each ordering, and store the two rankings as two Series objects. How should I proceed from the last line of my code?

Thank you.

I was reading a similar thread but couldn't figure out how to do it. Maybe someone will be able to connect the tips shown in that thread to my question here.

Chris T. asked Aug 21 '17 21:08


1 Answer

After fit_transform(), the fitted vocabulary is available through the get_feature_names() method (renamed get_feature_names_out() in scikit-learn 1.0; the old name was removed in 1.2). You can do this:

import pandas as pd

# on scikit-learn < 1.0, call tfidf.get_feature_names() instead
terms = tfidf.get_feature_names_out()

# sum the tf-idf weight of each term over all documents
sums = X_traintfidf.sum(axis=0)

# pair each term with its summed weight
data = []
for col, term in enumerate(terms):
    data.append((term, sums[0, col]))

ranking = pd.DataFrame(data, columns=['term', 'rank'])
print(ranking.sort_values('rank', ascending=False))
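
Since the question asks for two Series holding the top 10 in each direction, the same idea can be sketched with a pandas Series instead of a DataFrame. This is only an illustration under some assumptions: the small corpus standing in for X_train is made up, and the names scores, top10_highest and top10_lowest are not from the original code. It also assumes that summing tf-idf over documents is the ranking you want.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for X_train (not from the question)
X_train = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are popular pets",
    "a bird sang in the old tree",
]

tfidf = TfidfVectorizer(analyzer='word', stop_words='english')
X_traintfidf = tfidf.fit_transform(X_train)

# get_feature_names_out() exists from scikit-learn 1.0 on; fall back for older versions
if hasattr(tfidf, "get_feature_names_out"):
    terms = tfidf.get_feature_names_out()
else:
    terms = tfidf.get_feature_names()

# The column sums come back as a 1 x n_terms matrix; flatten them to a 1-D array
sums = np.asarray(X_traintfidf.sum(axis=0)).ravel()

# One Series indexed by term, holding the summed tf-idf weight
scores = pd.Series(sums, index=terms, name='tfidf_sum')

# Two Series: the 10 highest (descending) and the 10 lowest (ascending)
top10_highest = scores.nlargest(10)
top10_lowest = scores.nsmallest(10)

print(top10_highest)
print(top10_lowest)
```

If the corpus has fewer than 10 terms, nlargest(10) and nsmallest(10) simply return all of them. For a full ordering rather than a top 10, scores.sort_values(ascending=False) or scores.sort_values() does the same job.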

Adelson Araújo answered Sep 20 '22 11:09