I'm using TfidfVectorizer() from sklearn on part of my text data to get a sense of term-frequency for each feature (word). My current code is the following
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(analyzer='word', stop_words = 'english')
# fit_transform on training data
X_traintfidf = tfidf.fit_transform(X_train)
If I want to sort the tf-idf values of each term in 'X_traintfidf' from lowest to highest (and vice versa), keep, say, the top 10 of each, and turn these two rankings into two Series objects, how should I proceed from the last line of my code?
Thank you.
I was reading a similar thread but couldn't figure out how to do it. Maybe someone will be able to connect the tips shown in that thread to my question here.
After fit_transform(), you'll have access to the fitted vocabulary through the get_feature_names() method (named get_feature_names_out() since scikit-learn 1.0; the old name was removed in 1.2). You can do this:
import pandas as pd

# get_feature_names() was removed in scikit-learn 1.2; on older versions use that name instead
terms = tfidf.get_feature_names_out()
# sum the tf-idf weights of each term across all documents
sums = X_traintfidf.sum(axis=0)
# pair each term with its summed tf-idf weight
data = []
for col, term in enumerate(terms):
    data.append((term, sums[0, col]))
ranking = pd.DataFrame(data, columns=['term', 'rank'])
print(ranking.sort_values('rank', ascending=False))