 

Scikit Learn TfidfVectorizer : How to get top n terms with highest tf-idf score

I am working on a keyword extraction problem. Consider the very general case:

from sklearn.feature_extraction.text import TfidfVectorizer

# tokenize is a user-defined tokenizer function (not shown)
tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')

t = """Two Travellers, walking in the noonday sun, sought the shade of a widespreading tree to rest. As they lay looking up among the pleasant leaves, they saw that it was a Plane Tree.

"How useless is the Plane!" said one of them. "It bears no fruit whatever, and only serves to litter the ground with leaves."

"Ungrateful creatures!" said a voice from the Plane Tree. "You lie here in my cooling shade, and yet you say I am useless! Thus ungratefully, O Jupiter, do men receive their blessings!"

Our best blessings are often the least appreciated."""

tfs = tfidf.fit_transform(t.split(" "))

str = 'tree cat travellers fruit jupiter'
response = tfidf.transform([str])
feature_names = tfidf.get_feature_names()

for col in response.nonzero()[1]:
    print(feature_names[col], ' - ', response[0, col])

and this gives me:

  (0, 28)    0.443509712811
  (0, 27)    0.517461475101
  (0, 8)     0.517461475101
  (0, 6)     0.517461475101
tree  -  0.443509712811
travellers  -  0.517461475101
jupiter  -  0.517461475101
fruit  -  0.517461475101

which is good. For any new document that comes in, is there a way to get the top n terms with the highest tf-idf score?

asked Dec 11 '15 by AbtPst

People also ask

How do I get IDF values from TfidfVectorizer?

You can just use TfidfVectorizer with use_idf=True (the default) and then extract the values from the fitted idf_ attribute. To get the IDF value for a particular term, for example "not", use the vocabulary_ attribute, which gives you the mapping between each word and its feature index: idf_[vocabulary_["not"]] is IDF("not").
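As a small sketch of the above (the three sample documents are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["this is not a test", "this is a test", "not again"]

vec = TfidfVectorizer(use_idf=True)  # use_idf=True is the default
vec.fit(docs)

# vocabulary_ maps each term to its column index in the tf-idf matrix
idx = vec.vocabulary_["not"]
# idf_ holds one (smoothed) IDF value per column
print(vec.idf_[idx])
```

Note that scikit-learn's idf_ values are smoothed by default (smooth_idf=True), so they differ slightly from the plain textbook log(N/df) formula.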

How do I get my TF-IDF score?

As its name implies, TF-IDF vectorizes/scores a word by multiplying the word's Term Frequency (TF) with the Inverse Document Frequency (IDF). Term Frequency: TF of a term or word is the number of times the term appears in a document compared to the total number of words in the document.

What is the difference between TfidfVectorizer and Tfidftransformer?

Tfidftransformer and Tfidfvectorizer aim to do the same thing, which is to convert a collection of raw documents to a matrix of TF-IDF features. The difference is that with Tfidftransformer you compute the word counts yourself first (typically with CountVectorizer), then generate IDF values and TF-IDF scores from those counts, whereas Tfidfvectorizer does all of that in one step on the raw documents.
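With default settings the two routes produce the same matrix, which this sketch checks (the documents are made up):

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

docs = ["the plane tree gives shade",
        "travellers rest in the shade",
        "the tree bears no fruit"]

# Two-step route: raw counts first, then tf-idf weighting
counts = CountVectorizer().fit_transform(docs)
X_two_step = TfidfTransformer().fit_transform(counts)

# One-step route: TfidfVectorizer does both in a single call
X_one_step = TfidfVectorizer().fit_transform(docs)

print(np.allclose(X_two_step.toarray(), X_one_step.toarray()))  # True
```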

What is the difference between CountVectorizer and TfidfVectorizer?

TF-IDF is better than Count Vectorizers because it not only focuses on the frequency of words present in the corpus but also provides the importance of the words. We can then remove the words that are less important for analysis, hence making the model building less complex by reducing the input dimensions.
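The downweighting of ubiquitous words is easy to see side by side. In this sketch (documents made up), "the" and "fruit" each occur once in the second document, so their raw counts are equal, but tf-idf gives the corpus-wide word "the" a lower weight:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the tree gives shade",
        "the tree bears fruit",
        "the travellers rest"]

cv = CountVectorizer().fit(docs)
tv = TfidfVectorizer().fit(docs)

counts = cv.transform(docs).toarray()
weights = tv.transform(docs).toarray()

# In doc 1 ("the tree bears fruit"), raw counts treat both words equally...
print(counts[1][cv.vocabulary_["the"]], counts[1][cv.vocabulary_["fruit"]])
# ...but tf-idf downweights "the", which appears in every document
print(weights[1][tv.vocabulary_["the"]] < weights[1][tv.vocabulary_["fruit"]])
```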


1 Answer

You have to do a little bit of a song and dance to get the matrices as numpy arrays instead, but this should do what you're looking for:

import numpy as np

feature_array = np.array(tfidf.get_feature_names())
tfidf_sorting = np.argsort(response.toarray()).flatten()[::-1]

n = 3
top_n = feature_array[tfidf_sorting][:n]

This gives me:

array([u'fruit', u'travellers', u'jupiter'],
      dtype='<U13')

The argsort call is really the useful one; see the numpy docs for it. We have to do [::-1] because argsort only sorts small to large. We call flatten to reduce the dimensions to 1d so that the sorted indices can be used to index the 1d feature array. Note that including the call to flatten will only work if you're testing one document at a time.

Also, on another note, did you mean something like tfs = tfidf.fit_transform(t.split("\n\n"))? Otherwise, each whitespace-separated term in the multiline string is being treated as a "document". Splitting on \n\n instead means we are actually looking at 4 documents (one per paragraph), which makes more sense when you think about tf-idf.

answered Sep 30 '22 by hume