 

Scikit Learn TfidfVectorizer : How to get top n terms with highest tf-idf score

I am working on a keyword extraction problem. Consider the very general case:

from sklearn.feature_extraction.text import TfidfVectorizer

# tokenize is a user-defined tokenizer function (not shown)
tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')

t = """Two Travellers, walking in the noonday sun, sought the shade of a widespreading tree to rest. As they lay looking up among the pleasant leaves, they saw that it was a Plane Tree.

"How useless is the Plane!" said one of them. "It bears no fruit whatever, and only serves to litter the ground with leaves."

"Ungrateful creatures!" said a voice from the Plane Tree. "You lie here in my cooling shade, and yet you say I am useless! Thus ungratefully, O Jupiter, do men receive their blessings!"

Our best blessings are often the least appreciated."""

tfs = tfidf.fit_transform(t.split(" "))

str = 'tree cat travellers fruit jupiter'
response = tfidf.transform([str])
feature_names = tfidf.get_feature_names()

for col in response.nonzero()[1]:
    print(feature_names[col], ' - ', response[0, col])

and this gives me:

  (0, 28)    0.443509712811
  (0, 27)    0.517461475101
  (0, 8)     0.517461475101
  (0, 6)     0.517461475101
tree  -  0.443509712811
travellers  -  0.517461475101
jupiter  -  0.517461475101
fruit  -  0.517461475101

which is good. For any new document that comes in, is there a way to get the top n terms with the highest tf-idf score?

asked Dec 11 '15 by AbtPst

People also ask

How do I get IDF values from TfidfVectorizer?

You can just use TfidfVectorizer with use_idf=True (the default) and then extract the values from the fitted idf_ attribute. To get the IDF value for a particular term, for example "not", use the vocabulary_ attribute, which gives you the mapping between each word and its feature index: idf_[vocabulary_["not"]] is IDF("not").
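As a small sketch of the above (the three sample documents are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["this is not a test", "this is a test", "not again"]

vec = TfidfVectorizer(use_idf=True)  # use_idf=True is the default
vec.fit(docs)

# vocabulary_ maps each term to its column index in the tf-idf matrix
idx = vec.vocabulary_["not"]
# idf_ holds one (smoothed) IDF value per column
print(vec.idf_[idx])
```

Note that scikit-learn's idf_ values are smoothed by default (smooth_idf=True), so they differ slightly from the plain textbook log(N/df) formula.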

How do I get my TF-IDF score?

As its name implies, TF-IDF vectorizes/scores a word by multiplying the word's Term Frequency (TF) with the Inverse Document Frequency (IDF). Term Frequency: TF of a term or word is the number of times the term appears in a document compared to the total number of words in the document.

What is the difference between TfidfVectorizer and Tfidftransformer?

Tfidftransformer and Tfidfvectorizer aim to do the same thing, which is to convert a collection of raw documents to a matrix of TF-IDF features. The difference is that with Tfidftransformer you compute the word counts yourself first (typically with CountVectorizer), then generate IDF values and TF-IDF scores from those counts, whereas Tfidfvectorizer does all of that in one step on the raw documents.
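With default settings the two routes produce the same matrix, which this sketch checks (the documents are made up):

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

docs = ["the plane tree gives shade",
        "travellers rest in the shade",
        "the tree bears no fruit"]

# Two-step route: raw counts first, then tf-idf weighting
counts = CountVectorizer().fit_transform(docs)
X_two_step = TfidfTransformer().fit_transform(counts)

# One-step route: TfidfVectorizer does both in a single call
X_one_step = TfidfVectorizer().fit_transform(docs)

print(np.allclose(X_two_step.toarray(), X_one_step.toarray()))  # True
```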

What is the difference between CountVectorizer and TfidfVectorizer?

TF-IDF is better than Count Vectorizers because it not only focuses on the frequency of words present in the corpus but also provides the importance of the words. We can then remove the words that are less important for analysis, hence making the model building less complex by reducing the input dimensions.
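The downweighting of ubiquitous words is easy to see side by side. In this sketch (documents made up), "the" and "fruit" each occur once in the second document, so their raw counts are equal, but tf-idf gives the corpus-wide word "the" a lower weight:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the tree gives shade",
        "the tree bears fruit",
        "the travellers rest"]

cv = CountVectorizer().fit(docs)
tv = TfidfVectorizer().fit(docs)

counts = cv.transform(docs).toarray()
weights = tv.transform(docs).toarray()

# In doc 1 ("the tree bears fruit"), raw counts treat both words equally...
print(counts[1][cv.vocabulary_["the"]], counts[1][cv.vocabulary_["fruit"]])
# ...but tf-idf downweights "the", which appears in every document
print(weights[1][tv.vocabulary_["the"]] < weights[1][tv.vocabulary_["fruit"]])
```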


1 Answer

You have to do a little bit of a song and dance to get the matrices as numpy arrays instead, but this should do what you're looking for:

import numpy as np

feature_array = np.array(tfidf.get_feature_names())
tfidf_sorting = np.argsort(response.toarray()).flatten()[::-1]

n = 3
top_n = feature_array[tfidf_sorting][:n]

This gives me:

array([u'fruit', u'travellers', u'jupiter'],
      dtype='<U13')

The argsort call is really the useful one; see the numpy docs for it. We have to do [::-1] because argsort only sorts small to large. We call flatten to reduce the dimensions to 1d so that the sorted indices can be used to index the 1d feature array. Note that including the call to flatten will only work if you're testing one document at a time.

Also, on another note, did you mean something like tfs = tfidf.fit_transform(t.split("\n\n"))? Otherwise, each whitespace-separated term in the multiline string is being treated as a "document". Splitting on \n\n instead means we are actually looking at 4 documents (one per paragraph), which makes more sense when you think about tf-idf.

answered Sep 30 '22 by hume