I have created a tf-idf matrix, and now I want to retrieve the top 2 words for each document. I want to pass a document id and get back its top 2 words.
Right now, I have this sample data:
from sklearn.feature_extraction.text import TfidfVectorizer
d = {'doc1':"this is the first document",'doc2':"it is a sunny day"} ### corpus
test_v = TfidfVectorizer(min_df=1) ### applied the model
t = test_v.fit_transform(d.values())
feature_names = test_v.get_feature_names() ### list of words/terms (use get_feature_names_out() on scikit-learn >= 1.0; get_feature_names() was removed in 1.2)
>>> feature_names
['day', 'document', 'first', 'is', 'it', 'sunny', 'the', 'this']
>>> t.toarray()
array([[ 0.        ,  0.47107781,  0.47107781,  0.33517574,  0.        ,
         0.        ,  0.47107781,  0.47107781],
       [ 0.53404633,  0.        ,  0.        ,  0.37997836,  0.53404633,
         0.53404633,  0.        ,  0.        ]])
I can access the matrix by giving the row and column numbers, e.g.
>>> t[0,1]
0.47107781233161794
Is there a way to access this matrix by document id? In my case, 'doc1' and 'doc2'.
Thanks
The TF-IDF of a term is calculated by multiplying TF and IDF scores. Translated into plain English, importance of a term is high when it occurs a lot in a given document and rarely in others. In short, commonality within a document measured by TF is balanced by rarity between documents measured by IDF.
The next step is to compute the tf-idf value for a given document in our test set by invoking tfidf_transformer.transform(...). This generates a vector of tf-idf scores. Next, we sort the words in the vector in descending order of tf-idf values and then iterate over them to extract the top-n keywords.
The formula used to compute the tf-idf for a term t of a document d in a document set is tf-idf(t, d) = tf(t, d) * idf(t), and the idf is computed as idf(t) = log [ n / df(t) ] + 1 (if smooth_idf=False), where n is the total number of documents in the document set and df(t) is the document frequency of t; the ...
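As a rough worked example of the formula above, here is a minimal sketch that applies it by hand to doc1 from the question. The helper name idf and its smooth flag are hypothetical, not part of scikit-learn. The smoothed branch and the L2 normalisation step are assumptions based on my understanding of the TfidfVectorizer defaults (smooth_idf=True, norm='l2'); they are not stated in the text above, but they do appear to reproduce the numbers printed in the question.

```python
import math

def idf(n_docs, doc_freq, smooth=False):
    """idf(t) = ln(n / df(t)) + 1, as quoted above (smooth=False).

    smooth=True is an assumed variant that adds one to the numerator
    and the denominator before taking the logarithm.
    """
    if smooth:
        return math.log((1 + n_docs) / (1 + doc_freq)) + 1
    return math.log(n_docs / doc_freq) + 1

# doc1 = "this is the first document": every term occurs once, so tf = 1.
# In the two-document corpus, "is" occurs in both documents (df = 2) and
# the other four terms occur in one document only (df = 1).
doc_freq = {'document': 1, 'first': 1, 'is': 2, 'the': 1, 'this': 1}

# tf-idf with the unsmoothed formula from the text.
unsmoothed = {term: 1 * idf(2, df) for term, df in doc_freq.items()}

# tf-idf with the assumed defaults: smoothed idf, then scale the row to
# unit Euclidean length.
smoothed = {term: 1 * idf(2, df, smooth=True) for term, df in doc_freq.items()}
norm = math.sqrt(sum(value * value for value in smoothed.values()))
normalised = {term: value / norm for term, value in smoothed.items()}

print(unsmoothed)
print(normalised)
```

The term "is" gets the lowest score in both variants because it appears in every document, which is the "rarity between documents" effect described earlier.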
get_feature_names(). This returns the feature names selected (terms selected) from the raw documents. You can also use tfidf_vectorizer.
By doing
t = test_v.fit_transform(d.values())
you lose any link to the document ids. The resulting matrix is indexed only by row position, and on Python versions before 3.7 a dict does not guarantee any particular order, so you cannot tell which row belongs to which document. This means that before passing the values to the fit_transform function you need to record which value corresponds to which id.
For example what you can do is:
counter = 0
values = []
key = {}  # maps document id -> row index in the tf-idf matrix
for k, v in d.items():
    values.append(v)
    key[k] = counter
    counter += 1
t = test_v.fit_transform(values)
From there you can build a function to access this matrix by document id:
def get_doc_row(docid):
    rowid = key[docid]   # look up the row index recorded for this id
    row = t[rowid,:]     # sparse 1 x n_features row of tf-idf scores
    return row
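To get from a row to the top 2 words asked for in the question, one option is to pair each score with its feature name, sort in descending order and keep the first n. The sketch below is only an illustration: top_n_words is a hypothetical helper, and the scores are copied from the question's output as plain lists so that the example runs without scikit-learn.

```python
# Values copied from the question's output. In real code they would come
# from the vectorizer's feature names and from t.toarray().
feature_names = ['day', 'document', 'first', 'is', 'it', 'sunny', 'the', 'this']
dense = [
    [0.0, 0.47107781, 0.47107781, 0.33517574, 0.0, 0.0, 0.47107781, 0.47107781],
    [0.53404633, 0.0, 0.0, 0.37997836, 0.53404633, 0.53404633, 0.0, 0.0],
]
key = {'doc1': 0, 'doc2': 1}  # document id -> row index, built as shown above

def top_n_words(docid, n=2):
    """Return the n highest-scoring words for the given document id."""
    row = dense[key[docid]]
    # Drop words that do not occur in the document (score 0).
    scored = [(word, score) for word, score in zip(feature_names, row) if score > 0]
    # The sort is stable, so words with equal scores stay in vocabulary order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in scored[:n]]

print(top_n_words('doc1'))
print(top_n_words('doc2'))
```

With the sparse matrix t from the answer, the dense row for one document could be obtained with t[rowid].toarray().ravel() instead of the hard-coded lists. Note that in such short documents several words tie for the top score, so which two are returned depends on the tie-breaking rule (here, vocabulary order).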