I have code that runs a basic TF-IDF vectorizer on a collection of documents, returning a D × F sparse matrix, where D is the number of documents and F is the number of terms. No problem there.
But how do I find the TF-IDF score of a specific term in a given document? That is, is there some sort of dictionary mapping terms (in their textual representation) to their column positions in the resulting sparse matrix?
The formula used to compute the tf-idf for a term t of a document d in a document set is tf-idf(t, d) = tf(t, d) * idf(t), and the idf is computed as idf(t) = log[n / df(t)] + 1 (if smooth_idf=False), where n is the total number of documents in the document set and df(t) is the document frequency of t, i.e. the number of documents that contain t.
The TF-IDF of a term is calculated by multiplying its TF and IDF scores. Put plainly, the importance of a term is high when it occurs often in a given document and rarely in others. In short, commonality within a document (measured by TF) is balanced against rarity across documents (measured by IDF).
TF-IDF (term frequency - inverse document frequency) is a bag-of-words model that is effective at capturing the most important words in a text. It is best understood through its two components: the term frequency (TF) and the inverse document frequency (IDF).
In Python, TF-IDF values can be computed with the TfidfVectorizer class from the sklearn (scikit-learn) module.
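To make the formula above concrete, here is a minimal sketch that computes tf-idf by hand and checks it against TfidfVectorizer with smoothing and normalization switched off (the three toy documents are made up for illustration):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the car is fast", "the car is red", "the sky is blue"]  # toy corpus

counts = CountVectorizer().fit_transform(docs).toarray()  # tf(t, d): raw term counts
n = counts.shape[0]                                       # number of documents
df = (counts > 0).sum(axis=0)                             # df(t): documents containing t
idf = np.log(n / df) + 1                                  # idf(t) = log[n / df(t)] + 1
manual = counts * idf                                     # tf-idf(t, d) = tf(t, d) * idf(t)

sk = TfidfVectorizer(smooth_idf=False, norm=None).fit_transform(docs).toarray()
print(np.allclose(manual, sk))                            # True: both agree

Note that TfidfVectorizer's defaults (smooth_idf=True, norm='l2') additionally smooth the idf and L2-normalize each row, so the values it produces out of the box will differ from this raw formula.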
Yes. See the .vocabulary_ attribute on your fitted/transformed TF-IDF vectorizer.
In [1]: from sklearn.datasets import fetch_20newsgroups
In [2]: data = fetch_20newsgroups(categories=['rec.autos'])   # sample corpus
In [3]: from sklearn.feature_extraction.text import TfidfVectorizer
In [4]: cv = TfidfVectorizer()
In [5]: X = cv.fit_transform(data.data)   # D x F sparse tf-idf matrix
In [6]: cv.vocabulary_
It is a dictionary of the form:
{term : column index in the tf-idf matrix}
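So, to get the score of a specific term in a specific document, combine the document's row index with the column index from .vocabulary_. Continuing the session above ('car' and document 0 are just example choices):

In [7]: col = cv.vocabulary_['car']   # column index for the example term 'car'
In [8]: X[0, col]                     # its tf-idf score in the first document (0.0 if absent)

Note that .vocabulary_ is a plain dict, so looking up a term that was not seen during fitting raises a KeyError.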