
Find the tf-idf score of specific words in documents using sklearn

I have code that runs a basic TF-IDF vectorizer on a collection of documents and returns a sparse D x F matrix, where D is the number of documents and F is the number of terms. No problem.

But how do I find the TF-IDF score of a specific term in a given document? That is, is there some sort of dictionary mapping terms (in their textual representation) to their column position in the resulting sparse matrix?

asked Jun 22 '15 09:06 by WhiteTiger


People also ask

How TF-IDF is calculated in Sklearn?

The formula that is used to compute the tf-idf for a term t of a document d in a document set is tf-idf(t, d) = tf(t, d) * idf(t), and the idf is computed as idf(t) = log [ n / df(t) ] + 1 (if smooth_idf=False ), where n is the total number of documents in the document set and df(t) is the document frequency of t; the ...
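As a rough illustration of the idf term quoted above, here is a minimal pure-Python sketch. The helper name `idf` and the toy counts are made up for this example. The smoothed variant, log [ (1 + n) / (1 + df(t)) ] + 1, is what scikit-learn applies when smooth_idf=True (its default), and the logarithm is the natural log.

```python
import math

def idf(n_docs, doc_freq, smooth=True):
    """Inverse document frequency in the form scikit-learn documents.

    `idf` is a hypothetical helper for illustration, not a library function.
    """
    if smooth:
        # smooth_idf=True: behaves as if one extra document contained every term
        return math.log((1 + n_docs) / (1 + doc_freq)) + 1
    # smooth_idf=False: the formula quoted in the text
    return math.log(n_docs / doc_freq) + 1

n = 3  # toy corpus of three documents (an assumption for the example)

# A term that appears in every document gets the minimum idf of 1.0
print(idf(n, 3, smooth=False))  # 1.0

# Rarer terms get a larger idf
print(idf(n, 1, smooth=False) > idf(n, 2, smooth=False))  # True
```

Note that TfidfVectorizer also L2-normalizes each document row by default (norm='l2'), so the values in the resulting matrix will generally not equal tf * idf computed this way.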

How do I find my TF-IDF value?

The TF-IDF of a term is calculated by multiplying its TF and IDF scores. In plain English, the importance of a term is high when it occurs often in a given document and rarely in the others. In short, how common a term is within a document (measured by TF) is balanced by how rare it is across documents (measured by IDF).

Is TF-IDF bag of words?

Tf-Idf for text representation. Tf-Idf (term frequency, inverse document frequency) is a bag-of-words weighting scheme that highlights the words most distinctive of each text. The concept behind Tf-Idf can be understood through its two parts: term frequency (Tf) and inverse document frequency (Idf).

How is TF-IDF calculated in python?

In Python, tf-idf values can be computed using the TfidfVectorizer class in the sklearn.feature_extraction.text module.


1 Answer

Yes. See the .vocabulary_ attribute on your fitted TF-IDF vectorizer.

In [1]: from sklearn.datasets import fetch_20newsgroups

In [2]: data = fetch_20newsgroups(categories=['rec.autos'])

In [3]: from sklearn.feature_extraction.text import TfidfVectorizer

In [4]: cv = TfidfVectorizer()

In [5]: X = cv.fit_transform(data.data)

In [6]: cv.vocabulary_

It is a dictionary of the form:

{word : column index in array}
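To get the score of one term in one document, you can combine that dictionary with ordinary indexing into the sparse matrix. Below is a minimal sketch of that idea. It uses a small made-up corpus instead of the 20 newsgroups data so that it runs without a download, and the helper name `tfidf_score` is hypothetical. It also assumes the vectorizer's default preprocessing, which lowercases text, so the lookup term should be lowercase too.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, made up for this example
docs = [
    "the car engine is loud",
    "the engine needs new oil",
    "bikes have no engine",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix, shape (n_docs, n_terms)

def tfidf_score(term, doc_index):
    """Return the TF-IDF score of `term` in document `doc_index`.

    Hypothetical helper: returns 0.0 when the term is not in the
    fitted vocabulary at all.
    """
    col = vectorizer.vocabulary_.get(term)  # column index, or None
    if col is None:
        return 0.0
    return float(X[doc_index, col])

print(tfidf_score("car", 0))    # positive: "car" occurs in document 0
print(tfidf_score("car", 1))    # 0.0: "car" does not occur in document 1
print(tfidf_score("tiger", 0))  # 0.0: "tiger" is not in the vocabulary
```

Using .get() rather than direct indexing avoids a KeyError for words the vectorizer never saw. Whether returning 0.0 in that case is appropriate depends on your use; raising an error may be preferable if an unknown term indicates a bug.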

answered Sep 30 '22 15:09 by Ryan