 

Using Sklearn's TfidfVectorizer transform

I am trying to get the tf-idf vector for a single document using Sklearn's TfidfVectorizer object. I create a vocabulary based on some training documents and use fit_transform to train the TfidfVectorizer. Then, I want to find the tf-idf vectors for any given testing document.

from sklearn.feature_extraction.text import TfidfVectorizer

self.vocabulary = "a list of words I want to look for in the documents".split()
self.vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word',
                            stop_words='english')
self.vect.fit_transform(self.vocabulary)

...

doc = "some string I want to get tf-idf vector for"
tfidf = self.vect.transform(doc)

The problem is that this returns a matrix with n rows where n is the size of my doc string. I want it to return just a single vector representing the tf-idf for the entire string. How can I make this see the string as a single document, rather than each character being a document? Also, I am very new to text mining so if I am doing something wrong conceptually, that would be great to know. Any help is appreciated.

asked Nov 21 '13 by Sterling


People also ask

What does TfidfVectorizer transform do?

TfidfVectorizer converts a collection of raw documents to a matrix of TF-IDF features. It is equivalent to CountVectorizer followed by TfidfTransformer.
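For instance, a minimal sketch on a made-up two-document corpus (the strings and variable names below are just for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]  # toy corpus

vect = TfidfVectorizer()
X = vect.fit_transform(docs)  # sparse matrix: one row per document, one column per term

print(X.shape)           # (2, 7) -- 2 documents, 7 distinct terms
print(vect.vocabulary_)  # mapping from term to column index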

What is the difference between TfidfTransformer and TfidfVectorizer?

With TfidfTransformer you compute word counts using CountVectorizer, then compute the Inverse Document Frequency (IDF) values, and only then compute the TF-IDF scores. With TfidfVectorizer, by contrast, you do all three steps at once.
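A quick sketch of that equivalence, assuming default parameters and a toy corpus:

import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer, TfidfTransformer,
                                             TfidfVectorizer)

docs = ["the cat sat on the mat", "the dog sat on the log"]  # toy corpus

# Step-by-step route: counts first, then IDF weighting.
counts = CountVectorizer().fit_transform(docs)
tfidf_two_step = TfidfTransformer().fit_transform(counts)

# All-in-one route.
tfidf_one_step = TfidfVectorizer().fit_transform(docs)

print(np.allclose(tfidf_two_step.toarray(), tfidf_one_step.toarray()))  # True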

What is the difference between CountVectorizer and TfidfVectorizer?

TfidfVectorizer is often preferred over CountVectorizer because it not only captures the frequency of words in the corpus but also reflects how important each word is. Words that are less important for the analysis can then be removed, which makes model building less complex by reducing the input dimensions.
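A small sketch of that effect, using a made-up corpus where "the" appears in every document and therefore gets a lower tf-idf weight than rarer words:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat", "the dog ran", "the bird flew"]  # toy corpus

cv = CountVectorizer()
counts = cv.fit_transform(docs).toarray()

tv = TfidfVectorizer()
weights = tv.fit_transform(docs).toarray()

print(counts[:, cv.vocabulary_["the"]])   # [1 1 1] -- "the" counts like any other word
print(weights[:, tv.vocabulary_["the"]])  # lower than the weights of the rarer words in each row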

What is TF-IDF transformation?

TF-IDF stands for term frequency-inverse document frequency. It is a measure, used in the fields of information retrieval (IR) and machine learning, that quantifies the importance or relevance of string representations (words, phrases, lemmas, etc.) in a document relative to a collection of documents (also known as a corpus).
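As a rough numeric sketch, assuming scikit-learn's default settings (smooth_idf=True, norm='l2') and made-up counts:

import numpy as np

n_docs = 4  # total number of documents in the corpus (hypothetical)
df = 2      # number of documents containing the term (hypothetical)
tf = 3      # occurrences of the term in the current document (hypothetical)

idf = np.log((1 + n_docs) / (1 + df)) + 1  # smoothed inverse document frequency
tfidf = tf * idf                           # unnormalized tf-idf weight
print(idf, tfidf)
# Each document's vector of such weights is then L2-normalized, so the final
# value also depends on the other terms in the same document.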


1 Answer

If you want to compute tf-idf only for a given vocabulary, use the vocabulary argument of the TfidfVectorizer constructor:

vocabulary = "a list of words I want to look for in the documents".split()
vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word',
                       stop_words='english', vocabulary=vocabulary)

Then, to fit the vectorizer on a given corpus, i.e. an iterable of documents (this computes the document frequencies that determine the IDF weights), use fit:

vect.fit(corpus) 

The fit_transform method is shorthand for:

vect.fit(corpus)
corpus_tf_idf = vect.transform(corpus)

Finally, the transform method expects a corpus, so for a single document you should pass it wrapped in a list; otherwise the string itself is treated as an iterable of characters, with each character counted as a document.

doc_tfidf = vect.transform([doc]) 
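Putting it all together, a minimal end-to-end sketch; the training strings in corpus below are invented for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer

vocabulary = "a list of words I want to look for in the documents".split()
vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word',
                       stop_words='english', vocabulary=vocabulary)

# Hypothetical training corpus: any iterable of raw text strings works.
corpus = [
    "some of the words appear in many documents",
    "other words appear in only a few documents",
]
vect.fit(corpus)

doc = "some string I want to get tf-idf vector for"
doc_tfidf = vect.transform([doc])  # note the list: one document in, one row out
print(doc_tfidf.shape)             # (1, len(vocabulary))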
answered Sep 20 '22 by alko