I am trying to get the tf-idf vector for a single document using Sklearn's TfidfVectorizer object. I create a vocabulary based on some training documents and use fit_transform to train the TfidfVectorizer. Then, I want to find the tf-idf vectors for any given testing document.
from sklearn.feature_extraction.text import TfidfVectorizer

self.vocabulary = "a list of words I want to look for in the documents".split()
self.vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word',
                            stop_words='english')
self.vect.fit_transform(self.vocabulary)
...
doc = "some string I want to get tf-idf vector for"
tfidf = self.vect.transform(doc)
The problem is that this returns a matrix with n rows, where n is the length of my doc string. I want it to return a single vector representing the tf-idf for the entire string. How can I make it treat the string as a single document, rather than each character as a document? Also, I am very new to text mining, so if I am doing something wrong conceptually it would be great to know. Any help is appreciated.
TfidfVectorizer converts a collection of raw documents to a matrix of TF-IDF features. It is equivalent to CountVectorizer followed by TfidfTransformer.
With TfidfTransformer you first compute the word counts using CountVectorizer, then compute the Inverse Document Frequency (IDF) values, and only then compute the TF-IDF scores. With TfidfVectorizer, by contrast, you do all three steps at once.
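As a minimal sketch of that equivalence (the two-document corpus below is made up purely for illustration):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

# toy corpus, purely for illustration
corpus = ["the cat sat on the mat", "the dog sat on the log"]

# route 1: raw counts first, then TF-IDF weighting
counts = CountVectorizer().fit_transform(corpus)
tfidf_a = TfidfTransformer().fit_transform(counts)

# route 2: TfidfVectorizer does both steps in one object
tfidf_b = TfidfVectorizer().fit_transform(corpus)

# both are sparse matrices of shape (n_documents, n_terms)
print(tfidf_a.shape, tfidf_b.shape)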
TF-IDF is often more useful than plain count vectors because it captures not only how frequently words appear in the corpus but also how informative each word is. Words with low scores can then be dropped, which reduces the input dimensionality and keeps model building simpler.
TF-IDF stands for term frequency-inverse document frequency. It is a measure, used in information retrieval (IR) and machine learning, that quantifies the importance or relevance of string representations (words, phrases, lemmas, etc.) in a document relative to a collection of documents (also known as a corpus).
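For a concrete sense of the weighting, here is a minimal sketch of scikit-learn's default formula (smooth IDF, before the final L2 normalization); the counts are made up:

import numpy as np

# made-up numbers: the term occurs 3 times in the document (tf)
# and appears in 2 of n = 4 documents overall (df)
tf, df, n = 3, 2, 4

# scikit-learn's default smooth_idf: idf = ln((1 + n) / (1 + df)) + 1
idf = np.log((1 + n) / (1 + df)) + 1
weight = tf * idf  # one entry of the document's row vector, prior to L2 normalization
print(weight)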
If you want to compute tf-idf only for a given vocabulary, pass the vocabulary argument to the TfidfVectorizer constructor:
vocabulary = "a list of words I want to look for in the documents".split()
vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word',
                       stop_words='english', vocabulary=vocabulary)
Then, to fit the vectorizer, i.e. calculate the counts, on a given corpus (an iterable of documents), use fit:
vect.fit(corpus)
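Here corpus is just an iterable of strings. A minimal sketch with made-up training documents (the idf_ attribute is scikit-learn's, the data is not):

corpus = [
    "first training document about documents",
    "second training document about words",
]  # made-up training texts
vect.fit(corpus)
print(vect.idf_)  # one learned IDF weight per term of the fixed vocabulary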
The fit_transform method is a shorthand for:

vect.fit(corpus)
corpus_tf_idf = vect.transform(corpus)
Last, the transform method accepts a corpus, so for a single document you should pass it as a list; otherwise the string is treated as an iterable of symbols, each character becoming its own document:
doc_tfidf = vect.transform([doc])
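Putting it all together, a minimal end-to-end sketch (the vocabulary, corpus, and test document below are placeholders, not the asker's data):

from sklearn.feature_extraction.text import TfidfVectorizer

vocabulary = "machine learning text mining".split()  # placeholder vocabulary
corpus = ["text mining with machine learning",
          "a second document about text"]  # placeholder training corpus

vect = TfidfVectorizer(sublinear_tf=True, analyzer='word',
                       stop_words='english', vocabulary=vocabulary)
vect.fit(corpus)

doc = "machine learning for text"  # placeholder test document
doc_tfidf = vect.transform([doc])  # note the list: one document in, one row out
print(doc_tfidf.shape)  # (1, len(vocabulary)): a single tf-idf vector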