I am processing a large amount of text data in sklearn. First I need to vectorize the text content (word counts) and then apply a TfidfTransformer. The following code doesn't seem to pass the output of CountVectorizer to the input of TfidfTransformer.
TEXT = [data[i].values()[3] for i in range(len(data))]
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
vectorizer = CountVectorizer(min_df=0.01,max_df = 2.5, lowercase = False, stop_words = 'english')
X = vectorizer(TEXT)
transformer = TfidfTransformer(X)
X = transformer.fit_transform()
When I run this code, I get this error:
Traceback (most recent call last):
File "nlpQ2.py", line 27, in <module>
X = vectorizer(TEXT)
TypeError: 'CountVectorizer' object is not callable
I thought I had vectorized the text and now it's in a matrix -- is there a transition step that I have missed? Thank you!!
This line
X = vectorizer(TEXT)
does not produce the vectorizer's output, and it is the line raising the exception (the error has nothing to do with TF-IDF itself). A CountVectorizer object is not callable; you are supposed to call its fit_transform method instead. Your next call is also wrong: the data must be passed as an argument to fit_transform, not to the constructor.
X = vectorizer.fit_transform(TEXT)
transformer = TfidfTransformer()
X = transformer.fit_transform(X)
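To see the two steps working end to end, here is a minimal sketch on a tiny made-up corpus. The `docs` list and the variable names `counts` and `tfidf` are illustrative assumptions, not part of the original code, and the default constructor arguments are used instead of the asker's settings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus standing in for TEXT
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# Step 1: learn the vocabulary and build a sparse matrix of raw term counts
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# Step 2: reweight the counts with tf-idf; the data goes to fit_transform,
# not to the TfidfTransformer constructor
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(counts)

# Both matrices have one row per document and one column per vocabulary term
print(counts.shape, tfidf.shape)
```

The second matrix has the same shape as the first; only the values change, and with the default settings each row is scaled to unit L2 norm.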
You're probably looking for a pipeline, perhaps something like this:
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
])
or
from sklearn.pipeline import make_pipeline
pipeline = make_pipeline(CountVectorizer(), TfidfTransformer())
On this pipeline, perform the regular operations (e.g., fit, fit_transform, and so forth).
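As a rough illustration of what those operations look like in practice, the sketch below fits a pipeline on a small made-up corpus and then reuses it on unseen text. The `docs` list and the names `X` and `X_new` are assumptions for the example only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; replace with your own list of strings
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

pipeline = make_pipeline(CountVectorizer(), TfidfTransformer())

# fit_transform runs both steps: count the terms, then apply tf-idf weighting
X = pipeline.fit_transform(docs)

# transform reuses the fitted vocabulary and idf weights on new documents
X_new = pipeline.transform(["the cat and the dog"])

print(X.shape, X_new.shape)
```

The pipeline keeps the fitted vocabulary, so new documents are mapped onto the same columns as the training matrix.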
See this example also.