
How can I pass a preprocessor to TfidfVectorizer? - sklearn - python

How can I pass a preprocessor to TfidfVectorizer? I wrote a function that takes a string and returns a preprocessed string, then set the preprocessor parameter to that function (preprocessor=preprocess), but it doesn't work. I've searched many times but couldn't find a single example, as if no one uses it.

I have another question. Does the preprocessor parameter override the stop-word removal and lowercasing that are otherwise controlled by the stop_words and lowercase parameters?

eman asked May 24 '14 22:05



1 Answer

You simply define a function that takes a string as input and returns the preprocessed string. For example, a trivial function that uppercases strings would look like this:

def preProcess(s):
    return s.upper()

Once you have your function, pass it to the TfidfVectorizer constructor. For example:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
     'This is the first document.',
     'This is the second second document.',
     'And the third one.',
     'Is this the first document?'
     ]

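# Pass the function itself (no parentheses) as the preprocessor.
vectorizer = TfidfVectorizer(preprocessor=preProcess)
vectorizer.fit(corpus)
# scikit-learn 1.0+; on older versions use get_feature_names() instead.
list(vectorizer.get_feature_names_out())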

Results in:

['AND', 'DOCUMENT', 'FIRST', 'IS', 'ONE', 'SECOND', 'THE', 'THIRD', 'THIS']

This also answers part of your follow-up question: lowercase defaults to True, yet the tokens come out uppercase, so a custom preprocessor overrides the built-in lowercasing step. This is also mentioned in the documentation:

preprocessor : callable or None (default) Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps.
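
As for stop_words, a quick way to check is to call the vectorizer's build_analyzer() on a sample string. The sketch below is only an illustration: the helper names strip_digits and strip_digits_lower, the sample sentence, and the stop-word list are made up for this example. It assumes the default word analyzer, where stop words are filtered after tokenization and matched case-sensitively.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def strip_digits(text):
    # Hypothetical preprocessor: drops digits and deliberately does NOT lowercase.
    return "".join(ch for ch in text if not ch.isdigit())


def strip_digits_lower(text):
    # Same idea, but lowercases explicitly, because lowercase=True is
    # bypassed once a custom preprocessor is supplied.
    return strip_digits(text).lower()


sample = "The Cat sat on the mat 42 times"
stop_words = ["the", "on"]  # illustrative list, not sklearn's built-in one

# lowercase=True is set here, but the custom preprocessor replaces that step.
keeps_case = TfidfVectorizer(
    preprocessor=strip_digits, lowercase=True, stop_words=stop_words
)
tokens_keep = keeps_case.build_analyzer()(sample)

# Lowercasing inside the preprocessor restores the usual behaviour.
lowers = TfidfVectorizer(preprocessor=strip_digits_lower, stop_words=stop_words)
tokens_lower = lowers.build_analyzer()(sample)

print(tokens_keep)
print(tokens_lower)
```

In the first case the capitalised "The" survives while the lowercase "the" and "on" are removed, which suggests that stop_words is still applied but lowercase is not. If you want both, do the lowercasing inside your own preprocessor, as in the second vectorizer.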

David answered Oct 06 '22 17:10