How can I pass a preprocessor to TfidfVectorizer? - sklearn - python

Tags: python, preprocessor, scikit-learn

How can I pass a preprocessor to TfidfVectorizer? I wrote a function that takes a string and returns a preprocessed string, then set the preprocessor parameter to that function (preprocessor=preprocess), but it doesn't work. I've searched many times but didn't find any examples, as if no one uses it.

I have another question. Does the preprocessor parameter override stop-word removal and lowercasing, which can be done with the stop_words and lowercase parameters?

asked May 24 '14 22:05

eman

1 Answer

You simply define a function that takes a string as input and returns the preprocessed string. For example, a trivial function that uppercases strings would look like this:

def preProcess(s):
    return s.upper()

Once you have your function, just pass it to your TfidfVectorizer object. For example:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
     'This is the first document.',
     'This is the second second document.',
     'And the third one.',
     'Is this the first document?'
     ]

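X = TfidfVectorizer(preprocessor=preProcess)
X.fit(corpus)
# get_feature_names() was removed in scikit-learn 1.2;
# on newer versions call X.get_feature_names_out() instead.
X.get_feature_names()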

Results in:

[u'AND', u'DOCUMENT', u'FIRST', u'IS', u'ONE', u'SECOND', u'THE', u'THIRD', u'THIS']
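
If a preprocessor seems to have no effect, one way to check it is to run a single string through the vectorizer's analyzer before fitting. The sketch below is only illustrative: clean_text and the two sample documents are made up for this example, and it reads the fitted vocabulary_ attribute so that it does not depend on which feature-name method your scikit-learn version provides.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer


def clean_text(text):
    """Hypothetical preprocessor: lowercase the text and blank out digit runs."""
    text = text.lower()
    return re.sub(r"\d+", " ", text)


# Made-up sample documents, used only for illustration.
docs = ["Order 66 was SHIPPED", "Order 12 was delayed"]

vectorizer = TfidfVectorizer(preprocessor=clean_text)

# build_analyzer() returns the full pipeline (preprocess, tokenize, n-grams),
# so it shows what the vectorizer will actually see for one string.
analyze = vectorizer.build_analyzer()
tokens = analyze("Order 66 was SHIPPED")
print(tokens)

vectorizer.fit(docs)
features = sorted(vectorizer.vocabulary_)
print(features)
```

Without the preprocessor, the default token pattern would keep "66" and "12" as features, so their absence from the vocabulary is a quick sign that the function is being called.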

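This indirectly answers your follow-up question: even though lowercase defaults to True, the uppercasing preprocess function overrides it. A custom preprocessor replaces the built-in string transformation (lowercasing and accent stripping), while stop-word removal is a separate step applied to the tokens afterwards, so stop_words still takes effect. This is also mentioned in the documentation: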

preprocessor : callable or None (default) Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps.
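
A small sketch of how the three parameters interact, assuming a recent scikit-learn. The function name shout and the uppercase stop-word list are made up for illustration. Because stop words are matched against the tokens the preprocessor produces, the list has to be in the same case as its output.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def shout(text):
    """Hypothetical preprocessor that uppercases the text."""
    return text.upper()


docs = ["This is the first document.", "And the third one."]

# lowercase=True has no effect here: the custom preprocessor replaces
# the built-in lowercasing step.
upper = TfidfVectorizer(preprocessor=shout, lowercase=True).fit(docs)
upper_vocab = sorted(upper.vocabulary_)
print(upper_vocab)

# stop_words is still applied, but to the tokens after preprocessing,
# so an uppercasing preprocessor needs an uppercase stop-word list.
filtered = TfidfVectorizer(
    preprocessor=shout,
    stop_words=["THE", "IS", "AND", "THIS"],
).fit(docs)
filtered_vocab = sorted(filtered.vocabulary_)
print(filtered_vocab)
```

If you want the usual lowercasing together with your own cleanup, the simplest option is to call text.lower() inside the preprocessor yourself.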

answered Oct 06 '22 17:10

David
