 

Sklearn TFIDF vectorizer to run as parallel jobs

How can I run the sklearn TFIDF vectorizer (and COUNT vectorizer) as parallel jobs? Something similar to the n_jobs=-1 parameter in other sklearn models.

asked Feb 08 '15 by sbalajis


1 Answer

This is not directly possible, because there is no way to parallelize/distribute access to the vocabulary that these vectorizers need.

To perform parallel document vectorization, use the HashingVectorizer instead. The scikit-learn docs provide an out-of-core example that uses this vectorizer to train (and evaluate) a classifier in batches. A similar workflow also works for parallelization, because input terms are mapped to the same vector indices without any communication between the parallel workers.
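A minimal sketch of that idea, using joblib for the parallelism (the toy corpus, chunk size, and n_features value are illustrative placeholders, not part of the original answer):

    from joblib import Parallel, delayed
    from sklearn.feature_extraction.text import HashingVectorizer

    # Toy corpus, split into chunks that will be vectorized in parallel.
    docs = ["the first document", "the second document",
            "the third document", "one more document"]
    chunks = [docs[i:i + 2] for i in range(0, len(docs), 2)]

    # HashingVectorizer is stateless: every worker maps a term to the same
    # column via the hash function, so no vocabulary has to be shared.
    # norm=None and alternate_sign=False keep the output as plain
    # non-negative term counts, which TfidfTransformer expects later.
    vectorizer = HashingVectorizer(n_features=2**18, norm=None,
                                   alternate_sign=False)

    # One job per chunk; each job returns a partial sparse term-doc matrix.
    partial_matrices = Parallel(n_jobs=-1)(
        delayed(vectorizer.transform)(chunk) for chunk in chunks)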

Simply compute the partial term-doc matrices separately and concatenate them once all jobs are done. At this point you may also run TfidfTransformer on the concatenated matrix.
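Continuing the sketch above, the partial matrices can be stacked row-wise with scipy.sparse.vstack and then re-weighted with TfidfTransformer:

    import scipy.sparse as sp
    from sklearn.feature_extraction.text import TfidfTransformer

    # Stack the partial matrices into the full term-doc matrix,
    # then (optionally) apply tf-idf weighting to the counts.
    X_counts = sp.vstack(partial_matrices)
    X_tfidf = TfidfTransformer().fit_transform(X_counts)
    print(X_tfidf.shape)   # (n_docs, n_features)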

The most significant drawback of not storing the vocabulary of input terms is that it is difficult to find out which term is mapped to which column in the final matrix (i.e. the inverse transform). The only efficient mapping is to apply the hashing function to a term and see which column/index it is assigned to. For an inverse transform, you would need to do this for all unique terms (i.e. your vocabulary).
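A rough sketch of that inverse mapping, reusing the vectorizer from the sketch above (the vocab list is a placeholder for your actual unique terms): transform each term as a one-word document and read off its nonzero column.

    # Hash every known term through the same vectorizer and record the
    # column it lands in.
    vocab = ["first", "second", "third", "document"]

    term_to_col = {}
    for term in vocab:
        cols = vectorizer.transform([term]).indices   # nonzero column(s)
        if cols.size:
            term_to_col[term] = int(cols[0])

    # Invert the mapping; hash collisions mean one column can hold
    # several terms, so collect them in a list.
    col_to_terms = {}
    for term, col in term_to_col.items():
        col_to_terms.setdefault(col, []).append(term)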

answered Sep 23 '22 by AliOli