 

Sklearn TFIDF vectorizer to run as parallel jobs

How can I run the sklearn TFIDF vectorizer (and COUNT vectorizer) as parallel jobs? Something similar to the n_jobs=-1 parameter in other sklearn models.

asked Feb 08 '15 by sbalajis


1 Answer

This is not directly possible, because there is no way to parallelize/distribute access to the vocabulary that these vectorizers need.

To perform parallel document vectorization, use the HashingVectorizer instead. The scikit-learn docs provide an out-of-core example that uses this vectorizer to train (and evaluate) a classifier in batches. A similar workflow also works for parallelization, because input terms are mapped to the same vector indices without any communication between the parallel workers.
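A minimal sketch of that idea, using joblib for the parallelism (the toy corpus, chunk size, and n_features value are illustrative placeholders, not part of the original answer):

    from joblib import Parallel, delayed
    from sklearn.feature_extraction.text import HashingVectorizer

    # Toy corpus, split into chunks that will be vectorized in parallel.
    docs = ["the first document", "the second document",
            "the third document", "one more document"]
    chunks = [docs[i:i + 2] for i in range(0, len(docs), 2)]

    # HashingVectorizer is stateless: every worker maps a term to the same
    # column via the hash function, so no vocabulary has to be shared.
    # norm=None and alternate_sign=False keep the output as plain
    # non-negative term counts, which TfidfTransformer expects later.
    vectorizer = HashingVectorizer(n_features=2**18, norm=None,
                                   alternate_sign=False)

    # One job per chunk; each job returns a partial sparse term-doc matrix.
    partial_matrices = Parallel(n_jobs=-1)(
        delayed(vectorizer.transform)(chunk) for chunk in chunks)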

Simply compute the partial term-doc matrices separately and concatenate them once all jobs are done. At this point you may also run TfidfTransformer on the concatenated matrix.
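Continuing the sketch above, the partial matrices can be stacked row-wise with scipy.sparse.vstack and then re-weighted with TfidfTransformer:

    import scipy.sparse as sp
    from sklearn.feature_extraction.text import TfidfTransformer

    # Stack the partial matrices into the full term-doc matrix,
    # then (optionally) apply tf-idf weighting to the counts.
    X_counts = sp.vstack(partial_matrices)
    X_tfidf = TfidfTransformer().fit_transform(X_counts)
    print(X_tfidf.shape)   # (n_docs, n_features)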

The most significant drawback of not storing the vocabulary of input terms is that it is difficult to find out which term is mapped to which column in the final matrix (i.e. the inverse transform). The only efficient mapping is to apply the hashing function to a term and see which column/index it is assigned to. For an inverse transform, you would need to do this for all unique terms (i.e. your vocabulary).
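A rough sketch of that inverse mapping, reusing the vectorizer from the sketch above (the vocab list is a placeholder for your actual unique terms): transform each term as a one-word document and read off its nonzero column.

    # Hash every known term through the same vectorizer and record the
    # column it lands in.
    vocab = ["first", "second", "third", "document"]

    term_to_col = {}
    for term in vocab:
        cols = vectorizer.transform([term]).indices   # nonzero column(s)
        if cols.size:
            term_to_col[term] = int(cols[0])

    # Invert the mapping; hash collisions mean one column can hold
    # several terms, so collect them in a list.
    col_to_terms = {}
    for term, col in term_to_col.items():
        col_to_terms.setdefault(col, []).append(term)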

answered Sep 23 '22 by AliOli