How do I run scikit-learn's TfidfVectorizer (and CountVectorizer) as parallel jobs, similar to the n_jobs=-1 parameter available in other sklearn models?
This is not directly possible, because there is no way to parallelize or distribute access to the vocabulary that these vectorizers need to build.
To perform parallel document vectorization, use HashingVectorizer instead. The scikit-learn docs provide an example that uses this vectorizer to train (and evaluate) a classifier in batches; a similar workflow also works for parallelization, because input terms are mapped to the same vector indices without any communication between the parallel workers.
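As a point of reference, here is a minimal sketch of that batch-training idea (out-of-core learning with HashingVectorizer and partial_fit). The iter_batches generator and the two-class labels are placeholders for your own data loading, not part of scikit-learn's API:

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**20)  # stateless: no fit, no vocabulary
clf = SGDClassifier()
all_classes = np.array([0, 1])  # all classes must be known up front for partial_fit

for batch_docs, batch_labels in iter_batches():  # placeholder: your own batch generator
    X = vectorizer.transform(batch_docs)          # hashing only, no shared state
    clf.partial_fit(X, batch_labels, classes=all_classes)
```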
Simply compute the partial term-document matrices separately and concatenate them once all jobs are done. At that point you may also run TfidfTransformer on the concatenated matrix.
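Here is a rough sketch of that parallel workflow using joblib; the chunking helper, chunk_size, and n_features values are illustrative choices, not something scikit-learn mandates:

```python
from joblib import Parallel, delayed
from scipy.sparse import vstack
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

def vectorize_chunk(docs):
    # Each worker builds its own HashingVectorizer with identical settings,
    # so the same term always hashes to the same column index.
    vectorizer = HashingVectorizer(n_features=2**20, norm=None, alternate_sign=False)
    return vectorizer.transform(docs)

def parallel_tfidf(documents, n_jobs=-1, chunk_size=10_000):
    chunks = [documents[i:i + chunk_size]
              for i in range(0, len(documents), chunk_size)]
    # Compute the partial term-document matrices in parallel ...
    partial_matrices = Parallel(n_jobs=n_jobs)(
        delayed(vectorize_chunk)(chunk) for chunk in chunks)
    # ... concatenate them once all jobs are done ...
    counts = vstack(partial_matrices)
    # ... and apply the IDF weighting to the full matrix.
    return TfidfTransformer().fit_transform(counts)
```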
The most significant drawback of not storing the vocabulary of input terms is that it is difficult to find out which terms are mapped to which column in the final matrix (i.e. an inverse transform). The only efficient mapping is to apply the hashing function to a term to see which column/index it is assigned to. For an inverse transform, you would need to do this for all unique terms (i.e. your vocabulary).
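If you do need that inverse mapping, one workable (if clunky) approach is to hash each term of a vocabulary you collected yourself, using the same vectorizer settings, and record which column it lands in. The known_terms list below is purely illustrative:

```python
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(n_features=2**20, norm=None, alternate_sign=False)

known_terms = ["data", "science", "tfidf"]  # example terms, collected separately by you
term_to_column = {
    term: int(vectorizer.transform([term]).nonzero()[1][0])  # column index in the hashed space
    for term in known_terms
}
```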