In this part of the documentation, it is mentioned that nlp.pipe()
works in parallel and the following example is given:
for doc in nlp.pipe(texts, batch_size=10000, n_threads=3):
    pass
After that, a longer example is given that uses joblib. I don't quite understand the relationship between the two. As I understand the documentation, if I simply want to parallelize the tokenisation of many documents, the simple for loop above will work and I do not need to use joblib, right?
My pipeline looks like this:
nlp = spacy.load('en', disable=['parser', 'ner', 'textcat'])
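For concreteness, here is a minimal, runnable version of what I mean; texts is just a placeholder list of strings standing in for my real documents:

import spacy

nlp = spacy.load('en', disable=['parser', 'ner', 'textcat'])
texts = ['This is one short text.', 'Here is another one.']

# nlp.pipe streams the texts and batches them internally,
# so I only iterate over the resulting Doc objects.
for doc in nlp.pipe(texts, batch_size=1000):
    tokens = [token.text for token in doc]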
When would I need to use joblib?
Based on an answer in the Spacy GitHub issues:
We kept the n_threads argument to avoid breaking people's code, but unfortunately the implementation doesn't currently release the GIL, the way we did in v1. In v2 the neural network model is more complicated and more subject to change, so we haven't implemented it in Cython. We might at a later date.
In v2.1.0 (you can get an alpha by installing spacy-nightly), the matrix multiplications are now single-threaded. This makes it safe to launch multiple processes for the pipeline, so we can look at doing that internally. In the meantime, the n_threads argument sits idle... which I agree is confusing, but removing it and breaking backwards compatibility seems worse.
Thus, to summarize: n_threads has no effect in v2. What I'm doing now is using Spacy with joblib to process a dataset in minibatches.
Spacy released an example for that: Spacy Multiprocessing, and it works perfectly.
I have a dataset of roughly 4M short texts. Without the example they released, it took almost 23 hours to finish parsing them, but using joblib with Spacy it took an hour and a half!
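In case it helps, here is a rough sketch of the approach along the lines of the official example; the batch size, n_jobs value, and the process_batch helper are illustrative choices of mine, not taken from the example itself:

import spacy
from joblib import Parallel, delayed
from spacy.util import minibatch

def process_batch(batch):
    # Each worker call loads its own copy of the pipeline,
    # so nothing has to be shared between processes.
    nlp = spacy.load('en', disable=['parser', 'ner', 'textcat'])
    return [[token.text for token in doc] for doc in nlp.pipe(batch)]

if __name__ == '__main__':
    texts = ['This is one short text.', 'Here is another one.'] * 1000
    # Split the texts into minibatches and hand each batch to a worker process.
    batches = minibatch(texts, size=1000)
    results = Parallel(n_jobs=4)(delayed(process_batch)(b) for b in batches)
    # results is a list of per-batch outputs; flatten it if needed.
    tokenized = [tokens for batch in results for tokens in batch]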
For readers of this question, here is the official example again: Spacy Multiprocessing.