I'm trying to apply Spacy NLP (Natural Language Processing) pipline to a big text file like Wikipedia Dump. Here is my code based on Spacy's documentation example:
from spacy.en import English
input = open("big_file.txt")
big_text= input.read()
input.close()
nlp= English()
out = nlp.pipe([unicode(big_text, errors='ignore')], n_threads=-1)
doc = out.next()
Spacy applies all nlp operations like POS tagging, Lemmatizing and etc all at once. It is like a pipeline for NLP that takes care of everything you need in one step. Applying pipe method tho is supposed to make the process a lot faster by multithreading the expensive parts of the pipeline. But I don't see big improvement in speed and my CPU usage is around 25% (only one of 4 cores working). I also tried to read the file in multiple chuncks and increase the batch of input texts:
out = nlp.pipe([part1, part2, ..., part4], n_threads=-1)
but still the same performance. Is there anyway to speed up the process? I suspect that OpenMP feature should be enabled compiling Spacy to utilize multi-threading feature. But there is no instructions on how to do it on Windows.
When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps – this is also referred to as the processing pipeline.
For example, en_core_web_sm is a small English pipeline trained on written web text (blogs, news, comments), that includes vocabulary, syntax and entities.
Typically, the extension for these binary files is . spacy , and they are used as input format for specifying a training corpus and for spaCy's CLI train command. The built-in convert command helps you convert spaCy's previous JSON format to the new binary format.
Essentially, spacy. load() is a convenience wrapper that reads the pipeline's config. cfg , uses the language and pipeline information to construct a Language object, loads in the model data and weights, and returns it.
I figured what the problem was. OpenMP is the package used in implementing multithreading for spacy pipe() method. This option is disabled for MSVC compiler by default. After I compiled the source code with openmp support it works great. I also made a pull request to enable this on the next releases. So for releases after 0.100.7 (which is the latest version) multithreading with pipe() should work on Windows with no issue.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With