Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Multi-Threaded NLP with Spacy pipe

I'm trying to apply Spacy NLP (Natural Language Processing) pipline to a big text file like Wikipedia Dump. Here is my code based on Spacy's documentation example:

from spacy.en import English

input = open("big_file.txt")
big_text= input.read()
input.close()

nlp= English()    

out = nlp.pipe([unicode(big_text, errors='ignore')], n_threads=-1)
doc = out.next() 

Spacy applies all nlp operations like POS tagging, Lemmatizing and etc all at once. It is like a pipeline for NLP that takes care of everything you need in one step. Applying pipe method tho is supposed to make the process a lot faster by multithreading the expensive parts of the pipeline. But I don't see big improvement in speed and my CPU usage is around 25% (only one of 4 cores working). I also tried to read the file in multiple chuncks and increase the batch of input texts:

out = nlp.pipe([part1, part2, ..., part4], n_threads=-1)

but still the same performance. Is there anyway to speed up the process? I suspect that OpenMP feature should be enabled compiling Spacy to utilize multi-threading feature. But there is no instructions on how to do it on Windows.

like image 493
SJ Bey Avatar asked Apr 08 '16 21:04

SJ Bey


People also ask

What is NLP pipe in spaCy?

When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps – this is also referred to as the processing pipeline.

What is En_core_web_sm spaCy?

For example, en_core_web_sm is a small English pipeline trained on written web text (blogs, news, comments), that includes vocabulary, syntax and entities.

What is .spaCy file?

Typically, the extension for these binary files is . spacy , and they are used as input format for specifying a training corpus and for spaCy's CLI train command. The built-in convert command helps you convert spaCy's previous JSON format to the new binary format.

What does spaCy load (' en ') do?

Essentially, spacy. load() is a convenience wrapper that reads the pipeline's config. cfg , uses the language and pipeline information to construct a Language object, loads in the model data and weights, and returns it.


1 Answers

I figured what the problem was. OpenMP is the package used in implementing multithreading for spacy pipe() method. This option is disabled for MSVC compiler by default. After I compiled the source code with openmp support it works great. I also made a pull request to enable this on the next releases. So for releases after 0.100.7 (which is the latest version) multithreading with pipe() should work on Windows with no issue.

like image 157
SJ Bey Avatar answered Sep 28 '22 02:09

SJ Bey