 

Speed up Spacy Named Entity Recognition

Tags: python, nlp, spacy

I'm using spacy to recognize street addresses on web pages.

My model is initialized basically using spacy's new entity type sample code found here: https://github.com/explosion/spaCy/blob/master/examples/training/train_new_entity_type.py

My training data consists of plain text webpages with their corresponding Street Address entities and character positions.

I was able to quickly build a model in spacy to start making predictions, but I found its prediction speed to be very slow.

My code works by iterating through several raw HTML pages and feeding each page's plain text version into spaCy as it iterates. For reasons I can't get into, I need to make predictions with spaCy page by page, inside the iteration loop.

After the model is loaded, I'm using the standard way of making predictions, which I'm referring to as the prediction/evaluation phase:

  doc = nlp(plain_text_webpage)

  if len(doc.ents) > 0:
      print("found entity")
Questions:

  1. How can I speed up the entity prediction/recognition phase? I'm using a c4.8xlarge instance on AWS and all 36 cores are constantly maxed out while spaCy is evaluating the data. spaCy turns processing a few million webpages from a 1-minute job into a 1-hour+ job.

  2. Will the speed of entity recognition improve as my model becomes more accurate?

  3. Is there a way to remove pipeline components like the tagger during this phase? Can NER be decoupled like that and still be accurate? Will removing other components affect the model itself, or is it just temporary?

  4. I saw that a GPU can be used during the NER training phase; can it also be used in this evaluation phase in my code for faster predictions?


Update:

I managed to significantly cut down the processing time by:

  1. Using a custom tokenizer (I used the one in the docs)

  2. Disabling other pipelines that aren't for Named Entity Recognition

  3. Instead of feeding the whole body of text from each webpage into spacy, I'm only sending over a maximum of 5,000 characters

My updated code to load the model:

# Load the custom model with the non-NER pipeline components disabled
nlp = spacy.load('test_model/', disable=['parser', 'tagger', 'textcat'])
# Swap in the custom whitespace tokenizer from the docs
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
doc = nlp(text)
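
For reference, a minimal sketch of the whitespace tokenizer from the spaCy docs (the docs example, not necessarily my exact class) plus the 5,000-character cutoff; the `plain_text_webpage` name is just illustrative:

from spacy.tokens import Doc

class WhitespaceTokenizer(object):
    """Whitespace-only tokenizer, as in the spaCy docs custom-tokenizer example."""
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(' ')
        # Assume every token is followed by a single space
        spaces = [True] * len(words)
        return Doc(self.vocab, words=words, spaces=spaces)

# Only feed the first 5,000 characters of each page to spaCy
doc = nlp(plain_text_webpage[:5000])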

However, it is still too slow (20X slower than I need it to be).

Questions:

  1. Are there any other improvements I can make to speed up the Named Entity Recognition? Any fat I can cut from spacy?

  2. I'm still looking to see if a GPU-based solution would help - I saw that GPU use is supported during the Named Entity Recognition training phase; can it also be used in this evaluation phase in my code for faster predictions?

asked Apr 06 '18 by podcastguy




1 Answer

Please see here for details about speed troubleshooting: https://github.com/explosion/spaCy/issues/1508

The most important things:

1) Check which BLAS library numpy is linked against, and make sure it's compiled well for your machine. Using conda is helpful, as then you get Intel's MKL.
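
For example, a quick way to see what numpy is linked against (look for 'mkl' or 'openblas' in the output):

import numpy as np

# Prints the BLAS/LAPACK libraries numpy was built against
np.show_config()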

2)

c4.8xlarge instance on AWS and all 36 cores are constantly maxed out when spacy is evaluating the data.

That's probably bad. We can only really parallelise the matrix multiplications at the moment, because we're using numpy --- so there's no way to thread larger chunks. This means the BLAS library is probably launching too many threads. In general you can only profitably use 3-4 cores per process. Try setting the environment variables for your BLAS library to restrict the number of threads.
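
For example, in Python you can cap the BLAS thread pools before numpy is imported; the value 4 here is just illustrative:

import os

# These must be set before numpy (and spaCy) are imported
os.environ["OMP_NUM_THREADS"] = "4"        # generic OpenMP
os.environ["OPENBLAS_NUM_THREADS"] = "4"   # OpenBLAS
os.environ["MKL_NUM_THREADS"] = "4"        # Intel MKL

import numpy
import spacy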

3) Use nlp.pipe() to process batches of data. This makes the matrix multiplications bigger, making processing more efficient.
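
A minimal sketch, assuming `texts` is a list of page texts and with an arbitrary batch size:

# One nlp.pipe() call over a batch instead of one nlp() call per page
for doc in nlp.pipe(texts, batch_size=1000):
    if len(doc.ents) > 0:
        print("found entity")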

4) Your outer loop of "feed data through my processing pipeline" is probably embarrassingly parallel. So, parallelise it. Either use Python's multiprocessing, or something like joblib, or something like Spark, or just fire off 10 bash scripts in parallel. But take the outermost, highest level chunk of work you can, and run it as independently as possible.
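
A rough sketch of that with joblib (the `chunks` variable and chunking scheme are hypothetical; each worker loads its own copy of the model):

import spacy
from joblib import Parallel, delayed

def process_chunk(texts):
    # Each worker process loads its own copy of the pipeline
    nlp = spacy.load('test_model/', disable=['parser', 'tagger', 'textcat'])
    return [len(doc.ents) > 0 for doc in nlp.pipe(texts, batch_size=1000)]

# 'chunks' is a list of lists of page texts, split however makes sense
results = Parallel(n_jobs=8)(delayed(process_chunk)(c) for c in chunks)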

It's actually usually better to run multiple smaller VMs instead of one large VM. It's annoying operationally, but it means less resource sharing.

answered Nov 08 '22 by syllogism_