 

Speed up Spacy Named Entity Recognition

Tags: python, nlp, spacy

I'm using spacy to recognize street addresses on web pages.

My model is initialized basically using spacy's new entity type sample code found here: https://github.com/explosion/spaCy/blob/master/examples/training/train_new_entity_type.py

My training data consists of plain text webpages with their corresponding Street Address entities and character positions.

I was able to quickly build a model in spacy to start making predictions, but I found its prediction speed to be very slow.

My code works by iterating through several raw HTML pages and feeding each page's plain text version into spaCy as it iterates. For reasons I can't get into, I need to make predictions with spaCy page by page, inside the iteration loop.

After the model is loaded, I'm using the standard way of making predictions, which I'm referring to as the prediction/evaluation phase:

  doc = nlp(plain_text_webpage)

  if len(doc.ents) > 0:
      print("found entity")
Questions:

  1. How can I speed up the entity prediction/recognition phase? I'm using a c4.8xlarge instance on AWS and all 36 cores are constantly maxed out while spaCy is evaluating the data. spaCy turns processing a few million webpages from a 1-minute job into a 1-hour+ job.

  2. Will the speed of entity recognition improve as my model becomes more accurate?

  3. Is there a way to remove pipeline components like the tagger during this phase? Can NER be decoupled like that and still be accurate? Will removing other components affect the model itself, or is it just temporary?

  4. I saw that a GPU can be used during the NER training phase; can it also be used in this evaluation phase in my code for faster predictions?


Update:

I managed to significantly cut down the processing time by:

  1. Using a custom tokenizer (I used the one in the docs)

  2. Disabling other pipelines that aren't for Named Entity Recognition

  3. Instead of feeding the whole body of text from each webpage into spacy, I'm only sending over a maximum of 5,000 characters

My updated code to load the model:

# Load the custom model with the non-NER pipeline components disabled
nlp = spacy.load('test_model/', disable=['parser', 'tagger', 'textcat'])
# Swap in the custom whitespace tokenizer from the docs
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
doc = nlp(text)
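
For reference, a minimal sketch of the whitespace tokenizer from the spaCy docs (the docs example, not necessarily my exact class) plus the 5,000-character cutoff; the `plain_text_webpage` name is just illustrative:

from spacy.tokens import Doc

class WhitespaceTokenizer(object):
    """Whitespace-only tokenizer, as in the spaCy docs custom-tokenizer example."""
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(' ')
        # Assume every token is followed by a single space
        spaces = [True] * len(words)
        return Doc(self.vocab, words=words, spaces=spaces)

# Only feed the first 5,000 characters of each page to spaCy
doc = nlp(plain_text_webpage[:5000])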

However, it is still too slow (20X slower than I need it to be).

Questions:

  1. Are there any other improvements I can make to speed up the Named Entity Recognition? Any fat I can cut from spacy?

  2. I'm still looking to see if a GPU-based solution would help - I saw that GPU use is supported during the Named Entity Recognition training phase; can it also be used in this evaluation phase in my code for faster predictions?

asked Apr 06 '18 by podcastguy




1 Answer

Please see here for details about speed troubleshooting: https://github.com/explosion/spaCy/issues/1508

The most important things:

1) Check which BLAS library numpy is linked against, and make sure it's compiled well for your machine. Using conda is helpful, as then you get Intel's MKL.
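
For example, a quick way to see what numpy is linked against (look for 'mkl' or 'openblas' in the output):

import numpy as np

# Prints the BLAS/LAPACK libraries numpy was built against
np.show_config()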

2)

c4.8xlarge instance on AWS and all 36 cores are constantly maxed out when spacy is evaluating the data.

That's probably bad. We can only really parallelise the matrix multiplications at the moment, because we're using numpy --- so there's no way to thread larger chunks. This means the BLAS library is probably launching too many threads. In general you can only profitably use 3-4 cores per process. Try setting the environment variables for your BLAS library to restrict the number of threads.
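
For example, in Python you can cap the BLAS thread pools before numpy is imported; the value 4 here is just illustrative:

import os

# These must be set before numpy (and spaCy) are imported
os.environ["OMP_NUM_THREADS"] = "4"        # generic OpenMP
os.environ["OPENBLAS_NUM_THREADS"] = "4"   # OpenBLAS
os.environ["MKL_NUM_THREADS"] = "4"        # Intel MKL

import numpy
import spacy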

3) Use nlp.pipe() to process batches of data. This makes the matrix multiplications bigger, making processing more efficient.
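
A minimal sketch, assuming `texts` is a list of page texts and with an arbitrary batch size:

# One nlp.pipe() call over a batch instead of one nlp() call per page
for doc in nlp.pipe(texts, batch_size=1000):
    if len(doc.ents) > 0:
        print("found entity")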

4) Your outer loop of "feed data through my processing pipeline" is probably embarrassingly parallel. So, parallelise it. Either use Python's multiprocessing, or something like joblib, or something like Spark, or just fire off 10 bash scripts in parallel. But take the outermost, highest level chunk of work you can, and run it as independently as possible.
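
A rough sketch of that with joblib (the `chunks` variable and chunking scheme are hypothetical; each worker loads its own copy of the model):

import spacy
from joblib import Parallel, delayed

def process_chunk(texts):
    # Each worker process loads its own copy of the pipeline
    nlp = spacy.load('test_model/', disable=['parser', 'tagger', 'textcat'])
    return [len(doc.ents) > 0 for doc in nlp.pipe(texts, batch_size=1000)]

# 'chunks' is a list of lists of page texts, split however makes sense
results = Parallel(n_jobs=8)(delayed(process_chunk)(c) for c in chunks)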

It's actually usually better to run multiple smaller VMs instead of one large VM. It's annoying operationally, but it means less resource sharing.

answered Nov 08 '22 by syllogism_