
How to create NER pipeline with multiple models in Spacy

I am trying to train new entities for spaCy NER. I tried adding my new entity to the existing spaCy 'en' model. However, this degraded the predictions both for the original 'en' entity types and for my new entity.

I therefore created a blank model and trained entity recognition on it. This works well, but it can only predict the entities I trained it on, not the regular spaCy entity types.

Say I trained 'horses' as ANIMAL entity.

For a given text

txt ='Did you know that George bought those horses for 10000 dollars?'

I expect the following entities to be recognized:

George - PERSON
horses - ANIMAL
10000 dollars - MONEY.

With my current setup, it only recognizes 'horses'.

nlp = spacy.load('en')
hsnlp = spacy.load('models/spacy/animal/')
nlp.add_pipe(hsnlp.pipeline[-1][-1], 'hsner')

nlp.pipe_names

This gives:

['tagger', 'parser', 'ner', 'hsner']

However when I try to execute

doc = nlp(txt)  # <-- gives a kernel error and stops working

Please let me know how to create a pipeline for NER in spaCy effectively. I am using spaCy 2.0.18.

Suvin K S asked Feb 24 '19

People also ask

What model is used for spaCy NER?

spaCy has its own deep learning library, Thinc, which is used under the hood for its NLP models. For most (if not all) tasks, spaCy uses a deep neural network based on a CNN with a few tweaks.

How does spaCy perform NER?

The spaCy NER system combines a word embedding strategy using subword features and "Bloom" embeddings with a deep convolutional neural network with residual connections. The system is designed to give a good balance of efficiency, accuracy, and adaptability.

How do I add entities to spaCy?

EntityRuler() allows you to create your own entities to add to a spaCy pipeline. You start by creating an instance of EntityRuler() and passing it the current pipeline, nlp. You can then call add_patterns() on the instance and pass it a list of dictionaries describing the text patterns you'd like to label with an entity.


1 Answer

The main issue is how to load and combine pipeline components such that they are using the same Vocab (nlp.vocab), since a pipeline assumes that all components share the same vocab and otherwise you can get errors related to the StringStore.

You shouldn't try to combine pipeline components that were trained with different word vectors, but as long as the vectors are the same it's a question of how to load components from separate models with the same vocab.

There's no way to do this with spacy.load(), so I think the simplest option is to initialize a new pipeline component with the required vocab and reload the existing component into the new component by temporarily serializing it.

To have a short working demo with easily accessible models, I'll show how to add the German NER model from de_core_news_sm to the English model en_core_web_sm even though it's not something you'd typically want to do:

import spacy # tested with v2.2.3
from spacy.pipeline import EntityRecognizer

text = "Jane lives in Boston. Jan lives in Bremen."

# load the English and German models
nlp_en = spacy.load('en_core_web_sm')  # NER tags PERSON, GPE, ...
nlp_de = spacy.load('de_core_news_sm') # NER tags PER, LOC, ...

# the Vocab objects are not the same
assert nlp_en.vocab != nlp_de.vocab

# but the vectors are identical (because neither model has vectors)
assert nlp_en.vocab.vectors.to_bytes() == nlp_de.vocab.vectors.to_bytes()

# original English output
doc1 = nlp_en(text)
print([(ent.text, ent.label_) for ent in doc1.ents])
# [('Jane', 'PERSON'), ('Boston', 'GPE'), ('Bremen', 'GPE')]

# original German output (the German model makes weird predictions for English text)
doc2 = nlp_de(text)
print([(ent.text, ent.label_) for ent in doc2.ents])
# [('Jane lives', 'PER'), ('Boston', 'LOC'), ('Jan lives', 'PER'), ('Bremen', 'LOC')]

# initialize a new NER component with the vocab from the English pipeline
ner_de = EntityRecognizer(nlp_en.vocab)

# reload the NER component from the German model by serializing
# without the vocab and deserializing using the new NER component
ner_de.from_bytes(nlp_de.get_pipe("ner").to_bytes(exclude=["vocab"]))

# add the German NER component to the end of the English pipeline
nlp_en.add_pipe(ner_de, name="ner_de")

# check that they have the same vocab
assert nlp_en.vocab == ner_de.vocab

# combined output (English NER runs first, German second)
doc3 = nlp_en(text)
print([(ent.text, ent.label_) for ent in doc3.ents])
# [('Jane', 'PERSON'), ('Boston', 'GPE'), ('Jan lives', 'PER'), ('Bremen', 'GPE')]

spaCy's NER components (EntityRuler and EntityRecognizer) are designed to preserve any existing entities, so the new component only adds Jan lives with the German NER tag PER and leaves all other entities as predicted by the English NER.

You can use options for add_pipe() to determine where the component is inserted in the pipeline. To add the German NER before the default English NER:

nlp_en.add_pipe(ner_de, name="ner_de", before="ner")
# [('Jane lives', 'PER'), ('Boston', 'LOC'), ('Jan lives', 'PER'), ('Bremen', 'LOC')]

All the add_pipe() options are in the docs: https://spacy.io/api/language#add_pipe

You can save the extended pipeline to disk as a single model so you can load it in one line with spacy.load() the next time:

nlp_en.to_disk("/path/to/model")
nlp_reloaded = spacy.load("/path/to/model")
print(nlp_reloaded.pipe_names) # ['tagger', 'parser', 'ner', 'ner_de']
aab answered Oct 11 '22