I am trying to train new entities for spaCy NER. I tried adding my new entity to the existing spaCy 'en' model. However, this affected the predictions for both the existing 'en' entities and my new entity.
I therefore created a blank model and trained entity recognition on it. This works well, but it can only predict the entities I trained it on, not the regular spaCy entities.
Say I trained 'horses' as an ANIMAL entity.
For a given text
txt ='Did you know that George bought those horses for 10000 dollars?'
I am expecting the following entities to be recognized:
George - PERSON
horses - ANIMAL
10000 dollars - MONEY.
With my current setup, it only recognized horses.
nlp = spacy.load('en')
hsnlp = spacy.load('models/spacy/animal/')
nlp.add_pipe(hsnlp.pipeline[-1][-1], 'hsner')
nlp.pipe_names
this gives
----------------------
['tagger', 'parser', 'ner', 'hsner']
----------------------
However, when I try to execute
doc = nlp(txt)  # <-- gives me a kernel error and stops working
Please let me know how to create a pipeline for NER in spaCy effectively. I am using spaCy 2.0.18.
spaCy has its own deep learning library, Thinc, which it uses under the hood for its NLP models. For most (if not all) tasks, spaCy uses a deep neural network based on CNNs with a few tweaks.
The spaCy NER system uses a word embedding strategy based on subword features and "Bloom" embeddings, and a deep convolutional neural network with residual connections. The system is designed to give a good balance of efficiency, accuracy and adaptability.
EntityRuler lets you add your own rule-based entities to a spaCy pipeline. In spaCy v2 you start by creating an instance of EntityRuler and passing it the current pipeline, nlp. You can then call add_patterns() on the instance and pass it a list of dictionaries, each giving a label and the text or token pattern you'd like to tag with that label.
The main issue is how to load and combine pipeline components such that they are using the same Vocab (nlp.vocab), since a pipeline assumes that all components share the same vocab and otherwise you can get errors related to the StringStore.
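To see why a shared vocab matters, here is a toy illustration in plain Python. It is not spaCy's implementation: ToyStringStore and toy_hash are hypothetical stand-ins for the general idea that a component keeps labels as integer hashes and needs a store that has seen the string to turn a hash back into text.

```python
import hashlib


def toy_hash(text):
    """Deterministic 64-bit hash of a string (a stand-in, not spaCy's hash)."""
    digest = hashlib.sha256(text.encode("utf8")).digest()
    return int.from_bytes(digest[:8], "big")


class ToyStringStore:
    """Hypothetical, simplified string store: maps hashes back to strings."""

    def __init__(self):
        self._strings = {}

    def add(self, text):
        key = toy_hash(text)
        self._strings[key] = text
        return key

    def __getitem__(self, key):
        # Raises KeyError if this store has never seen the string.
        return self._strings[key]


# Two pipelines loaded separately each get their own store.
store_a = ToyStringStore()
store_b = ToyStringStore()

# A component created with store_b registers its label there only.
animal_id = store_b.add("ANIMAL")

# Looking the label up through the other pipeline's store fails ...
try:
    store_a[animal_id]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# ... while the store the component was created with resolves it.
resolved = store_b[animal_id]
print(lookup_failed, resolved)  # True ANIMAL
```

In this toy setup the fix is the same as in the real case: build the second component against the first pipeline's store, so both resolve hashes through one shared object.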
You shouldn't try to combine pipeline components that were trained with different word vectors, but as long as the vectors are the same it's a question of how to load components from separate models with the same vocab.
There's no way to do this with spacy.load(), so I think the simplest option is to initialize a new pipeline component with the required vocab and reload the existing component into the new component by temporarily serializing it.
To have a short working demo with easily accessible models, I'll show how to add the German NER model from de_core_news_sm to the English model en_core_web_sm even though it's not something you'd typically want to do:
import spacy # tested with v2.2.3
from spacy.pipeline import EntityRecognizer
text = "Jane lives in Boston. Jan lives in Bremen."
# load the English and German models
nlp_en = spacy.load('en_core_web_sm') # NER tags PERSON, GPE, ...
nlp_de = spacy.load('de_core_news_sm') # NER tags PER, LOC, ...
# the Vocab objects are not the same
assert nlp_en.vocab != nlp_de.vocab
# but the vectors are identical (because neither model has vectors)
assert nlp_en.vocab.vectors.to_bytes() == nlp_de.vocab.vectors.to_bytes()
# original English output
doc1 = nlp_en(text)
print([(ent.text, ent.label_) for ent in doc1.ents])
# [('Jane', 'PERSON'), ('Boston', 'GPE'), ('Bremen', 'GPE')]
# original German output (the German model makes weird predictions for English text)
doc2 = nlp_de(text)
print([(ent.text, ent.label_) for ent in doc2.ents])
# [('Jane lives', 'PER'), ('Boston', 'LOC'), ('Jan lives', 'PER'), ('Bremen', 'LOC')]
# initialize a new NER component with the vocab from the English pipeline
ner_de = EntityRecognizer(nlp_en.vocab)
# reload the NER component from the German model by serializing
# without the vocab and deserializing using the new NER component
ner_de.from_bytes(nlp_de.get_pipe("ner").to_bytes(exclude=["vocab"]))
# add the German NER component to the end of the English pipeline
nlp_en.add_pipe(ner_de, name="ner_de")
# check that they have the same vocab
assert nlp_en.vocab == ner_de.vocab
# combined output (English NER runs first, German second)
doc3 = nlp_en(text)
print([(ent.text, ent.label_) for ent in doc3.ents])
# [('Jane', 'PERSON'), ('Boston', 'GPE'), ('Jan lives', 'PER'), ('Bremen', 'GPE')]
spaCy's NER components (EntityRuler and EntityRecognizer) are designed to preserve any existing entities, so the new component only adds Jan lives with the German NER tag PER and leaves all other entities as predicted by the English NER.
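The effect on this example can be mimicked in plain Python. This is only a sketch of the end result, not how spaCy does it internally: the real EntityRecognizer takes existing entities into account while predicting rather than filtering afterwards, so its predictions next to existing entities may differ. The function add_new_entities and the hand-written token list are assumptions made for illustration, and the spans are copied from the outputs shown above.

```python
def add_new_entities(existing, predicted):
    """Keep every existing span; add predicted spans that overlap none of them.

    Spans are (start_token, end_token, label) tuples with an exclusive end.
    """
    merged = list(existing)
    for start, end, label in predicted:
        overlaps = any(
            start < e_end and e_start < end for e_start, e_end, _ in merged
        )
        if not overlaps:
            merged.append((start, end, label))
    return sorted(merged)


# Hand-written tokenization of the demo sentence.
tokens = ["Jane", "lives", "in", "Boston", ".", "Jan", "lives", "in", "Bremen", "."]

# Entities from the English model (already set when the second NER runs).
english = [(0, 1, "PERSON"), (3, 4, "GPE"), (8, 9, "GPE")]
# Entities the German model predicts on its own.
german = [(0, 2, "PER"), (3, 4, "LOC"), (5, 7, "PER"), (8, 9, "LOC")]

combined = add_new_entities(english, german)
print([(" ".join(tokens[s:e]), label) for s, e, label in combined])
# [('Jane', 'PERSON'), ('Boston', 'GPE'), ('Jan lives', 'PER'), ('Bremen', 'GPE')]
```

Swapping the two arguments gives the German spans priority, which corresponds to running the German NER first.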
You can use options for add_pipe() to determine where the component is inserted in the pipeline. To add the German NER before the default English NER:
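# instead of appending ner_de at the end, insert it before the English NER
nlp_en.add_pipe(ner_de, name="ner_de", before="ner")
# combined output (German NER runs first, English second)
# [('Jane lives', 'PER'), ('Boston', 'LOC'), ('Jan lives', 'PER'), ('Bremen', 'LOC')]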
All the add_pipe() options are in the docs: https://spacy.io/api/language#add_pipe
You can save the extended pipeline to disk as a single model so you can load it in one line with spacy.load() the next time:
nlp_en.to_disk("/path/to/model")
nlp_reloaded = spacy.load("/path/to/model")
print(nlp_reloaded.pipe_names) # ['tagger', 'parser', 'ner', 'ner_de']