The following link shows how to add custom entity rules where the entities span more than one token. The code to do that is below:
import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load('en_core_web_sm', parse=True, tag=True, entity=True)
animal = ["cat", "dog", "artic fox"]
ruler = EntityRuler(nlp)
for a in animal:
    ruler.add_patterns([{"label": "animal", "pattern": a}])
nlp.add_pipe(ruler)

doc = nlp("There is no cat in the house and no artic fox in the basement")
# Merge each multi-token entity into a single token
with doc.retokenize() as retokenizer:
    for ent in doc.ents:
        retokenizer.merge(doc[ent.start:ent.end])
I tried to add another custom entity ruler as follows:
flower = ["rose", "tulip", "african daisy"]
ruler = EntityRuler(nlp)
for f in flower:
    ruler.add_patterns([{"label": "flower", "pattern": f}])
nlp.add_pipe(ruler)
but I got this error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-47-702f460a866f> in <module>()
4 for f in flower:
5 ruler.add_patterns([{"label": "flower", "pattern": f}])
----> 6 nlp.add_pipe(ruler)
7
~\AppData\Local\Continuum\anaconda3\lib\site-packages\spacy\language.py in add_pipe(self, component, name, before, after, first, last)
296 name = repr(component)
297 if name in self.pipe_names:
--> 298 raise ValueError(Errors.E007.format(name=name, opts=self.pipe_names))
299 if sum([bool(before), bool(after), bool(first), bool(last)]) >= 2:
300 raise ValueError(Errors.E006)
ValueError: [E007] 'entity_ruler' already exists in pipeline. Existing names: ['tagger', 'parser', 'ner', 'entity_ruler']
My questions are:
1) How can I add another custom entity ruler?
2) Is it a best practice to use capital letters for the label? For example, instead of ruler.add_patterns([{"label": "animal", "pattern": a}]), should one use ruler.add_patterns([{"label": "ANIMAL", "pattern": a}])?
You can add another custom entity ruler to your pipeline by giving it a different name (to avoid a name collision). Here is some code to illustrate, but please read the remark below:
import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load('en_core_web_sm', disable=['ner'])

rulerPlants = EntityRuler(nlp, overwrite_ents=True)
flowers = ["rose", "tulip", "african daisy"]
for f in flowers:
    rulerPlants.add_patterns([{"label": "flower", "pattern": f}])

animals = ["cat", "dog", "artic fox"]
rulerAnimals = EntityRuler(nlp, overwrite_ents=True)
for a in animals:
    rulerAnimals.add_patterns([{"label": "animal", "pattern": a}])

# Give each ruler a unique name so add_pipe does not raise E007
rulerPlants.name = 'rulerPlants'
rulerAnimals.name = 'rulerAnimals'
nlp.add_pipe(rulerPlants)
nlp.add_pipe(rulerAnimals)

doc = nlp("cat and artic fox, plant african daisy")
for ent in doc.ents:
    print(ent.text, '->', ent.label_)
# output:
# cat -> animal
# artic fox -> animal
# african daisy -> flower
We can verify that the pipeline does contain both entity rulers:
print(nlp.pipe_names)
# ['tagger', 'parser', 'rulerPlants', 'rulerAnimals']
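The code above uses the spaCy v2 API, where a component object is constructed and then passed to nlp.add_pipe. As a minimal sketch, assuming spaCy v3 is installed (where add_pipe takes the registered factory name plus an optional name argument and returns the component), the same two-ruler setup might look like the following. The blank English pipeline and the names ruler_plants and ruler_animals are illustrative choices, not part of the original answer.

```python
import spacy

# Assumption: spaCy v3. A blank pipeline is used so that no model download is needed.
nlp = spacy.blank("en")

# Each ruler gets its own name, so the two components do not collide.
ruler_plants = nlp.add_pipe("entity_ruler", name="ruler_plants")
ruler_plants.add_patterns(
    [{"label": "FLOWER", "pattern": f} for f in ["rose", "tulip", "african daisy"]]
)

ruler_animals = nlp.add_pipe("entity_ruler", name="ruler_animals")
ruler_animals.add_patterns(
    [{"label": "ANIMAL", "pattern": a} for a in ["cat", "dog", "artic fox"]]
)

doc = nlp("cat and artic fox, plant african daisy")
found = [(ent.text, ent.label_) for ent in doc.ents]
print(found)
print(nlp.pipe_names)
```

With the default overwrite_ents=False, the second ruler only adds entities that do not overlap with those already set by the first, which is enough here because the two pattern lists never match the same tokens.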
Remark: I would suggest using the simpler and more natural approach of making a new entity ruler which contains the rules of both entity rulers:
rulerAll = EntityRuler(nlp)
rulerAll.add_patterns(rulerAnimals.patterns)
rulerAll.add_patterns(rulerPlants.patterns)
Finally, concerning your question about best practices for entity labels: it is common practice to use abbreviations written in capital letters (see the spaCy NER documentation), for example ORG, LOC, PERSON, etc.
Edits following questions:
1) If you do not need spaCy's default Named Entity Recognition (NER), then I would suggest disabling it, as that will speed up computations and avoid interference (see discussion about this here). Disabling NER will not cause unexpected downstream results (your document just won't be tagged for the default entities LOC, ORG, PERSON, etc.).
2) There is an idea in programming that "Simple is better than complex." (see here). There can be some subjectivity as to what constitutes a simpler solution. I would think that a processing pipeline with fewer components is simpler (i.e. the pipeline containing both entity rulers would seem more complex to me). However, depending on your needs in terms of profiling, adjustability, etc., it might be simpler for you to have several different entity rulers, as described in the first part of this solution. It would be nice to get the authors of spaCy to give their view on these two design choices.
3) Naturally, the single entity ruler above can be directly created as follows:
rulerAll = EntityRuler(nlp, overwrite_ents=True)
for f in flowers:
    rulerAll.add_patterns([{"label": "flower", "pattern": f}])
for a in animals:
    rulerAll.add_patterns([{"label": "animal", "pattern": a}])
The earlier code for constructing rulerAll is meant to illustrate how we can query an entity ruler for the list of patterns that have been added to it. In practice we would construct rulerAll directly, without first constructing rulerPlants and rulerAnimals, unless we wanted to test and profile these two rulers individually.
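For comparison, here is a rough sketch of the single-ruler design that also shows how the stored patterns can be read back. It assumes spaCy v3 and a blank English pipeline; the labelled_terms dictionary is a hypothetical helper introduced only for this example.

```python
import spacy

# Assumption: spaCy v3, blank pipeline (no trained model required).
nlp = spacy.blank("en")
ruler_all = nlp.add_pipe("entity_ruler")

# Hypothetical mapping from label to the phrases that should receive it.
labelled_terms = {
    "FLOWER": ["rose", "tulip", "african daisy"],
    "ANIMAL": ["cat", "dog", "artic fox"],
}

# Build every pattern up front and register them in one call.
patterns = [
    {"label": label, "pattern": term}
    for label, terms in labelled_terms.items()
    for term in terms
]
ruler_all.add_patterns(patterns)

# The ruler can be queried for the patterns it currently holds.
stored_labels = {p["label"] for p in ruler_all.patterns}
print(len(ruler_all.patterns), stored_labels)

doc = nlp("cat and artic fox, plant african daisy")
found = [(ent.text, ent.label_) for ent in doc.ents]
print(found)
```

Passing all patterns in a single add_patterns call keeps the pipeline to one component, which matches the "fewer components" preference described in point 2.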