I tried to find an FRT entity with EntityRuler like this:
from spacy.lang.en import English
from spacy.pipeline import EntityRuler
nlp = English()
ruler = EntityRuler(nlp)
patterns = [{"label": "FRT", "pattern": [{'REGEX': "[Aa]ppl[e|es])"}]},
{"label": "BRN", "pattern": [{"LOWER": "granny"}, {"LOWER": "smith"}]}]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler)
doc = nlp(u"Apple is red. Granny Smith apples are green.")
print([(ent.text, ent.label_) for ent in doc.ents])
I then got this outcome
[('Apple', 'FRT'), ('is', 'FRT'), ('red', 'FRT'), ('.', 'FRT'), ('Granny Smith', 'BRN'), ('apples', 'FRT'), ('is', 'FRT'), ('green', 'FRT'), ('.', 'FRT')]
Could you please show me how to fix my code so that I will get this result
[('Apple', 'FRT'), ('Granny Smith', 'BRN'), ('apples', 'FRT')]
Thank you in advance.
The entity ruler lets you add spans to Doc.ents using token-based rules or exact phrase matches. It can be combined with the statistical EntityRecognizer to boost accuracy, or used on its own to implement a purely rule-based entity recognition system.
The Python library spaCy offers a few different methods for performing rule-based NER. One such method is its EntityRuler, a spaCy factory that lets you create a set of patterns with corresponding labels.
You can fix the code by using this patterns declaration:
patterns = [{"label": "FRT", "pattern": [{"TEXT" : {"REGEX": "[Aa]pples?"}}]},
{"label": "BRN", "pattern": [{"LOWER": "granny"}, {"LOWER": "smith"}]}]
There are two problems: 1) the REGEX operator does not work unless you nest it under a top-level token attribute such as TEXT, LOWER, etc., and 2) the regex itself is wrong, because it uses a character class where a grouping construct is needed.
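As a quick check for the first point, a small helper can flag token patterns that use REGEX as a top-level key. This is only a sketch: find_bare_regex is a hypothetical name, not part of spaCy, and it only inspects plain dictionaries shaped like the patterns in this thread.

```python
def find_bare_regex(patterns):
    """Return (pattern_index, token_index) pairs where REGEX is a top-level key.

    Hypothetical helper, not a spaCy API: it only inspects plain dicts.
    """
    problems = []
    for p_idx, entry in enumerate(patterns):
        token_specs = entry["pattern"]
        # Phrase patterns are plain strings; only token patterns are lists of dicts.
        if isinstance(token_specs, str):
            continue
        for t_idx, spec in enumerate(token_specs):
            if "REGEX" in spec:
                problems.append((p_idx, t_idx))
    return problems


broken = [
    {"label": "FRT", "pattern": [{"REGEX": "[Aa]pples?"}]},
    {"label": "BRN", "pattern": [{"LOWER": "granny"}, {"LOWER": "smith"}]},
]
fixed = [
    {"label": "FRT", "pattern": [{"TEXT": {"REGEX": "[Aa]pples?"}}]},
    {"label": "BRN", "pattern": [{"LOWER": "granny"}, {"LOWER": "smith"}]},
]

print(find_bare_regex(broken))  # [(0, 0)]
print(find_bare_regex(fixed))   # []
```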
Note that [e|es], being a regex character class, matches a single e, s or |. So, if you have the string Appl| is red., the result will contain [('Appl|', 'FRT')]. You need to use either a non-capturing group, (?:es|e), or just es?, which matches an e and then an optional s.
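The difference between the character class and the two alternatives can be checked with Python's re module alone, without spaCy. The word list below is made up for illustration, and re.search is used on the assumption that it mirrors how the regex is applied to each token.

```python
import re

candidates = {
    "character class": r"[Aa]ppl[e|es]",  # [e|es] is a set: one of 'e', '|', 's'
    "optional s": r"[Aa]pples?",          # 'e' followed by an optional 's'
    "group": r"[Aa]ppl(?:es|e)",          # alternation inside a non-capturing group
}
words = ["Apple", "apples", "Appl|", "Appls"]

# For each regex, collect the words it matches anywhere in the string.
results = {
    name: [w for w in words if re.search(rx, w)]
    for name, rx in candidates.items()
}
for name, matched in results.items():
    print(f"{name}: {matched}")
```

Only the character class version accepts Appl| and Appls; the other two accept just Apple and apples.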
Also, compare these scenarios:
[{"TEXT" : {"REGEX": "[Aa]pples?"}}] will find Apple, apple, Apples, apples, but will not find APPLES.
[{"LOWER" : {"REGEX": "apples?"}}] will find Apple, apple, Apples, apples, APPLES, aPPleS, etc., and also stapples (a misspelling of staples).
[{"TEXT" : {"REGEX": r"\b[Aa]pples?\b"}}] will find Apple, apple, Apples, apples, but will not find APPLES or stapples, since \b matches at word boundaries.

You missed the top-level token attribute that your regex is supposed to match against. Because the top-level token attribute is missing, the REGEX key is ignored and the pattern is interpreted as "any token".
from spacy.lang.en import English
from spacy.pipeline import EntityRuler
nlp = English()
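# spaCy v2 API; in spaCy v3, replace this line and nlp.add_pipe(ruler) with ruler = nlp.add_pipe("entity_ruler")
ruler = EntityRuler(nlp)
# REGEX is nested under the TEXT attribute; es? replaces the character class [e|es]
patterns = [{"label": "FRT", "pattern": [{'TEXT' : {'REGEX': "[Aa]pples?"}}]},
{"label": "BRN", "pattern": [{"LOWER": "granny"}, {"LOWER": "smith"}]}]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler)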
doc = nlp(u"Apple is red. Granny Smith apples are green.")
print([(ent.text, ent.label_) for ent in doc.ents])
Output
[('Apple', 'FRT'), ('Granny Smith', 'BRN'), ('apples', 'FRT')]
In fact, you can also use the pattern below for apple:
{"label": "FRT", "pattern": [{'LOWER' : {'REGEX': "apples?"}}]}
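The TEXT, LOWER and word-boundary variants discussed in this thread can be compared with a rough emulation in plain Python. This is a sketch under an assumption: regex_hits is a hypothetical helper that applies the regex with re.search to the token text (TEXT) or its lowercased form (LOWER), which is consistent with the stapples behaviour described above but is not spaCy's own code. The token list is made up.

```python
import re


def regex_hits(attr, pattern, tokens):
    """Emulate a {attr: {"REGEX": pattern}} token pattern on raw token strings.

    Assumption: the regex is searched (not full-matched) against the token
    text for TEXT, or against the lowercased text for LOWER.
    """
    rx = re.compile(pattern)
    hits = []
    for tok in tokens:
        value = tok.lower() if attr == "LOWER" else tok
        if rx.search(value):
            hits.append(tok)
    return hits


tokens = ["Apple", "apples", "APPLES", "aPPleS", "stapples", "staples"]

text_hits = regex_hits("TEXT", r"[Aa]pples?", tokens)
lower_hits = regex_hits("LOWER", r"apples?", tokens)
bounded_hits = regex_hits("TEXT", r"\b[Aa]pples?\b", tokens)

print(text_hits)     # ['Apple', 'apples', 'stapples']
print(lower_hits)    # ['Apple', 'apples', 'APPLES', 'aPPleS', 'stapples']
print(bounded_hits)  # ['Apple', 'apples']
```

Under this emulation, only the \b version rejects stapples, because a search without boundaries can match in the middle of a token.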