I tried to find FRT entities with EntityRuler like this:
from spacy.lang.en import English
from spacy.pipeline import EntityRuler
nlp = English()
ruler = EntityRuler(nlp)
patterns = [{"label": "FRT", "pattern": [{'REGEX': "[Aa]ppl[e|es])"}]},
            {"label": "BRN", "pattern": [{"LOWER": "granny"}, {"LOWER": "smith"}]}]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler)
doc = nlp(u"Apple is red. Granny Smith apples are green.")
print([(ent.text, ent.label_) for ent in doc.ents])
I then got this outcome:
[('Apple', 'FRT'), ('is', 'FRT'), ('red', 'FRT'), ('.', 'FRT'), ('Granny Smith', 'BRN'), ('apples', 'FRT'), ('is', 'FRT'), ('green', 'FRT'), ('.', 'FRT')]
Could you please show me how to fix my code so that I get this result instead?
[('Apple', 'FRT'), ('Granny Smith', 'BRN'), ('apples', 'FRT')]
Thank you in advance.
The entity ruler lets you add spans to Doc.ents using token-based rules or exact phrase matches. It can be combined with the statistical EntityRecognizer to boost accuracy, or used on its own to implement a purely rule-based entity recognition system.
The Python library spaCy offers a few different methods for performing rules-based NER. One such method is via its EntityRuler. The EntityRuler is a spaCy factory that allows one to create a set of patterns with corresponding labels.
You can fix the code by using this patterns declaration:
patterns = [{"label": "FRT", "pattern": [{"TEXT" : {"REGEX": "[Aa]pples?"}}]},
            {"label": "BRN", "pattern": [{"LOWER": "granny"}, {"LOWER": "smith"}]}]
There are two problems: 1) the REGEX operator does not work unless you nest it under a top-level token attribute such as TEXT or LOWER, and 2) the regex itself is wrong, because it uses a character class where a grouping construct is needed.
Note that [e|es], being a regex character class, matches a single character: e, s or |. So, if you have an Appl| is red. string, the result will contain ('Appl|', 'FRT'). You need to use either a non-capturing group, (?:e|es), or just es?, which matches an e followed by an optional s.
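The difference between the character class and the two suggested fixes can be checked with Python's standard re module alone, without spaCy. The following is a minimal sketch: the sample words are made up for illustration, and re.fullmatch is used here only to show what each pattern accepts as a whole token (it is not a claim about how spaCy applies the pattern).

```python
import re

# [e|es] is a character class: it matches exactly ONE character, 'e', '|' or 's'.
char_class = re.compile(r"[Aa]ppl[e|es]")
# es? matches an 'e' followed by an optional 's'.
optional_s = re.compile(r"[Aa]pples?")
# (?:e|es) is a non-capturing group with two alternatives, 'e' or 'es'.
group = re.compile(r"[Aa]ppl(?:e|es)")

# Illustrative sample words (not from the original post, except "Appl|").
samples = ["Apple", "apples", "Appl|", "Appls"]

# For each word: (character class, optional s, group) whole-string match results.
results = {
    word: (
        bool(char_class.fullmatch(word)),
        bool(optional_s.fullmatch(word)),
        bool(group.fullmatch(word)),
    )
    for word in samples
}

for word, flags in results.items():
    print(word, flags)
```

The character class accepts Appl| and Appls as whole words but not apples, while the other two patterns accept exactly Apple and apples from this list.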
Also, cf. these scenarios:
[{"TEXT" : {"REGEX": "[Aa]pples?"}}] will find Apple, apple, Apples, apples, but will not find APPLES
[{"LOWER" : {"REGEX": "apples?"}}] will find Apple, apple, Apples, apples, APPLES, aPPleS, etc. and also stapples (a misspelling of staples)
[{"TEXT" : {"REGEX": r"\b[Aa]pples?\b"}}] will find Apple, apple, Apples, apples, but will not find APPLES, nor stapples, since \b is a word boundary.
You have omitted the top-level token attribute that your regex should be matched against. Because the top-level token attribute is missing, the REGEX key is ignored and the pattern is interpreted as "any token".
from spacy.lang.en import English
from spacy.pipeline import EntityRuler
nlp = English()
ruler = EntityRuler(nlp)
patterns = [{"label": "FRT", "pattern": [{'TEXT' : {'REGEX': "[Aa]ppl[e|es]"}}]},
            {"label": "BRN", "pattern": [{"LOWER": "granny"}, {"LOWER": "smith"}]}]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler)
doc = nlp(u"Apple is red. Granny Smith apples are green.")
print([(ent.text, ent.label_) for ent in doc.ents])
Output
[('Apple', 'FRT'), ('Granny Smith', 'BRN'), ('apples', 'FRT')]
In fact, you can also use the pattern below for apple:
{"label": "FRT", "pattern": [{'LOWER' : {'REGEX': "appl[e|es]"}}]}
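The TEXT, LOWER and \b scenarios discussed in the answers can be explored without spaCy. The sketch below rests on an assumption: that a REGEX predicate is applied to the chosen attribute's string with an unanchored search, so a match anywhere inside the token counts. The helper token_matches is hypothetical and only emulates that behaviour with re.search; it is not part of spaCy.

```python
import re

def token_matches(pattern, token_text, attr="TEXT"):
    """Hypothetical helper emulating a {attr: {"REGEX": pattern}} token check.

    Assumption: the regex is searched (unanchored) in the attribute value,
    where LOWER is the lowercased token text and TEXT is the text verbatim.
    """
    value = token_text.lower() if attr == "LOWER" else token_text
    return re.search(pattern, value) is not None

# Illustrative tokens, including the misspelling mentioned in the answer.
tokens = ["Apple", "apples", "APPLES", "stapples"]

# TEXT is case-sensitive, so APPLES is not matched.
text_hits = [t for t in tokens if token_matches(r"[Aa]pples?", t, "TEXT")]
# LOWER compares against the lowercased token, so every casing is matched.
lower_hits = [t for t in tokens if token_matches(r"apples?", t, "LOWER")]
# \b requires word boundaries, so the match cannot start inside "stapples".
bounded_hits = [t for t in tokens if token_matches(r"\b[Aa]pples?\b", t, "TEXT")]

print(text_hits)
print(lower_hits)
print(bounded_hits)
```

Under this assumption, an unanchored pattern also matches inside longer tokens such as stapples, which is why the \b variant is the safest of the three when only whole-word matches are wanted.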