Using RegEx for phrase pattern in EntityRuler

Tags: python, spacy

I tried to find FRT entities with EntityRuler like this:

from spacy.lang.en import English
from spacy.pipeline import EntityRuler

nlp = English()
ruler = EntityRuler(nlp)
patterns = [{"label": "FRT", "pattern": [{'REGEX': "[Aa]ppl[e|es])"}]},
            {"label": "BRN", "pattern": [{"LOWER": "granny"}, {"LOWER": "smith"}]}]

ruler.add_patterns(patterns)
nlp.add_pipe(ruler)

doc = nlp(u"Apple is red. Granny Smith apples are green.")
print([(ent.text, ent.label_) for ent in doc.ents])

I then got this output:

[('Apple', 'FRT'), ('is', 'FRT'), ('red', 'FRT'), ('.', 'FRT'), ('Granny Smith', 'BRN'), ('apples', 'FRT'), ('are', 'FRT'), ('green', 'FRT'), ('.', 'FRT')]

Could you please show me how to fix my code so that I get this result:

[('Apple', 'FRT'), ('Granny Smith', 'BRN'), ('apples', 'FRT')]

Thank you in advance.

Nemo asked Aug 27 '19 04:08


2 Answers

You can fix the code by using this patterns declaration:

patterns = [{"label": "FRT", "pattern": [{"TEXT" : {"REGEX": "[Aa]pples?"}}]},
            {"label": "BRN", "pattern": [{"LOWER": "granny"}, {"LOWER": "smith"}]}]

There are two issues: 1) the REGEX operator does not work unless it is nested under a top-level token attribute such as TEXT or LOWER, and 2) the regex itself is wrong, because it uses a character class where a grouping construct is needed.

Note that [e|es], being a regex character class, matches a single character: e, s or |. So, if you have an Appl| is red. string, the result will contain ('Appl|', 'FRT'). You need to either use a non-capturing group, (?:e|es), or just es?, which matches an e and then an optional s.
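The difference between the character class and the two suggested fixes can be checked outside spaCy with Python's re module. This is only a minimal sketch: it assumes the pattern is applied to each token's text with re.search (so a match anywhere inside the token counts), and the token strings and the matching_tokens helper are made up for illustration.

```python
import re

# Sample token texts (hypothetical, for illustration only).
tokens = ["Apple", "apples", "Appl|", "Appls", "is"]

# The original character class and the two suggested replacements.
candidates = {
    "char class [e|es]": r"[Aa]ppl[e|es]",
    "group (?:e|es)": r"[Aa]ppl(?:e|es)",
    "optional s": r"[Aa]pples?",
}

def matching_tokens(regex, token_texts):
    """Return the tokens in which `regex` is found anywhere (re.search)."""
    compiled = re.compile(regex)
    return [tok for tok in token_texts if compiled.search(tok)]

results = {name: matching_tokens(rx, tokens) for name, rx in candidates.items()}
for name, matched in results.items():
    print(f"{name}: {matched}")
```

Under these assumptions the character class also accepts Appl| and Appls, while the group and the es? variant accept only Apple and apples.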

Also, cf. these scenarios:

  • [{"TEXT" : {"REGEX": "[Aa]pples?"}}] will find Apple, apple, Apples, apples, but will not find APPLES
  • [{"LOWER" : {"REGEX": "apples?"}}] will find Apple, apple, Apples, apples, APPLES, aPPleS, etc. and also stapples (a misspelling of staples)
  • [{"TEXT" : {"REGEX": r"\b[Aa]pples?\b"}}] will find Apple, apple, Apples, apples, but will not find APPLES, nor stapples since \b are word boundaries.
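The three scenarios above can be approximated without spaCy. The sketch below is not spaCy's implementation: token_matches is a hypothetical helper that assumes a single-token check done with re.search against either the verbatim token text (TEXT) or its lowercased form (LOWER), and the token list is invented for the example.

```python
import re

def token_matches(token_text, attr, regex):
    """Mimic a single-token REGEX check: search `regex` in the token's
    verbatim text (attr="TEXT") or in its lowercased form (attr="LOWER")."""
    value = token_text.lower() if attr == "LOWER" else token_text
    return re.search(regex, value) is not None

# Hypothetical token texts covering the cases discussed above.
tokens = ["Apple", "apples", "APPLES", "aPPleS", "stapples"]

scenarios = [
    ("TEXT", r"[Aa]pples?"),
    ("LOWER", r"apples?"),
    ("TEXT", r"\b[Aa]pples?\b"),
]

# Map each (attribute, regex) scenario to the tokens it accepts.
found = {
    (attr, regex): [t for t in tokens if token_matches(t, attr, regex)]
    for attr, regex in scenarios
}
for (attr, regex), matched in found.items():
    print(attr, regex, matched)
```

With a search-style match, the first TEXT pattern also accepts stapples, because apples occurs inside it; only the \b version rejects it.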
Wiktor Stribiżew answered Oct 02 '22 15:10


You have left out the top-level token attribute that the regex should be matched against. Since the top-level token attribute is missing, the REGEX key is ignored and the pattern is interpreted as "any token".

Working code

from spacy.lang.en import English
from spacy.pipeline import EntityRuler

nlp = English()
ruler = EntityRuler(nlp)
patterns = [{"label": "FRT", "pattern": [{'TEXT' : {'REGEX': "[Aa]ppl[e|es]"}}]},
            {"label": "BRN", "pattern": [{"LOWER": "granny"}, {"LOWER": "smith"}]}]

ruler.add_patterns(patterns)
nlp.add_pipe(ruler)

doc = nlp(u"Apple is red. Granny Smith apples are green.")
print([(ent.text, ent.label_) for ent in doc.ents])

Output

[('Apple', 'FRT'), ('Granny Smith', 'BRN'), ('apples', 'FRT')]

In fact, you can also use the pattern below for apple:

{"label": "FRT", "pattern": [{'LOWER' : {'REGEX': "appl[e|es]"}}]}

mujjiga answered Oct 02 '22 16:10