Using RegEx for phrase pattern in EntityRuler

Tags:

spacy

I tried to find FRT entities with EntityRuler like this:

from spacy.lang.en import English
from spacy.pipeline import EntityRuler

nlp = English()
ruler = EntityRuler(nlp)
patterns = [{"label": "FRT", "pattern": [{'REGEX': "[Aa]ppl[e|es])"}]},
            {"label": "BRN", "pattern": [{"LOWER": "granny"}, {"LOWER": "smith"}]}]

ruler.add_patterns(patterns)
nlp.add_pipe(ruler)

doc = nlp(u"Apple is red. Granny Smith apples are green.")
print([(ent.text, ent.label_) for ent in doc.ents])

I then got this output:

[('Apple', 'FRT'), ('is', 'FRT'), ('red', 'FRT'), ('.', 'FRT'), ('Granny Smith', 'BRN'), ('apples', 'FRT'), ('are', 'FRT'), ('green', 'FRT'), ('.', 'FRT')]

Could you please show me how to fix my code so that I get this result:

[('Apple', 'FRT'), ('Granny Smith', 'BRN'), ('apples', 'FRT')]

Thank you in advance.

asked Aug 27 '19 04:08

2 Answers

You can fix the code by replacing your patterns declaration with this one:

patterns = [{"label": "FRT", "pattern": [{"TEXT" : {"REGEX": "[Aa]pples?"}}]},
            {"label": "BRN", "pattern": [{"LOWER": "granny"}, {"LOWER": "smith"}]}]

There are two problems: 1) the REGEX operator does nothing unless it is nested under a token attribute such as TEXT or LOWER, and 2) the regex itself is wrong, because it uses a character class where a grouping construct is needed.

Note that [e|es], being a regex character class, matches a single e, s or |. So, given the string Appl| is red., the result would contain ('Appl|', 'FRT'). Use either a non-capturing group, (?:e|es), or simply es?, which matches an e followed by an optional s.
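To make the difference concrete, here is a small sketch that uses only Python's standard re module, not spaCy itself. The sample strings and the use of fullmatch (so that the whole string has to match) are choices made for this illustration:

```python
import re

# Character class: exactly ONE character out of e, | or s after "ppl".
char_class = re.compile(r"[Aa]ppl[e|es]")
# "e" followed by an optional "s".
optional_s = re.compile(r"[Aa]pples?")
# Non-capturing group with two alternatives.
group = re.compile(r"[Aa]ppl(?:e|es)")

samples = ["Apple", "apples", "Appl|", "Appls"]

# For each sample, record whether each pattern matches the whole string.
results = {
    text: (
        bool(char_class.fullmatch(text)),
        bool(optional_s.fullmatch(text)),
        bool(group.fullmatch(text)),
    )
    for text in samples
}

for text, flags in results.items():
    print(text, flags)
```

The character class accepts Appl| and Appls but rejects apples as a whole string, while the other two patterns accept only Apple and apples.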

Also, cf. these scenarios:

[{"TEXT" : {"REGEX": "[Aa]pples?"}}] will find Apple, apple, Apples, apples, but will not find APPLES
[{"LOWER" : {"REGEX": "apples?"}}] will find Apple, apple, Apples, apples, APPLES, aPPleS, etc. and also stapples (a misspelling of staples)
[{"TEXT" : {"REGEX": r"\b[Aa]pples?\b"}}] will find Apple, apple, Apples, apples, but will not find APPLES, nor stapples since \b are word boundaries.
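These scenarios can be checked roughly with plain Python. The sketch below assumes that the REGEX operator behaves like an unanchored re.search on the chosen token attribute, and it mimics LOWER by lowercasing the token text; the hits helper and the token list are made up for this illustration and are not part of spaCy:

```python
import re

tokens = ["Apple", "apples", "APPLES", "aPPleS", "stapples"]

def hits(pattern, use_lower=False):
    """Return the tokens in which `pattern` is found anywhere.

    use_lower=True mimics matching on the LOWER attribute by
    lowercasing the token text before searching.
    """
    rx = re.compile(pattern)
    return [t for t in tokens if rx.search(t.lower() if use_lower else t)]

text_hits = hits(r"[Aa]pples?")            # TEXT-style, case-sensitive
lower_hits = hits(r"apples?", use_lower=True)  # LOWER-style
bounded_hits = hits(r"\b[Aa]pples?\b")     # TEXT-style with word boundaries

print(text_hits)
print(lower_hits)
print(bounded_hits)
```

Under that assumption, the unanchored TEXT pattern also picks up stapples, since apples occurs inside it; only the \b version restricts the match to the whole token.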

answered Oct 02 '22 15:10

Wiktor Stribiżew

You have omitted the top-level token attribute that your regex should be matched against. Because the attribute is missing, the REGEX key is ignored and the pattern is interpreted as "any token".

Working code

from spacy.lang.en import English
from spacy.pipeline import EntityRuler

nlp = English()
ruler = EntityRuler(nlp)
patterns = [{"label": "FRT", "pattern": [{'TEXT' : {'REGEX': "[Aa]pples?"}}]},
            {"label": "BRN", "pattern": [{"LOWER": "granny"}, {"LOWER": "smith"}]}]

ruler.add_patterns(patterns)
nlp.add_pipe(ruler)

doc = nlp(u"Apple is red. Granny Smith apples are green.")
print([(ent.text, ent.label_) for ent in doc.ents])

Output

[('Apple', 'FRT'), ('Granny Smith', 'BRN'), ('apples', 'FRT')]

In fact, you can also use the pattern below for apple, which matches against the lowercased token text:

{"label": "FRT", "pattern": [{'LOWER' : {'REGEX': "apples?"}}]}
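The code in this thread uses the spaCy 2.x style of building the EntityRuler and passing the object to nlp.add_pipe. As far as I know, spaCy 3.x expects the component's registered name instead. A minimal sketch of the same pipeline, assuming spaCy 3.x is installed:

```python
import spacy

# Blank English pipeline: tokenizer only, no statistical components.
nlp = spacy.blank("en")

# In spaCy 3.x the component is added by its registered name,
# and add_pipe returns the component instance.
ruler = nlp.add_pipe("entity_ruler")

patterns = [
    # REGEX nested under the TEXT attribute, so it is applied per token.
    {"label": "FRT", "pattern": [{"TEXT": {"REGEX": "[Aa]pples?"}}]},
    {"label": "BRN", "pattern": [{"LOWER": "granny"}, {"LOWER": "smith"}]},
]
ruler.add_patterns(patterns)

doc = nlp("Apple is red. Granny Smith apples are green.")
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```

The patterns are the same as in the answers above; only the way the ruler is created and added to the pipeline differs.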

answered Oct 02 '22 16:10

mujjiga

Nemo
