
In spaCy, is it possible to get the corresponding rule ID for a match in matches?

Tags: nlp, matcher, spacy

In spaCy 2.x, I use the Matcher to find specific tokens in my text corpus. Each rule has an ID ('class-1_0', for example). During parsing, I use the on_match callback to handle each match. Is there a way to retrieve the rule that produced the match directly in the callback?

Here is my sample code.

import spacy
from spacy.matcher import Matcher

txt = ("Aujourd'hui, je vais me faire une tartine au beurre "
       "de cacahuète, c'est un pilier de ma nourriture "
       "quotidienne.")

nlp = spacy.load('fr')

def on_match(matcher, doc, i, matches):
    # i is the index of the current match in matches
    span = doc[matches[i][1]:matches[i][2]]
    print(span)
    # find a way to get the corresponding rule without fuss

matcher = Matcher(nlp.vocab)
matcher.add('class-1_0', on_match, [{'LEMMA': 'pilier'}])
matcher.add('class-1_1', on_match, [{'LEMMA': 'beurre'}, {'LEMMA': 'de'}, {'LEMMA': 'cacahuète'}])

doc = nlp(txt)
matches = matcher(doc)

In this case, matches returns:

[(12071893341338447867, 9, 12), (4566231695725171773, 16, 17)]

12071893341338447867 is a unique ID based on the rule name (here class-1_1 rather than class-1_0, since the span 9:12 covers three tokens). I cannot recover the original rule name, even with some introspection of matcher._patterns.

It would be great if someone can help me. Thank you very much.

asked Nov 26 '17 07:11 by k3z


1 Answer

Yes – you can simply look up the ID in the StringStore of your vocabulary, available via nlp.vocab.strings or doc.vocab.strings. Going via the Doc is pretty convenient here, because you can do so within your on_match callback:

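def on_match(matcher, doc, i, matches):
    # the third argument is the index of the current match in matches
    match_id, start, end = matches[i]
    string_id = doc.vocab.strings[match_id]  # e.g. 'class-1_0'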

For efficiency, spaCy encodes all strings to integers and keeps a reference to the mapping in the StringStore lookup table. In spaCy v2.0, the integers are hash values, so they'll always match across models and vocabularies. For more details on this, see this section in the docs.
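To make the hashing behaviour a bit more concrete, here is a minimal sketch that uses a standalone StringStore instead of a loaded model. The two stores stand in for two separate vocabularies, and the fake_matches list is a hand-built placeholder in the (match_id, start, end) shape that the matcher returns; the token offsets are illustrative, not real matcher output.

```python
from spacy.strings import StringStore

# Two independent stores, standing in for two different vocabularies.
store_a = StringStore()
store_b = StringStore()

# add() interns the string and returns its integer hash.
hash_a = store_a.add("class-1_0")
hash_b = store_b.add("class-1_0")

# The hash depends only on the string, so both stores agree on it.
same_hash = hash_a == hash_b

# Lookup works in both directions: hash -> string and string -> hash.
rule_name = store_a[hash_a]
round_trip = store_a[rule_name]

# Hypothetical match tuples (assumption: offsets chosen for illustration).
fake_matches = [(hash_a, 16, 17)]
resolved = [(store_a[match_id], start, end)
            for match_id, start, end in fake_matches]

print(same_hash, rule_name, resolved)
```

In a real pipeline the same lookup would go through nlp.vocab.strings or doc.vocab.strings, since the Matcher registers rule names in the vocabulary it was created with.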

Of course, if your classes and IDs are kinda cryptic anyways, the other answer suggesting integer IDs will work fine, too. Just keep in mind that those integer IDs you choose will likely also be mapped to some random string in the StringStore (like a word, or a part-of-speech tag or something). This usually doesn't matter if you're not looking them up and resolving them to strings somewhere – but if you do, the output may be confusing. For example, if your matcher rule ID is 99 and you're calling doc.vocab.strings[99], this will return 'VERB'.

answered Oct 19 '22 20:10 by Ines Montani