 

Trivial example using spaCy Matcher not working

Tags:

spacy

I'm trying to get the following simple example using the spaCy Matcher working:

import en_core_web_sm
from spacy.matcher import Matcher

nlp = en_core_web_sm.load()
matcher = Matcher(nlp.vocab)

pattern1 = [{'IS_DIGIT': True}, {'ORTH': '.'}, {'IS_DIGIT': True}, {'ORTH': '.'}, {'IS_DIGIT': True}, {'ORTH': '.'}, {'IS_DIGIT': True}]
pattern2 = [{'LIKE_NUM': True}, {'ORTH': '.'}, {'LIKE_NUM': True}, {'ORTH': '.'}, {'LIKE_NUM': True}, {'ORTH': '.'}, {'LIKE_NUM': True}]
pattern3 = [{'IS_DIGIT': True}, {'IS_PUNCT': True}, {'IS_DIGIT': True}, {'IS_PUNCT': True}, {'IS_DIGIT': True}, {'IS_PUNCT': True}, {'IS_DIGIT': True}]

matcher.add('IP', None, pattern1, pattern2, pattern3)

doc = nlp(u'This is an IP address: 192.168.1.1')

matches = matcher(doc)

However, none of the patterns are matching and this code returns [] for matches. The simple "Hello World" example provided in the spaCy sample code works fine.

What am I doing wrong?

Asked Nov 30 '17 by BriWill

People also ask

How does spaCy Matcher work?

spaCy features a rule-matching engine, the Matcher, that operates over tokens, similar to regular expressions. The rules can refer to token annotations (e.g. the token text or tag_, and flags like IS_PUNCT).

What is phrase matcher in spaCy?

The PhraseMatcher lets you efficiently match large terminology lists. While the Matcher lets you match sequences based on lists of token descriptions, the PhraseMatcher accepts match patterns in the form of Doc objects. See the usage guide for examples.
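For instance, a minimal sketch using the same spaCy 2.x add signature as the question (the terminology list here is made up purely for illustration):

from spacy.matcher import PhraseMatcher

phrase_matcher = PhraseMatcher(nlp.vocab)
# Patterns are Doc objects, so multi-word terms are matched as spaCy tokenizes them
terms = [nlp(text) for text in ('IP address', 'subnet mask')]
phrase_matcher.add('NETWORK_TERMS', None, *terms)

matches = phrase_matcher(nlp(u'Check the IP address and the subnet mask.'))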

What is matcher in NLP?

The Matcher lets you find words and phrases using rules describing their token attributes. Rules can refer to token annotations (like the text or part-of-speech tags), as well as lexical attributes like Token.is_punct. Applying the matcher to a Doc gives you access to the matched tokens in context.
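For example, assuming a matcher and doc like the ones in the question, the matched tokens can be pulled out as spans (a small sketch, not part of the original post):

for match_id, start, end in matcher(doc):
    span = doc[start:end]  # the matched tokens as a Span, still in the context of the Doc
    print(nlp.vocab.strings[match_id], span.text)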

What is entity ruler?

The entity ruler lets you add spans to the Doc.ents using token-based rules or exact phrase matches. It can be combined with the statistical EntityRecognizer to boost accuracy, or used on its own to implement a purely rule-based entity recognition system.
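A rough sketch of what that could look like (the EntityRuler was added in spaCy 2.1, so this assumes a newer version than the code in the question; the SHAPE value anticipates the shape trick used in the answer below):

from spacy.pipeline import EntityRuler

ruler = EntityRuler(nlp)
# One token-based rule: label any token with this shape as an IP entity
ruler.add_patterns([{'label': 'IP', 'pattern': [{'SHAPE': 'ddd.ddd.d.d'}]}])
nlp.add_pipe(ruler)

doc = nlp(u'The server is at 192.168.1.1')
print([(ent.text, ent.label_) for ent in doc.ents])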


1 Answer

When using the Matcher, keep in mind that each dictionary in the pattern represents one individual token. This also means that the matches it finds depend on how spaCy tokenizes your text. By default, spaCy's English tokenizer will split your example text like this:

>>> doc = nlp("This is an IP address: 192.168.1.1")
>>> [t.text for t in doc]
['This', 'is', 'an', 'IP', 'address', ':', '192.168.1.1']

192.168.1.1 stays one token (which, objectively, is probably quite reasonable – an IP address could be considered a word). So match patterns that expect parts of it to be individual tokens won't match.

In order to change this behaviour, you could customise the tokenizer with an additional rule that tells spaCy to split periods between numbers. However, this might also produce other, unintended side effects.
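For what it's worth, one way such a customisation could look is an extra infix rule that splits a period sitting between two digits (a sketch only, and one of the side effects mentioned above: decimal numbers like 3.14 would then also be split):

from spacy.util import compile_infix_regex

# Append an infix rule: split on '.' when it is surrounded by digits
infixes = list(nlp.Defaults.infixes) + [r'(?<=[0-9])\.(?=[0-9])']
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

doc = nlp(u'This is an IP address: 192.168.1.1')
print([t.text for t in doc])
# should now show '192', '.', '168', '.', '1', '.', '1' as separate tokens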

So a better approach in your case would be to work with the token shape, available as the token.shape_ attribute. The shape is a string representation of the token that describes the individual characters, and whether they contain digits, uppercase/lowercase characters and punctuation. The IP address shape looks like this:

>>> ip_address = doc[6]
>>> ip_address.shape_
'ddd.ddd.d.d'

You can either just filter your document and check that token.shape_ == 'ddd.ddd.d.d', or use 'SHAPE' as a key in your match pattern (for a single token) to find sentences or phrases containing tokens of that shape, as sketched below.
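Both options, sketched with the spaCy 2.x Matcher.add signature from the question:

# Option 1: filter tokens by their shape directly
ips = [t.text for t in doc if t.shape_ == 'ddd.ddd.d.d']

# Option 2: a single-token match pattern keyed on the token shape
matcher = Matcher(nlp.vocab)
matcher.add('IP', None, [{'SHAPE': 'ddd.ddd.d.d'}])
matches = matcher(doc)

Note that the shape string is specific to this digit grouping, so an address like 10.0.0.255 would have a different shape ('dd.d.d.ddd') and need its own pattern.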

Answered Sep 27 '22 by Ines Montani