I'm trying to get the following simple example using the spaCy Matcher working:
import en_core_web_sm
from spacy.matcher import Matcher
nlp = en_core_web_sm.load()
matcher = Matcher(nlp.vocab)
pattern1 = [{'IS_DIGIT': True}, {'ORTH': '.'}, {'IS_DIGIT': True}, {'ORTH': '.'}, {'IS_DIGIT': True}, {'ORTH': '.'}, {'IS_DIGIT': True}]
pattern2 = [{'LIKE_NUM': True}, {'ORTH': '.'}, {'LIKE_NUM': True}, {'ORTH': '.'}, {'LIKE_NUM': True}, {'ORTH': '.'}, {'LIKE_NUM': True}]
pattern3 = [{'IS_DIGIT': True}, {'IS_PUNCT': True}, {'IS_DIGIT': True}, {'IS_PUNCT': True}, {'IS_DIGIT': True}, {'IS_PUNCT': True}, {'IS_DIGIT': True}]
matcher.add('IP', None, pattern1, pattern2, pattern3)
doc = nlp(u'This is an IP address: 192.168.1.1')
matches = matcher(doc)
However, none of the patterns match, and this code returns [] for matches. The simple "Hello World" example provided in the spaCy sample code works fine.
What am I doing wrong?
spaCy features a rule-matching engine, the Matcher, that operates over tokens, similar to regular expressions. The rules can refer to token annotations (e.g. the token text or tag_, and flags like IS_PUNCT).
The PhraseMatcher lets you efficiently match large terminology lists. While the Matcher lets you match sequences based on lists of token descriptions, the PhraseMatcher accepts match patterns in the form of Doc objects. See the usage guide for examples.
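As a small illustration of the difference (a sketch assuming spaCy 2.x, where the second argument to add is an optional callback, and using made-up terms), a PhraseMatcher is fed Doc objects rather than token descriptions:

import en_core_web_sm
from spacy.matcher import PhraseMatcher

nlp = en_core_web_sm.load()
phrase_matcher = PhraseMatcher(nlp.vocab)
# the patterns are Doc objects, so they are matched on the tokenized text
terms = [nlp(text) for text in [u'IP address', u'MAC address']]
phrase_matcher.add('NET_TERMS', None, *terms)

doc = nlp(u'This is an IP address: 192.168.1.1')
print([doc[start:end].text for match_id, start, end in phrase_matcher(doc)])
# expected to print ['IP address']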
The Matcher lets you find words and phrases using rules describing their token attributes. Rules can refer to token annotations (like the text or part-of-speech tags), as well as lexical attributes like Token.is_punct. Applying the matcher to a Doc gives you access to the matched tokens in context.
The entity ruler lets you add spans to the Doc.ents using token-based rules or exact phrase matches. It can be combined with the statistical EntityRecognizer to boost accuracy, or used on its own to implement a purely rule-based entity recognition system.
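For instance (a sketch assuming spaCy 2.1+, where the EntityRuler is available, with a hypothetical PRODUCT label purely for illustration), a rule-based entity could be added like this:

import en_core_web_sm
from spacy.pipeline import EntityRuler

nlp = en_core_web_sm.load()
ruler = EntityRuler(nlp)
# an exact phrase pattern with a made-up label, just to show the mechanics
ruler.add_patterns([{'label': 'PRODUCT', 'pattern': 'spaCy'}])
# adding the ruler before the statistical NER lets its spans take precedence
nlp.add_pipe(ruler, before='ner')

doc = nlp(u'I am learning spaCy')
print([(ent.text, ent.label_) for ent in doc.ents])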
When using the Matcher, keep in mind that each dictionary in the pattern represents one individual token. This also means that the matches it finds depend on how spaCy tokenizes your text. By default, spaCy's English tokenizer will split your example text like this:
>>> doc = nlp("This is an IP address: 192.168.1.1")
>>> [t.text for t in doc]
['This', 'is', 'an', 'IP', 'address', ':', '192.168.1.1']
192.168.1.1 stays one token (which, objectively, is probably quite reasonable – an IP address could be considered a word). So a match pattern that expects parts of it to be individual tokens won't match.
In order to change this behaviour, you could customise the tokenizer with an additional rule that tells spaCy to split periods between numbers. However, this might also produce other, unintended side effects.
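As a rough sketch of what that could look like (assuming spaCy 2.x, where nlp.Defaults.infixes is a sequence of regex strings):

import en_core_web_sm
from spacy.util import compile_infix_regex

nlp = en_core_web_sm.load()
# add an infix rule that splits a period sitting between two digits
# (note: this also splits decimal numbers like 3.14, which may be unwanted)
infixes = list(nlp.Defaults.infixes) + [r'(?<=[0-9])\.(?=[0-9])']
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

print([t.text for t in nlp(u'This is an IP address: 192.168.1.1')])
# the address should now be split into ['192', '.', '168', '.', '1', '.', '1']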
So a better approach in your case would be to work with the token shape, available as the token.shape_ attribute. The shape is a string representation of the token that describes the individual characters, and whether they contain digits, uppercase/lowercase characters and punctuation. The IP address's shape looks like this:
>>> ip_address = doc[6]
>>> ip_address.shape_
'ddd.ddd.d.d'
You can either just filter your document and check that token.shape_ == 'ddd.ddd.d.d', or use 'SHAPE' as a key in your match pattern (for a single token) to find sentences or phrases containing tokens of that shape.
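For example (a sketch reusing the setup from your question; note that other IP addresses can have different shapes, such as 'ddd.ddd.dd.ddd', so you may need several patterns to cover them all):

import en_core_web_sm
from spacy.matcher import Matcher

nlp = en_core_web_sm.load()
matcher = Matcher(nlp.vocab)
# a single token description keyed on the shape of the example address
matcher.add('IP', None, [{'SHAPE': 'ddd.ddd.d.d'}])

doc = nlp(u'This is an IP address: 192.168.1.1')
print([doc[start:end].text for match_id, start, end in matcher(doc)])
# should print ['192.168.1.1']

# or simply filter the tokens by shape
print([t.text for t in doc if t.shape_ == 'ddd.ddd.d.d'])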