 

Trivial example using spaCy Matcher not working

Tags:

spacy

I'm trying to get the following simple example using the spaCy Matcher working:

import en_core_web_sm
from spacy.matcher import Matcher

nlp = en_core_web_sm.load()
matcher = Matcher(nlp.vocab)

pattern1 = [{'IS_DIGIT': True}, {'ORTH': '.'}, {'IS_DIGIT': True}, {'ORTH': '.'}, {'IS_DIGIT': True}, {'ORTH': '.'}, {'IS_DIGIT': True}]
pattern2 = [{'LIKE_NUM': True}, {'ORTH': '.'}, {'LIKE_NUM': True}, {'ORTH': '.'}, {'LIKE_NUM': True}, {'ORTH': '.'}, {'LIKE_NUM': True}]
pattern3 = [{'IS_DIGIT': True}, {'IS_PUNCT': True}, {'IS_DIGIT': True}, {'IS_PUNCT': True}, {'IS_DIGIT': True}, {'IS_PUNCT': True}, {'IS_DIGIT': True}]

matcher.add('IP', None, pattern1, pattern2, pattern3)

doc = nlp(u'This is an IP address: 192.168.1.1')

matches = matcher(doc)

However, none of the patterns are matching and this code returns [] for matches. The simple "Hello World" example provided in the spaCy sample code works fine.

What am I doing wrong?

Asked Nov 30 '17 by BriWill

People also ask

How does spaCy Matcher work?

spaCy features a rule-matching engine, the Matcher, that operates over tokens, similar to regular expressions. The rules can refer to token annotations (e.g. the token text or tag_, and flags like IS_PUNCT).

What is phrase matcher in spaCy?

The PhraseMatcher lets you efficiently match large terminology lists. While the Matcher lets you match sequences based on lists of token descriptions, the PhraseMatcher accepts match patterns in the form of Doc objects. See the usage guide for examples.
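For instance, a minimal sketch using the same spaCy 2.x add signature as the question (the terminology list here is made up purely for illustration):

from spacy.matcher import PhraseMatcher

phrase_matcher = PhraseMatcher(nlp.vocab)
# Patterns are Doc objects, so multi-word terms are matched as spaCy tokenizes them
terms = [nlp(text) for text in ('IP address', 'subnet mask')]
phrase_matcher.add('NETWORK_TERMS', None, *terms)

matches = phrase_matcher(nlp(u'Check the IP address and the subnet mask.'))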

What is matcher in NLP?

The Matcher lets you find words and phrases using rules describing their token attributes. Rules can refer to token annotations (like the text or part-of-speech tags), as well as lexical attributes like Token.is_punct. Applying the matcher to a Doc gives you access to the matched tokens in context.
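For example, assuming a matcher and doc like the ones in the question, the matched tokens can be pulled out as spans (a small sketch, not part of the original post):

for match_id, start, end in matcher(doc):
    span = doc[start:end]  # the matched tokens as a Span, still in the context of the Doc
    print(nlp.vocab.strings[match_id], span.text)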

What is entity ruler?

The entity ruler lets you add spans to the Doc.ents using token-based rules or exact phrase matches. It can be combined with the statistical EntityRecognizer to boost accuracy, or used on its own to implement a purely rule-based entity recognition system.
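A rough sketch of what that could look like (the EntityRuler was added in spaCy 2.1, so this assumes a newer version than the code in the question; the SHAPE value anticipates the shape trick used in the answer below):

from spacy.pipeline import EntityRuler

ruler = EntityRuler(nlp)
# One token-based rule: label any token with this shape as an IP entity
ruler.add_patterns([{'label': 'IP', 'pattern': [{'SHAPE': 'ddd.ddd.d.d'}]}])
nlp.add_pipe(ruler)

doc = nlp(u'The server is at 192.168.1.1')
print([(ent.text, ent.label_) for ent in doc.ents])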


1 Answer

When using the Matcher, keep in mind that each dictionary in the pattern represents one individual token. This also means that the matches it finds depend on how spaCy tokenizes your text. By default, spaCy's English tokenizer will split your example text like this:

>>> doc = nlp("This is an IP address: 192.168.1.1")
>>> [t.text for t in doc]
['This', 'is', 'an', 'IP', 'address', ':', '192.168.1.1']

192.168.1.1 stays one token (which, objectively, is probably quite reasonable – an IP address could be considered a word). So match patterns that expect parts of it to be individual tokens won't match.

In order to change this behaviour, you could customise the tokenizer with an additional rule that tells spaCy to split periods between numbers. However, this might also produce other, unintended side effects.
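For what it's worth, one way such a customisation could look is an extra infix rule that splits a period sitting between two digits (a sketch only, and one of the side effects mentioned above: decimal numbers like 3.14 would then also be split):

from spacy.util import compile_infix_regex

# Append an infix rule: split on '.' when it is surrounded by digits
infixes = list(nlp.Defaults.infixes) + [r'(?<=[0-9])\.(?=[0-9])']
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

doc = nlp(u'This is an IP address: 192.168.1.1')
print([t.text for t in doc])
# should now show '192', '.', '168', '.', '1', '.', '1' as separate tokens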

So a better approach in your case would be to work with the token shape, available as the token.shape_ attribute. The shape is a string representation of the token that describes the individual characters, and whether they contain digits, uppercase/lowercase characters and punctuation. The IP address shape looks like this:

>>> ip_address = doc[6]
>>> ip_address.shape_
'ddd.ddd.d.d'

You can either just filter your document and check that token.shape_ == 'ddd.ddd.d.d', or use 'SHAPE' as a key in your match pattern (for a single token) to find sentences or phrases containing tokens of that shape, as sketched below.
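Both options, sketched with the spaCy 2.x Matcher.add signature from the question:

# Option 1: filter tokens by their shape directly
ips = [t.text for t in doc if t.shape_ == 'ddd.ddd.d.d']

# Option 2: a single-token match pattern keyed on the token shape
matcher = Matcher(nlp.vocab)
matcher.add('IP', None, [{'SHAPE': 'ddd.ddd.d.d'}])
matches = matcher(doc)

Note that the shape string is specific to this digit grouping, so an address like 10.0.0.255 would have a different shape ('dd.d.d.ddd') and need its own pattern.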

Answered Sep 27 '22 by Ines Montani