The spaCy documentation and samples show that the PhraseMatcher class is useful for matching sequences of tokens in documents. One must provide a vocabulary of sequences that will be matched.
In my application, the documents are collections of tokens and phrases containing entities of different types. The data is only remotely natural language (the documents are more like sets of keywords in semi-random order). I am trying to find matches of multiple types.
For example:
yellow boots for kids
How can I find the matches for colors (e.g. yellow), for product types (e.g. boots) and for the age (e.g. kids) using spaCy's PhraseMatcher? Is this a good use case? If the different entity matches overlap (e.g. a color is matched in both the colors list and the materials list), is it possible to produce all unique cases?
I cannot really use a sequence tagger because the data is loosely structured and riddled with ambiguities. I have a list of entity types (e.g. colors, ages, product types) and an associated value list for each.
One idea would be to instantiate multiple PhraseMatcher objects, one per entity type, each with its own vocabulary, run the matches separately, and then merge the results. This sounds straightforward but may be inefficient, especially the merging step, since the value lists are fairly large. Before going this route, I would like to know whether this is a good idea or whether there is a simpler way to do it with spaCy.
spaCy's PhraseMatcher supports adding multiple rules containing several patterns, and assigning IDs to each matcher rule you add. If two rules overlap, both matches will be returned. So you could do something like this:
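import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank('en')  # assumption: a blank English pipeline; any loaded pipeline works too

color_patterns = [nlp(text) for text in ('red', 'green', 'yellow')]
product_patterns = [nlp(text) for text in ('boots', 'coats', 'bag')]
material_patterns = [nlp(text) for text in ('silk', 'yellow fabric')]
matcher = PhraseMatcher(nlp.vocab)
# spaCy v2 signature: add(key, on_match, *patterns); in v3 this is add(key, patterns)
matcher.add('COLOR', None, *color_patterns)
matcher.add('PRODUCT', None, *product_patterns)
matcher.add('MATERIAL', None, *material_patterns)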
When you call the matcher on your doc, spaCy will return a list of (match_id, start, end) tuples. Because spaCy stores all strings as integers, the match_id you get back will be an integer, too, but you can always get the string representation by looking it up in the vocabulary's StringStore, i.e. nlp.vocab.strings:
doc = nlp("yellow fabric")
matches = matcher(doc)
for match_id, start, end in matches:
    rule_id = nlp.vocab.strings[match_id]  # get the unicode ID, i.e. 'COLOR'
    span = doc[start : end]  # get the matched slice of the doc
    print(rule_id, span.text)
# COLOR yellow
# MATERIAL yellow fabric
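To address the "all unique cases" part of the question, the matches from a single PhraseMatcher can be grouped by span so that each matched piece of text lists every rule that hit it. The following is only a sketch under a few assumptions: it uses the spaCy v3 call style, where add takes a list of pattern docs; the value_lists dictionary and the labels_by_span helper are hypothetical names invented for illustration; and a blank English pipeline stands in for whatever nlp object you already have.

```python
from collections import defaultdict

import spacy
from spacy.matcher import PhraseMatcher

# Assumption: a blank pipeline (tokenizer only) is enough for phrase matching.
nlp = spacy.blank("en")

# Hypothetical value lists, one per entity type.
value_lists = {
    "COLOR": ["red", "green", "yellow"],
    "PRODUCT": ["boots", "coats", "bag"],
    "MATERIAL": ["silk", "yellow fabric"],
}

matcher = PhraseMatcher(nlp.vocab)
for label, values in value_lists.items():
    # spaCy v3 signature: add(key, list_of_pattern_docs).
    # make_doc only tokenizes, which is cheaper than nlp() for large lists.
    matcher.add(label, [nlp.make_doc(value) for value in values])


def labels_by_span(doc):
    """Map each matched (start, end, text) span to the set of rule IDs that hit it."""
    grouped = defaultdict(set)
    for match_id, start, end in matcher(doc):
        key = (start, end, doc[start:end].text)
        grouped[key].add(nlp.vocab.strings[match_id])
    return dict(grouped)


result = labels_by_span(nlp("yellow fabric boots"))
for (start, end, text), labels in sorted(result.items()):
    print(start, end, text, sorted(labels))
```

Because everything lives in one matcher, there is no separate merge step across matchers; the grouping is a single pass over the returned tuples.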
When you add matcher rules, you can also define an on_match callback function as the second argument of Matcher.add. This is often useful if you want to trigger specific actions, for example, do one thing if a COLOR match is found, and something else for a PRODUCT match.
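As a rough illustration of per-rule callbacks, the sketch below collects COLOR and PRODUCT matches into separate lists. It assumes spaCy v3, where the callback is passed as the on_match keyword argument rather than as the second positional argument; the found dictionary and the on_color and on_product functions are hypothetical names used only for this example.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline
matcher = PhraseMatcher(nlp.vocab)

# Hypothetical container that the callbacks fill in.
found = {"colors": [], "products": []}


def on_color(matcher, doc, i, matches):
    # spaCy calls this once per COLOR match; matches[i] is the current match.
    match_id, start, end = matches[i]
    found["colors"].append(doc[start:end].text)


def on_product(matcher, doc, i, matches):
    # Called once per PRODUCT match.
    match_id, start, end = matches[i]
    found["products"].append(doc[start:end].text)


color_patterns = [nlp.make_doc(text) for text in ("red", "green", "yellow")]
product_patterns = [nlp.make_doc(text) for text in ("boots", "coats", "bag")]

# spaCy v3 style: the callback is a keyword argument.
matcher.add("COLOR", color_patterns, on_match=on_color)
matcher.add("PRODUCT", product_patterns, on_match=on_product)

matches = matcher(nlp("yellow boots for kids"))
print(found)
```

The callbacks run as a side effect of calling the matcher, and the list of (match_id, start, end) tuples is still returned as usual.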
If you want to solve this even more elegantly, you might also want to look into combining your matcher with a custom pipeline component or custom attributes. For example, you could write a simple component that's run automatically when you call nlp() on your text, finds the matches, and sets a Doc._.contains_product or Token._.is_color attribute. The docs have a few examples of this that should help you get started.
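One possible shape for such a component is sketched below. It assumes spaCy v3 (the Language.factory decorator and add_pipe by name); the component name keyword_flagger and the function names are made up for this example, and the short value lists stand in for your real ones.

```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc, Token

# Register the custom attributes mentioned above (force=True allows re-running).
Token.set_extension("is_color", default=False, force=True)
Doc.set_extension("contains_product", default=False, force=True)


@Language.factory("keyword_flagger")  # hypothetical component name
def create_keyword_flagger(nlp, name):
    # Build the matcher once, when the component is added to the pipeline.
    matcher = PhraseMatcher(nlp.vocab)
    matcher.add("COLOR", [nlp.make_doc(t) for t in ("red", "green", "yellow")])
    matcher.add("PRODUCT", [nlp.make_doc(t) for t in ("boots", "coats", "bag")])

    def flag_keywords(doc):
        # Set attributes based on which rule produced each match.
        for match_id, start, end in matcher(doc):
            label = nlp.vocab.strings[match_id]
            if label == "COLOR":
                for token in doc[start:end]:
                    token._.is_color = True
            elif label == "PRODUCT":
                doc._.contains_product = True
        return doc

    return flag_keywords


nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline
nlp.add_pipe("keyword_flagger")

doc = nlp("yellow boots for kids")
print([token.text for token in doc if token._.is_color], doc._.contains_product)
```

With this in place, every doc produced by nlp() already carries the flags, so downstream code does not need to call the matcher itself.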