I need to extract combinations of items from two lists using the spaCy Matcher in Python. The problem is as follows. Suppose we have two lists:
colors=['red','bright red','black','brown','dark brown']
animals=['fox','bear','hare','squirrel','wolf']
I match the sequences with the following code:
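import spacy
from spacy.matcher import Matcher

# Assumption: the question does not show how nlp is created; a blank English
# pipeline is enough here because the patterns only look at token text.
nlp = spacy.blank("en")

# Split two-word colors into first and second words; keep one-word colors separately.
first_color = []
last_color = []
only_first_color = []
for color in colors:
    if ' ' in color:
        first_color.append(color.split(' ')[0])
        last_color.append(color.split(' ')[1])
    else:
        only_first_color.append(color)

matcher = Matcher(nlp.vocab)
pattern1 = [{"TEXT": {"IN": only_first_color}}, {"TEXT": {"IN": animals}}]
pattern2 = [{"TEXT": {"IN": first_color}}, {"TEXT": {"IN": last_color}}, {"TEXT": {"IN": animals}}]
# spaCy v2 signature; on spaCy v3 use matcher.add("ANIMALS", [pattern1, pattern2])
matcher.add("ANIMALS", None, pattern1, pattern2)

doc = nlp('bright red fox met black wolf')
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(start, end, span.text)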
It gives the output:
0 3 bright red fox
1 3 red fox
4 6 black wolf
How can I extract only 'bright red fox' and 'black wolf'? Should I change the pattern rules or post-process the matches?
Any thoughts are appreciated!
spaCy features a rule-matching engine, the Matcher, that operates over tokens, similar to regular expressions. The rules can refer to token annotations (e.g. the token text or tag_, and flags like IS_PUNCT).
The PhraseMatcher lets you efficiently match large terminology lists. While the Matcher lets you match sequences based on lists of token descriptions, the PhraseMatcher accepts match patterns in the form of Doc objects.
Unlike the fixed pattern matching of regular expressions, the Matcher lets us match tokens, phrases, and entities in words and sentences according to preset patterns, using features such as part-of-speech tags, entity types, dependency parses, lemmas, and more.
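As an illustration of the PhraseMatcher described above, here is a minimal sketch applied to the lists from the question. It assumes spaCy v3 (where `add` takes a list of patterns) and a blank English pipeline; building one phrase per color/animal pair is an assumption made for this example, not something the original poster did. The overlapping matches are still reported, so the sketch removes them with `spacy.util.filter_spans`.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: a blank English pipeline; only tokenization is needed here.
nlp = spacy.blank("en")

colors = ['red', 'bright red', 'black', 'brown', 'dark brown']
animals = ['fox', 'bear', 'hare', 'squirrel', 'wolf']

matcher = PhraseMatcher(nlp.vocab)
# PhraseMatcher patterns are Doc objects: one per "color animal" phrase.
patterns = [nlp.make_doc(f"{color} {animal}") for color in colors for animal in animals]
matcher.add("ANIMALS", patterns)  # spaCy v3 signature

doc = nlp('bright red fox met black wolf')
spans = [doc[start:end] for _, start, end in matcher(doc)]

# Both "bright red fox" and "red fox" match; keep only the longest of overlapping spans.
result = [span.text for span in spacy.util.filter_spans(spans)]
print(result)
```

With 5 colors and 5 animals this builds only 25 patterns; for much larger lists the token-based Matcher patterns from the question may be the more compact choice.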
You may use spacy.util.filter_spans:
Filter a sequence of Span objects and remove duplicates or overlaps. Useful for creating named entities (where one token can only be part of one entity) or when merging spans with Retokenizer.merge. When spans overlap, the (first) longest span is preferred over shorter spans.
Python code:
import spacy  # needed for spacy.util.filter_spans

matches = matcher(doc)
spans = [doc[start:end] for _, start, end in matches]
for span in spacy.util.filter_spans(spans):
    print(span.start, span.end, span.text)
Output:
0 3 bright red fox
4 6 black wolf
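To make the "longest span wins" rule concrete, here is a rough pure-Python sketch of the same idea applied to plain (start, end) pairs. It is not spaCy's implementation; the function name `filter_overlaps` and the whitespace tokenization are assumptions made for this example.

```python
def filter_overlaps(matches):
    """Keep the longest non-overlapping (start, end) pairs; end is exclusive."""
    # Longest span first; among equal lengths, the earlier start comes first.
    ordered = sorted(matches, key=lambda m: (m[1] - m[0], -m[0]), reverse=True)
    kept = []
    taken = set()  # token indices already covered by a kept span
    for start, end in ordered:
        if not any(i in taken for i in range(start, end)):
            kept.append((start, end))
            taken.update(range(start, end))
    return sorted(kept)  # back in document order


# Hypothetical stand-in for the Doc: the sentence split on whitespace.
tokens = 'bright red fox met black wolf'.split()
# The three (start, end) pairs reported by the Matcher in the question.
matches = [(0, 3), (1, 3), (4, 6)]

result = [' '.join(tokens[start:end]) for start, end in filter_overlaps(matches)]
print(result)
```

Here (1, 3) is dropped because tokens 1 and 2 are already covered by the longer (0, 3) match, which mirrors the output shown above.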