 

How to avoid double-extraction of overlapping patterns in spaCy with Matcher?

I need to extract combinations of items from 2 lists using the Python spaCy Matcher. The problem is the following: let us have 2 lists:

colors=['red','bright red','black','brown','dark brown']
animals=['fox','bear','hare','squirrel','wolf']

I match the sequences with the following code:

import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # only token text is matched, so a blank English pipeline is enough

# Split two-word colors into first/second words; keep one-word colors separately.
first_color = []
last_color = []
only_first_color = []
for color in colors:
    if ' ' in color:
        first_color.append(color.split(' ')[0])
        last_color.append(color.split(' ')[1])
    else:
        only_first_color.append(color)

matcher = Matcher(nlp.vocab)

# pattern1: one-word color + animal; pattern2: two-word color + animal
pattern1 = [{"TEXT": {"IN": only_first_color}}, {"TEXT": {"IN": animals}}]
pattern2 = [{"TEXT": {"IN": first_color}}, {"TEXT": {"IN": last_color}}, {"TEXT": {"IN": animals}}]

matcher.add("ANIMALS", None, pattern1, pattern2)  # spaCy v2 signature

doc = nlp('bright red fox met black wolf')

matches = matcher(doc)

for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(start, end, span.text)

It gives the output:

0 3 bright red fox
1 3 red fox
4 6 black wolf

How can I extract only 'bright red fox' and 'black wolf'? Should I change the pattern rules or post-process the matches?

Any thoughts appreciated!

asked Aug 07 '20 12:08 by Victoria

People also ask

What is rule-based matching in spaCy?

spaCy features a rule-matching engine, the Matcher, that operates over tokens, similar to regular expressions. The rules can refer to token annotations (e.g. the token text or tag_, and flags like IS_PUNCT).
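
For illustration, a minimal sketch of such a token-level rule (the rule name, pattern and example sentence are made up here; the add() call uses the same spaCy v2 style as the question):

import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One rule: the word "hello" in any casing, followed by a punctuation token.
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}]
matcher.add("GREETING", None, pattern)  # in spaCy v3: matcher.add("GREETING", [pattern])

doc = nlp("Hello, world!")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # Hello,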

What is PhraseMatcher in spaCy?

The PhraseMatcher lets you efficiently match large terminology lists. While the Matcher lets you match sequences based on lists of token descriptions, the PhraseMatcher accepts match patterns in the form of Doc objects.
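
A minimal sketch of that difference, reusing the question's phrases as terminology entries (the rule name is illustrative):

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)

# Each pattern is a Doc, so multi-word terms need no manual splitting.
terms = ["bright red fox", "black wolf"]
patterns = [nlp.make_doc(term) for term in terms]
matcher.add("ANIMALS", None, *patterns)  # spaCy v2 signature; v3 takes a list instead

doc = nlp("bright red fox met black wolf")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # bright red fox, black wolf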

What is rule-based grammar matching?

Unlike regular expressions' fixed pattern matching, this lets us match tokens, phrases and entities in sentences according to pre-set patterns, using features such as part-of-speech tags, entity types, dependency parses, lemmatization and many more.
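
As a small sketch of a rule that uses such annotations rather than exact text (it assumes the en_core_web_sm model is installed; the sentence and rule name are made up):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")  # needs: python -m spacy download en_core_web_sm
matcher = Matcher(nlp.vocab)

# Any form of the lemma "be" followed by an adjective, regardless of surface text.
pattern = [{"LEMMA": "be"}, {"POS": "ADJ"}]
matcher.add("BE_ADJ", None, pattern)

doc = nlp("The fox was quick and the wolf is hungry.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # expected: "was quick" and "is hungry"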


1 Answer

You may use spacy.util.filter_spans:

Filter a sequence of Span objects and remove duplicates or overlaps. Useful for creating named entities (where one token can only be part of one entity) or when merging spans with Retokenizer.merge. When spans overlap, the (first) longest span is preferred over shorter spans.

Python code:

matches = matcher(doc)
spans = [doc[start:end] for _, start, end in matches]
for span in spacy.util.filter_spans(spans):
    print(span.start, span.end, span.text)

Output:

0 3 bright red fox
4 6 black wolf
answered Oct 24 '22 02:10 by Wiktor Stribiżew
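
A side note beyond the accepted answer: if you are on spaCy v3 or later, Matcher.add also takes a greedy argument, so the overlap filtering can happen inside the matcher itself instead of in post-processing. A sketch, assuming the v3 add() signature and reusing nlp, pattern1 and pattern2 from the question:

from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
# In v3, patterns are passed as a list; greedy="LONGEST" keeps only the longest
# of any overlapping matches for this rule.
matcher.add("ANIMALS", [pattern1, pattern2], greedy="LONGEST")

doc = nlp('bright red fox met black wolf')
for match_id, start, end in matcher(doc):
    print(start, end, doc[start:end].text)
# expected: only 'bright red fox' and 'black wolf'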