 

Extract verb phrases using Spacy

Tags: python, spacy

I have been using spaCy to extract noun chunks via the Doc.noun_chunks property. How can I extract verb phrases from input text with spaCy (of the form 'VERB ? ADV * VERB +')?

Nidhi asked Dec 17 '17 14:12


2 Answers

This might help you.

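from __future__ import unicode_literals  # only needed on Python 2
import spacy
import en_core_web_sm
import textacy

# Note: textacy.Doc and pos_regex_matches belong to older textacy releases.
nlp = en_core_web_sm.load()
sentence = 'The author is writing a new book.'
pattern = r'<VERB>?<ADV>*<VERB>+'
doc = textacy.Doc(sentence, lang='en_core_web_sm')
verb_phrases = textacy.extract.pos_regex_matches(doc, pattern)
for phrase in verb_phrases:  # avoid shadowing the built-in name `list`
    print(phrase.text)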

Output:

is writing

For how to highlight the verb phrases, see the link below.

Highlight verb phrases using spacy and html

Another Approach:

I recently noticed that textacy has changed how regex matches work. Based on the new approach, I tried it this way.

from __future__ import unicode_literals  # only needed on Python 2
import spacy
import en_core_web_sm
import textacy

nlp = en_core_web_sm.load()
sentence = 'The cat sat on the mat. The dog jumped into the water. The author is writing a book.'
# Token-pattern form of VERB? ADV* VERB+
pattern = [{'POS': 'VERB', 'OP': '?'},
           {'POS': 'ADV', 'OP': '*'},
           {'POS': 'VERB', 'OP': '+'}]
doc = textacy.make_spacy_doc(sentence, lang='en_core_web_sm')
verb_phrases = textacy.extract.matches(doc, pattern)
for phrase in verb_phrases:  # avoid shadowing the built-in name `list`
    print(phrase.text)

Output:

sat
jumped
writing

I checked the POS matches in the demo linked below, and the result does not seem to be the intended one.

https://explosion.ai/demos/matcher

Has anybody tried framing the pattern with POS tags instead of a regexp pattern to find verb phrases?

Edit 2:

import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

nlp = spacy.load('en_core_web_sm')

sentence = 'The cat sat on the mat. He quickly ran to the market. The dog jumped into the water. The author is writing a book.'
pattern = [{'POS': 'VERB', 'OP': '?'},
           {'POS': 'ADV', 'OP': '*'},
           {'POS': 'AUX', 'OP': '*'},
           {'POS': 'VERB', 'OP': '+'}]

# instantiate a Matcher instance
matcher = Matcher(nlp.vocab)
# spaCy v2 call signature; in spaCy v3 this is matcher.add("Verb phrase", [pattern])
matcher.add("Verb phrase", None, pattern)

doc = nlp(sentence)
# call the matcher to find matches
matches = matcher(doc)
spans = [doc[start:end] for _, start, end in matches]

# keep only the longest non-overlapping spans
print(filter_spans(spans))

Output:

[sat, quickly ran, jumped, is writing]

Based on help from mdmjsh's answer.

Edit 3: Strange behavior. With the following pattern, the verb phrase in the following sentence is identified correctly in https://explosion.ai/demos/matcher

pattern = [{'POS': 'VERB', 'OP': '?'},
           {'POS': 'ADV', 'OP': '*'},
           {'POS': 'VERB', 'OP': '+'}]

The very black cat must be really meowing really loud in the yard.

But running the same pattern from code outputs the following:

[must, really meowing]
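One possible explanation, sketched below without spaCy: the result depends entirely on which POS tags the loaded model assigns. The tags in this example are an assumption chosen to reproduce the output above ("must" tagged VERB, "be" tagged AUX); they are not taken from any particular model. The helper pos_regex_spans is hypothetical and only mimics the '<VERB>?<ADV>*<VERB>+' regex style over a list of (text, POS) pairs. Unlike the spaCy Matcher, it returns only leftmost, non-overlapping matches.

```python
import re


def pos_regex_spans(tagged, pattern):
    """Return phrases whose POS sequence matches a '<TAG>'-style regex."""
    # Group each '<TAG>' unit so that ?, * and + apply to the whole tag.
    regex = re.compile(re.sub(r"<([A-Z]+)>", r"(?:<\1>)", pattern))
    tag_string = "".join(f"<{pos}>" for _, pos in tagged)

    # Character offset at which each token's tag starts, plus an end sentinel.
    offsets = []
    position = 0
    for _, pos in tagged:
        offsets.append(position)
        position += len(pos) + 2  # the tag plus its '<' and '>'
    offsets.append(position)

    phrases = []
    for match in regex.finditer(tag_string):
        start = offsets.index(match.start())
        end = offsets.index(match.end())
        phrases.append(" ".join(text for text, _ in tagged[start:end]))
    return phrases


# Assumed tags, for illustration only.
tagged = [
    ("The", "DET"), ("very", "ADV"), ("black", "ADJ"), ("cat", "NOUN"),
    ("must", "VERB"), ("be", "AUX"), ("really", "ADV"), ("meowing", "VERB"),
    ("really", "ADV"), ("loud", "ADV"), ("in", "ADP"), ("the", "DET"),
    ("yard", "NOUN"), (".", "PUNCT"),
]

print(pos_regex_spans(tagged, "<VERB>?<ADV>*<VERB>+"))
print(pos_regex_spans(tagged, "<VERB>?<AUX>*<ADV>*<VERB>+"))
```

Under these assumed tags the first pattern breaks at "be", because AUX is neither VERB nor ADV. Allowing AUX tokens before the adverbs, as in the second pattern, lets the whole phrase match. Printing token.pos_ for each token in your own doc shows which tags your installed model actually produces.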

Programmer_nltk answered Oct 17 '22 15:10


The above answer references textacy, but this is all achievable in spaCy directly with the Matcher; there is no need for the wrapper library.

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')  # download model first

sentence = 'The author was staring pensively as she wrote'

pattern = [{'POS': 'VERB', 'OP': '?'},
           {'POS': 'ADV', 'OP': '*'},
           {'OP': '*'},  # additional wildcard - match any text in between
           {'POS': 'VERB', 'OP': '+'}]

# instantiate a Matcher instance
matcher = Matcher(nlp.vocab)

# Add pattern to matcher
# spaCy v2 call signature; in spaCy v3 this is matcher.add("verb-phrases", [pattern])
matcher.add("verb-phrases", None, pattern)
doc = nlp(sentence)
# call the matcher to find matches
matches = matcher(doc)

N.b. this returns a list of tuples containing the match ID and the start, end index for each match, e.g.:

[(15658055046270554203, 0, 4),
 (15658055046270554203, 1, 4),
 (15658055046270554203, 2, 4),
 (15658055046270554203, 3, 4),
 (15658055046270554203, 0, 8),
 (15658055046270554203, 1, 8),
 (15658055046270554203, 2, 8),
 (15658055046270554203, 3, 8),
 (15658055046270554203, 4, 8),
 (15658055046270554203, 5, 8),
 (15658055046270554203, 6, 8),
 (15658055046270554203, 7, 8)]

You can turn these matches into spans using the indexes.

spans = [doc[start:end] for _, start, end in matches]
for span in spans:
    print(span.text)

# output
"""
The author was staring
author was staring
was staring
staring
The author was staring pensively as she wrote
author was staring pensively as she wrote
was staring pensively as she wrote
staring pensively as she wrote
pensively as she wrote
as she wrote
she wrote
wrote
"""

Note that I added the additional {'OP': '*'} to the pattern, which serves as a wildcard when not given a specific POS/DEP (i.e. it will match any text). This is useful here because the question is about verb phrases: the format VERB, ADV, VERB is an unusual structure (try to think of some example sentences), whereas VERB, ADV, [other text], VERB is likely (as in the example sentence 'The author was staring pensively as she wrote'). Optionally, you can refine the pattern to be more specific (displacy is your friend here).

Note further that all permutations of the match are returned, because the matcher reports every span that satisfies the pattern. You can optionally reduce this to just the longest form by using filter_spans to remove duplicates or overlaps.


from spacy.util import filter_spans

filter_spans(spans)
# output
[The author was staring pensively as she wrote]
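
To make the reduction step concrete, here is a minimal pure-Python sketch of the kind of rule filter_spans is documented to apply: prefer longer spans and drop anything that overlaps a span already kept. The function filter_longest is a hypothetical stand-in operating on plain (start, end) pairs, not spaCy's own implementation, and the token list is just the example sentence split on whitespace.

```python
def filter_longest(matches):
    """Keep the longest non-overlapping (start, end) ranges."""
    # Longest ranges first; among equal lengths, the earlier start wins.
    ordered = sorted(matches, key=lambda m: (m[1] - m[0], -m[0]), reverse=True)
    kept = []
    seen = set()  # token indices already covered by a kept range
    for start, end in ordered:
        if start not in seen and end - 1 not in seen:
            kept.append((start, end))
            seen.update(range(start, end))
    return sorted(kept)


tokens = "The author was staring pensively as she wrote".split()

# (start, end) pairs copied from the matcher output above, match IDs dropped.
matches = [(0, 4), (1, 4), (2, 4), (3, 4),
           (0, 8), (1, 8), (2, 8), (3, 8),
           (4, 8), (5, 8), (6, 8), (7, 8)]

longest = filter_longest(matches)
phrases = [" ".join(tokens[start:end]) for start, end in longest]
print(phrases)
```

Here the range (0, 8) covers every other match, so it is the only one kept, which agrees with the single span shown above.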

mdmjsh answered Oct 17 '22 16:10