 

How to write a spaCy Matcher for a POS regex

Tags:

nlp

spacy

spaCy has two features I'd like to combine: part-of-speech (POS) tagging and rule-based matching.

How can I combine them in a neat way?

For example, say the input is a single sentence and I'd like to verify that it meets some POS ordering condition, such as the verb coming after the noun (something like a noun**verb regex). The result should be true or false. Is that doable, or is the matcher limited to specific token patterns like in the example?

Can rule-based matching use POS rules?

If not, here is my current plan: gather everything into one string and apply a regex.

    import spacy

    # 'en' is the spaCy 1.x shortcut; later releases load a named
    # pipeline such as 'en_core_web_sm' instead.
    nlp = spacy.load('en')
    #doc = nlp(u'is there any way you can do it')
    text = u'what are the main issues'
    doc = nlp(text)

    concatPos = ''
    print(text)
    for word in doc:
        print(word.text, word.lemma, word.lemma_, word.tag, word.tag_, word.pos, word.pos_)
        # build one text_TAG_POS chunk per token, each followed by "-"
        concatPos += word.text + "_" + word.tag_ + "_" + word.pos_ + "-"
    print('-----------')
    print(concatPos)
    print('-----------')

    # output string: what_WP_NOUN-are_VBP_VERB-the_DT_DET-main_JJ_ADJ-issues_NNS_NOUN-
user1025852 asked Mar 16 '17 09:03




2 Answers

Sure, simply use the POS attribute.

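    import spacy
    from spacy.matcher import Matcher

    # Assumes spaCy 3.x with the en_core_web_sm pipeline installed
    # (spaCy 1.x used spacy.load('en') and matcher.add_pattern instead).
    nlp = spacy.load('en_core_web_sm')
    matcher = Matcher(nlp.vocab)
    # Match an adjective immediately followed by a noun
    matcher.add("ADJ_NOUN", [[{"POS": "ADJ"}, {"POS": "NOUN"}]])

    doc = nlp(u'what are the main issues')
    matches = matcher(doc)  # list of (match_id, start, end) tuples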
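To get the true/false answer the question asks for, one option is a wildcard token between the two POS constraints. This is only a minimal sketch, assuming spaCy 3.x: the helper name `noun_before_verb` is made up for illustration, and the docs are tagged by hand so the example runs without a trained pipeline. With a real pipeline you would pass `nlp(text)` instead.

```python
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.vocab import Vocab


def noun_before_verb(doc):
    """Return True if some NOUN is followed, anywhere later, by a VERB.

    Hypothetical helper, not part of spaCy.
    """
    matcher = Matcher(doc.vocab)
    # {"OP": "*"} is a wildcard token repeated zero or more times,
    # so any number of tokens may sit between the noun and the verb.
    pattern = [{"POS": "NOUN"}, {"OP": "*"}, {"POS": "VERB"}]
    matcher.add("NOUN_THEN_VERB", [pattern])
    return len(matcher(doc)) > 0


vocab = Vocab()
# Assumption: POS tags are supplied by hand instead of by a tagger.
doc_yes = Doc(vocab, words=["the", "issues", "often", "persist"],
              pos=["DET", "NOUN", "ADV", "VERB"])
doc_no = Doc(vocab, words=["persist", "the", "issues"],
             pos=["VERB", "DET", "NOUN"])

print(noun_before_verb(doc_yes))
print(noun_before_verb(doc_no))
```

Building the Matcher inside the function keeps the sketch short; for many documents it would be cheaper to create it once and reuse it.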
Eyal Shulman answered Oct 21 '22 16:10


Eyal Shulman's answer was helpful, but it makes you hard-code a token pattern rather than use a regular expression.

I wanted to use regular expressions, so I wrote my own solution:

    import re

    ## doc, start and end are assumed to come from earlier code (e.g. a match span)
    sentence = doc[start:end].sent
    pattern = re.compile(r'(<VERB>)*(<ADV>)*(<PART>)*(<VERB>)+(<PART>)*')

    ## create a string with the POS of each token in the sentence
    posString = ""
    for w in sentence:
        posString += "<" + w.pos_ + ">"

    lstVerb = []
    for m in pattern.finditer(posString):
        ## each m is a verb phrase match
        ## count the "<" in m to find how many tokens we want
        numTokensInGroup = m.group().count('<')

        ## then count the tokens that came before the match,
        ## i.e. before character offset m.start()
        numTokensBeforeGroup = posString[:m.start()].count('<')

        verbPhrase = sentence[numTokensBeforeGroup:numTokensBeforeGroup + numTokensInGroup]
        lstVerb.append(verbPhrase)
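The same offset-counting idea can be sketched as a standalone function that works on a plain list of POS tags, which makes it easy to try without loading spaCy. The function name `pos_regex_spans` and the sample tag list (taken from the output shown in the question) are illustrative assumptions, not part of any library.

```python
import re


def pos_regex_spans(pos_tags, pattern):
    """Return (start, end) token index pairs for each regex match.

    pos_tags is a list such as [tok.pos_ for tok in doc]; the regex is
    applied to the tags joined as "<TAG1><TAG2>...".
    """
    pos_string = "".join("<" + tag + ">" for tag in pos_tags)
    spans = []
    for m in re.finditer(pattern, pos_string):
        if not m.group():
            continue  # skip empty matches from patterns that allow them
        # Each "<" marks one token, so counting them converts
        # character offsets back into token offsets.
        start = pos_string[:m.start()].count("<")
        end = start + m.group().count("<")
        spans.append((start, end))
    return spans


# Tags for "what are the main issues" as printed in the question
tags = ["NOUN", "VERB", "DET", "ADJ", "NOUN"]

# Noun phrases: optional adjectives followed by one or more nouns
print(pos_regex_spans(tags, r"(<ADJ>)*(<NOUN>)+"))

# True/False ordering check: a noun followed somewhere later by a verb
noun_then_verb = r"<NOUN>(<[A-Z]+>)*<VERB>"
print(bool(pos_regex_spans(tags, noun_then_verb)))
```

Each returned pair can be used to slice the original Doc or Span, for example `doc[start:end]`, to recover the matched tokens.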
Joshua Stafford answered Oct 21 '22 17:10