spaCy has two features I'd like to combine: part-of-speech (POS) tagging and rule-based matching. How can I combine them in a neat way?
For example, say the input is a single sentence, and I'd like to verify that it meets some POS ordering condition, e.g. that a verb comes after a noun (something like a noun.*verb regex). The result should be true or false. Is that doable, or is the Matcher limited to patterns like the ones in the examples?
Can rule-based matching patterns include POS rules?
If not, here is my current plan: gather everything into one string and apply a regex:
import spacy

nlp = spacy.load('en_core_web_sm')
#doc = nlp(u'is there any way you can do it')
text = u'what are the main issues'
doc = nlp(text)
concatPos = ''
print(text)
for word in doc:
    print(word.text, word.lemma, word.lemma_, word.tag, word.tag_, word.pos, word.pos_)
    concatPos += word.text + "_" + word.tag_ + "_" + word.pos_ + "-"
print('-----------')
print(concatPos)
print('-----------')
# output string: what_WP_NOUN-are_VBP_VERB-the_DT_DET-main_JJ_ADJ-issues_NNS_NOUN-
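To show that this plan works, here is a minimal, model-free sketch of the regex check: the helper name and the `<POS>` string format are my own, and the tag list stands in for `[token.pos_ for token in doc]` from the code above.

```python
import re

def noun_before_verb(pos_tags):
    """Return True if a NOUN appears somewhere before a VERB.

    pos_tags is a list of coarse POS strings, e.g. the output of
    [token.pos_ for token in nlp(sentence)].
    """
    pos_string = ''.join('<' + p + '>' for p in pos_tags)
    # a NOUN, then any number of tags, then a VERB
    return re.search(r'<NOUN>(<[A-Z]+>)*<VERB>', pos_string) is not None

print(noun_before_verb(['NOUN', 'VERB', 'DET', 'ADJ', 'NOUN']))  # True
print(noun_before_verb(['VERB', 'NOUN']))                        # False
```

Wrapping each tag in angle brackets avoids false matches between tags that are prefixes of one another (e.g. `VERB` inside a longer made-up tag), which a bare concatenation would allow.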
The Matcher lets you find words and phrases using rules describing their token attributes. Rules can refer to token annotations (like the text or part-of-speech tags), as well as lexical attributes like `Token.is_punct`. Applying the matcher to a Doc gives you access to the matched tokens in context.
The PhraseMatcher lets you efficiently match large terminology lists. While the Matcher lets you match sequences based on lists of token descriptions, the PhraseMatcher accepts match patterns in the form of Doc objects. See the usage guide for examples.
Sure, simply use the POS attribute in the pattern:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
matcher.add('AdjectiveAndNoun', [[{'POS': 'ADJ'}, {'POS': 'NOUN'}]])

doc = nlp(u'what are the main issues')
matches = matcher(doc)  # list of (match_id, start, end) triples
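The same idea covers the ordering check from the question: a pattern with a wildcard `{'OP': '*'}` token between a NOUN and a VERB, where "is there a match at all?" is the true/false answer. A sketch under some assumptions: the sentence, pattern name, and hand-assigned tags below are illustrative, using a blank pipeline so no model download is needed (in real use the tags come from a loaded model).

```python
import spacy
from spacy.matcher import Matcher

# Blank pipeline: tokenizes but does not tag, so we assign coarse POS
# tags by hand purely to demonstrate the Matcher logic.
nlp = spacy.blank('en')
doc = nlp('the issues persist')
for token, pos in zip(doc, ['DET', 'NOUN', 'VERB']):
    token.pos_ = pos

matcher = Matcher(nlp.vocab)
# a NOUN, then any number of tokens, then a VERB
matcher.add('NounThenVerb', [[{'POS': 'NOUN'}, {'OP': '*'}, {'POS': 'VERB'}]])

print(bool(matcher(doc)))  # True: "issues" (NOUN) precedes "persist" (VERB)
```

Casting the match list to `bool` gives exactly the boolean result the question asks for, without any string concatenation.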
Eyal Shulman's answer was helpful, but it makes you hard-code a match pattern rather than use a true regular expression.
I wanted to use regular expressions, so I made my own solution:
import re
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'what are the main issues')

## one or more verbs, optionally surrounded by adverbs and particles
pattern = r'(<VERB>)*(<ADV>)*(<PART>)*(<VERB>)+(<PART>)*'

## create a string with the pos of the sentence
posString = ""
for w in doc:
    posString += "<" + w.pos_ + ">"

lstVerb = []
for m in re.compile(pattern).finditer(posString):
    ## each m is a verb phrase match, starting at character offset m.start()
    ## count the "<" in m.group() to find how many tokens we want
    numTokensInGroup = m.group().count('<')
    ## then find the number of tokens that came before that group
    numTokensBeforeGroup = posString[:m.start()].count('<')
    verbPhrase = doc[numTokensBeforeGroup:numTokensBeforeGroup + numTokensInGroup]
    lstVerb.append(verbPhrase)
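The index arithmetic in that loop (character offsets mapped back to token offsets by counting `<`) can be checked without a model. A minimal sketch, assuming a hypothetical sentence and hand-written tag string:

```python
import re

pattern = r'(<VERB>)*(<ADV>)*(<PART>)*(<VERB>)+(<PART>)*'
# hand-written POS string for "I have been quickly running"
posString = '<PRON><VERB><VERB><ADV><VERB>'
tokens = ['I', 'have', 'been', 'quickly', 'running']

phrases = []
for m in re.compile(pattern).finditer(posString):
    n_in = m.group().count('<')                  # tokens inside the match
    n_before = posString[:m.start()].count('<')  # tokens before the match
    phrases.append(tokens[n_before:n_before + n_in])

print(phrases)  # [['have', 'been', 'quickly', 'running']]
```

Because every tag contributes exactly one `<`, counting brackets recovers token positions from character positions, which is what lets the regex match be sliced back out of the original token sequence.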