spaCy has two features I'd like to combine: part-of-speech (POS) tagging and rule-based matching. How can I combine them in a neat way?
For example, say the input is a single sentence, and I'd like to verify that it meets some POS ordering condition, e.g. that a verb comes after a noun (something like a noun.*verb regex). The result should be true or false. Is that doable, or is the Matcher limited to patterns like the ones in the examples?
Can rule-based matching patterns include POS rules?
If not, here is my current plan: gather everything into one string and apply a regex:
import spacy

nlp = spacy.load('en_core_web_sm')
#doc = nlp(u'is there any way you can do it')
text = u'what are the main issues'
doc = nlp(text)
concatPos = ''
print(text)
for word in doc:
    print(word.text, word.lemma, word.lemma_, word.tag, word.tag_, word.pos, word.pos_)
    concatPos += word.text + "_" + word.tag_ + "_" + word.pos_ + "-"
print('-----------')
print(concatPos)
print('-----------')
# output string: what_WP_NOUN-are_VBP_VERB-the_DT_DET-main_JJ_ADJ-issues_NNS_NOUN-
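To show that this plan works, here is a minimal, model-free sketch of the regex check: the helper name and the `<POS>` string format are my own, and the tag list stands in for `[token.pos_ for token in doc]` from the code above.

```python
import re

def noun_before_verb(pos_tags):
    """Return True if a NOUN appears somewhere before a VERB.

    pos_tags is a list of coarse POS strings, e.g. the output of
    [token.pos_ for token in nlp(sentence)].
    """
    pos_string = ''.join('<' + p + '>' for p in pos_tags)
    # a NOUN, then any number of tags, then a VERB
    return re.search(r'<NOUN>(<[A-Z]+>)*<VERB>', pos_string) is not None

print(noun_before_verb(['NOUN', 'VERB', 'DET', 'ADJ', 'NOUN']))  # True
print(noun_before_verb(['VERB', 'NOUN']))                        # False
```

Wrapping each tag in angle brackets avoids false matches between tags that are prefixes of one another (e.g. `VERB` inside a longer made-up tag), which a bare concatenation would allow.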
The Matcher lets you find words and phrases using rules describing their token attributes. Rules can refer to token annotations (like the text or part-of-speech tags), as well as lexical attributes like `Token.is_punct`. Applying the matcher to a Doc gives you access to the matched tokens in context.
The PhraseMatcher lets you efficiently match large terminology lists. While the Matcher lets you match sequences based on lists of token descriptions, the PhraseMatcher accepts match patterns in the form of Doc objects. See the usage guide for examples.
Sure, simply use the POS attribute in the pattern:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
matcher.add('AdjectiveAndNoun', [[{'POS': 'ADJ'}, {'POS': 'NOUN'}]])

doc = nlp(u'what are the main issues')
matches = matcher(doc)  # list of (match_id, start, end) triples
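The same idea covers the ordering check from the question: a pattern with a wildcard `{'OP': '*'}` token between a NOUN and a VERB, where "is there a match at all?" is the true/false answer. A sketch under some assumptions: the sentence, pattern name, and hand-assigned tags below are illustrative, using a blank pipeline so no model download is needed (in real use the tags come from a loaded model).

```python
import spacy
from spacy.matcher import Matcher

# Blank pipeline: tokenizes but does not tag, so we assign coarse POS
# tags by hand purely to demonstrate the Matcher logic.
nlp = spacy.blank('en')
doc = nlp('the issues persist')
for token, pos in zip(doc, ['DET', 'NOUN', 'VERB']):
    token.pos_ = pos

matcher = Matcher(nlp.vocab)
# a NOUN, then any number of tokens, then a VERB
matcher.add('NounThenVerb', [[{'POS': 'NOUN'}, {'OP': '*'}, {'POS': 'VERB'}]])

print(bool(matcher(doc)))  # True: "issues" (NOUN) precedes "persist" (VERB)
```

Casting the match list to `bool` gives exactly the boolean result the question asks for, without any string concatenation.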
Eyal Shulman's answer was helpful, but it makes you hard-code a match pattern rather than use a true regular expression.
I wanted to use regular expressions, so I made my own solution:
import re
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'what are the main issues')

## one or more verbs, optionally surrounded by adverbs and particles
pattern = r'(<VERB>)*(<ADV>)*(<PART>)*(<VERB>)+(<PART>)*'

## create a string with the pos of the sentence
posString = ""
for w in doc:
    posString += "<" + w.pos_ + ">"

lstVerb = []
for m in re.compile(pattern).finditer(posString):
    ## each m is a verb phrase match, starting at character offset m.start()
    ## count the "<" in m.group() to find how many tokens we want
    numTokensInGroup = m.group().count('<')
    ## then find the number of tokens that came before that group
    numTokensBeforeGroup = posString[:m.start()].count('<')
    verbPhrase = doc[numTokensBeforeGroup:numTokensBeforeGroup + numTokensInGroup]
    lstVerb.append(verbPhrase)
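The index arithmetic in that loop (character offsets mapped back to token offsets by counting `<`) can be checked without a model. A minimal sketch, assuming a hypothetical sentence and hand-written tag string:

```python
import re

pattern = r'(<VERB>)*(<ADV>)*(<PART>)*(<VERB>)+(<PART>)*'
# hand-written POS string for "I have been quickly running"
posString = '<PRON><VERB><VERB><ADV><VERB>'
tokens = ['I', 'have', 'been', 'quickly', 'running']

phrases = []
for m in re.compile(pattern).finditer(posString):
    n_in = m.group().count('<')                  # tokens inside the match
    n_before = posString[:m.start()].count('<')  # tokens before the match
    phrases.append(tokens[n_before:n_before + n_in])

print(phrases)  # [['have', 'been', 'quickly', 'running']]
```

Because every tag contributes exactly one `<`, counting brackets recovers token positions from character positions, which is what lets the regex match be sliced back out of the original token sequence.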