The regular expressions in NLTK's chunk parser can match POS tags, but can they also match specific words?
Suppose I want to chunk any structure consisting of a noun followed by the verb "left" (call this pattern L). For example, the sentence "the/DT dog/NN left/VB" should be chunked as
(S (DT the) (L (NN dog) (VB left))), but the sentence "the/DT dog/NN slept/VB" shouldn't be chunked at all.
I haven't been able to find any documentation on the chunking regex syntax, and all examples I've seen only match POS tags.
I had a similar problem, and after realizing that the regex pattern only examines tags, I changed the tag on the piece I was interested in.
For example, I was trying to match a product name and version. A chunk rule like <NNP>+<CD> worked for "Internet Explorer 8.0" but failed on "Internet Explorer 8.0 SP2", because SP2 was tagged as NNP.
Perhaps I could have trained a POS tagger, but I decided instead to just change the tag on SP2 to SP. A chunk rule like <NNP>+<CD><SP>* then matches either example.
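A minimal sketch of that retagging idea, in case it helps. The tags here are written by hand rather than produced by a tagger, and the SP\d+ test for service-pack tokens and the PRODUCT label are my own assumptions for illustration, not something from the original setup:

```python
import re
import nltk

# Hand-written tags standing in for tagger output (assumption):
# the tagger is presumed to have labelled "SP2" as NNP.
tagged = [("Internet", "NNP"), ("Explorer", "NNP"), ("8.0", "CD"), ("SP2", "NNP")]

# Give service-pack tokens a custom SP tag so the grammar can see them.
# The SP\d+ pattern is a hypothetical rule chosen for this example.
retagged = [
    (word, "SP") if re.fullmatch(r"SP\d+", word) else (word, tag)
    for word, tag in tagged
]

# PRODUCT is an arbitrary chunk label.
parser = nltk.RegexpParser("PRODUCT: {<NNP>+<CD><SP>*}")
tree = parser.parse(retagged)
print(tree)
```

Because <SP>* is optional, the same rule should still chunk "Internet Explorer 8.0" when no service pack follows.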
The easiest way is to convert the tags of the words. Modify the tag of the word you want to use in the regular expression.
Example:
import nltk

# Requires the NLTK tokenizer and tagger models (install them with nltk.download).
pos_tags = nltk.pos_tag(nltk.word_tokenize('Dog slept all night. Dog left at 8pm.'))

# modify tags for the words we want to use in regular expression
pos_tags = [
    (w, 'LEFT') if w == 'left' else (w, t)
    for w, t in pos_tags
]

# the grammar can now refer to the word through its custom tag
grammar = "CHUNK: {<NN.*> <LEFT>}"
tree = nltk.RegexpParser(grammar).parse(pos_tags)
print(tree)
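Here is a rough sketch of the same trick applied to the two sentences from the question, using hand-tagged input so no tagger models are needed. The helper name chunk_left is made up for this example, and restoring the original tags after chunking is an optional extra step I added, not part of the approach above:

```python
import nltk

def chunk_left(tagged):
    """Chunk a noun followed by the word "left" (hypothetical helper)."""
    # Swap in a custom tag for the target word so the grammar can match it.
    retagged = [(w, "LEFT") if w.lower() == "left" else (w, t) for w, t in tagged]
    tree = nltk.RegexpParser("L: {<NN.*><LEFT>}").parse(retagged)
    # Optional: chunking keeps leaf order, so the original tags can be put back.
    for pos, original in zip(tree.treepositions("leaves"), tagged):
        tree[pos] = original
    return tree

# Tags copied from the question rather than produced by a tagger.
left_tree = chunk_left([("the", "DT"), ("dog", "NN"), ("left", "VB")])
slept_tree = chunk_left([("the", "DT"), ("dog", "NN"), ("slept", "VB")])

print(left_tree)
print(slept_tree)
```

Only the first sentence should end up with an L subtree; the second has no LEFT tag, so the rule never fires and it is left flat.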