Matching words with NLTK's chunk parser

Tags: python, nltk

NLTK's chunk parser's regular expressions can match POS tags, but can they also match specific words?
Suppose I want to chunk any structure consisting of a noun followed by the verb "left" (call this pattern L). For example, the sentence "the\DT dog\NN left\VB" should be chunked as (S (DT the) (L (NN dog) (VB left))), but the sentence "the\DT dog\NN slept\VB" shouldn't be chunked at all.

I haven't been able to find any documentation on the chunking regex syntax, and all examples I've seen only match POS tags.

CromTheDestroyer asked Nov 20 '11 21:11


2 Answers

I had a similar problem. After realizing that the regex pattern only examines tags, never the words themselves, I changed the tag on the piece I was interested in.

For example, I was trying to match a product name and version. A chunk rule like <NNP>+<CD> worked for "Internet Explorer 8.0" but failed on "Internet Explorer 8.0 SP2", because the tagger labeled SP2 as an NNP and the rule ends at the CD.

Perhaps I could have trained a POS tagger, but I decided instead to just change the tag on SP2 to SP. A chunk rule like <NNP>+<CD><SP>* then matches both examples.
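
A minimal sketch of that retagging step, not the answerer's actual code. The tokens are tagged by hand so no tagger model is needed, and the tags themselves are assumed for illustration. The helper name retag_service_packs, the SP\d+ word test, and the chunk label PRODUCT are all hypothetical. The rule is written in NLTK's angle-bracket tag-pattern syntax.

```python
import re

import nltk

# Hand-tagged tokens (assumed tags): the tagger labels "SP2" as NNP.
tagged = [("Internet", "NNP"), ("Explorer", "NNP"), ("8.0", "CD"), ("SP2", "NNP")]


def retag_service_packs(tagged_tokens):
    """Give tokens such as SP1 or SP2 the custom tag SP (hypothetical helper)."""
    return [
        (word, "SP") if re.fullmatch(r"SP\d+", word) else (word, tag)
        for word, tag in tagged_tokens
    ]


# One or more proper nouns, a number, then an optional run of SP tokens.
parser = nltk.RegexpParser("PRODUCT: {<NNP>+<CD><SP>*}")
tree = parser.parse(retag_service_packs(tagged))

# Collect the words of every PRODUCT chunk.
product_chunks = [
    [word for word, _ in subtree.leaves()]
    for subtree in tree.subtrees(lambda t: t.label() == "PRODUCT")
]
print(product_chunks)
```

Without the retagging step, the same rule would stop the chunk at "8.0", since the trailing NNP does not fit the pattern.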

Spaceghost answered Oct 03 '22 23:10


The easiest way is to retag the words: replace the POS tag of each word you want to match with a custom tag, then use that custom tag in the chunk grammar.

Example:

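import nltk

# word_tokenize and pos_tag need the NLTK tokenizer and tagger data (see nltk.download)
pos_tags = nltk.pos_tag(nltk.word_tokenize('Dog slept all night. Dog left at 8pm.'))

# replace the POS tag of each word we want to match by name with a custom tag
pos_tags = [
    (w, 'LEFT') if w == 'left' else (w, t)
    for w, t in pos_tags
]

# chunk any noun that is immediately followed by the retagged word "left"
grammar = "CHUNK: {<NN.*> <LEFT>}"
tree = nltk.RegexpParser(grammar).parse(pos_tags)
print(tree)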
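
Applied to the two sentences from the question, the same idea might look like the sketch below. The input is tagged by hand (VBD is an assumed tag for both verbs) so the result does not depend on a tagger model. The helpers retag_words, chunk, and labels are hypothetical names introduced here.

```python
import nltk


def retag_words(tagged_tokens, word_tags):
    """Swap the POS tag of selected words for a custom tag.

    word_tags maps a lowercase word to its replacement tag (hypothetical helper).
    """
    return [(word, word_tags.get(word.lower(), tag)) for word, tag in tagged_tokens]


# Pattern L from the question: a noun followed by the word "left".
parser = nltk.RegexpParser("L: {<NN.*><LEFT>}")


def chunk(tagged_tokens):
    """Retag "left", then run the chunk parser."""
    return parser.parse(retag_words(tagged_tokens, {"left": "LEFT"}))


def labels(tree):
    """Labels of the tree and all of its subtrees, root first."""
    return [subtree.label() for subtree in tree.subtrees()]


left_tree = chunk([("the", "DT"), ("dog", "NN"), ("left", "VBD")])
slept_tree = chunk([("the", "DT"), ("dog", "NN"), ("slept", "VBD")])

print(labels(left_tree))   # an L chunk appears under the root S
print(labels(slept_tree))  # only the root S, nothing was chunked
```

One cost of this approach is that the retagged word loses its original POS tag inside the tree, so keep the original tagged list around if later steps need it.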
Pratyush answered Oct 03 '22 22:10