I have some questions about NLTK's tree functions. I am trying to extract a certain word from a tree structure like the one given below.
from nltk import Tree  # NLTK 3: Tree.fromstring replaces the older Tree.parse
test = Tree.fromstring('(ROOT(SBARQ(WHADVP(WRB How))(SQ(VBP do)(NP (PRP you))(VP(VB ask)(NP(DT a)(JJ total)(NN stranger))(PRT (RP out))(PP (IN on)(NP (DT a)(NN date)))))))')
print("Input tree: ", test)
print(test.leaves())
Input tree:  (ROOT
  (SBARQ
    (WHADVP (WRB How))
    (SQ
      (VBP do)
      (NP (PRP you))
      (VP
        (VB ask)
        (NP (DT a) (JJ total) (NN stranger))
        (PRT (RP out))
        (PP (IN on) (NP (DT a) (NN date)))))))
['How', 'do', 'you', 'ask', 'a', 'total', 'stranger', 'out', 'on', 'a', 'date']
I can get a list of all the words using the leaves() function. Is there a way to get only a specific leaf? For example, I would like to get the first/last noun from the NP phrases only. The answer would be 'stranger' for the first noun and 'date' for the last noun.
Although noun phrases can be nested inside other types of phrases, I believe most grammars only place nouns inside noun phrases. So your question can probably be rephrased as: how do you find the first and last nouns?
You can simply get all tuples of words and POS tags and filter like this:
>>> [word for word,pos in test.pos() if pos=='NN']
['stranger', 'date']
In this case there are only two, so you're done. If you had more nouns, you would just index the list at [0] and [-1].
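The indexing step can be sketched as a small helper. This is only an illustration: it assumes NLTK 3 on Python 3 (Tree.fromstring), and the name first_and_last is made up here. It returns None when no word carries the requested tag, since indexing an empty list would raise an IndexError.

```python
from nltk import Tree

def first_and_last(tree, tag="NN"):
    """Return (first, last) word tagged `tag`, or None if there is none.

    Hypothetical helper for illustration; not part of NLTK.
    """
    words = [word for word, pos in tree.pos() if pos == tag]
    if not words:
        return None
    return words[0], words[-1]

# Same sentence as in the question, split across lines for readability.
test = Tree.fromstring(
    "(ROOT(SBARQ(WHADVP(WRB How))(SQ(VBP do)(NP (PRP you))"
    "(VP(VB ask)(NP(DT a)(JJ total)(NN stranger))(PRT (RP out))"
    "(PP (IN on)(NP (DT a)(NN date)))))))"
)
print(first_and_last(test))
```

Note that an exact match on 'NN' skips plural and proper nouns, which the Penn Treebank tag set labels NNS, NNP and NNPS. Testing pos.startswith('NN') instead would include them.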
If you were looking for another POS that can appear in several kinds of phrases but only wanted its occurrences inside a particular one, or if you had a strange grammar that allowed nouns outside of NPs, you can do the following.
You can find subtrees of 'NP' by doing:
>>> # NLTK 3: label() replaces the old .node attribute
>>> NPs = list(test.subtrees(filter=lambda x: x.label() == 'NP'))
>>> NPs
[Tree('NP', [Tree('PRP', ['you'])]), Tree('NP', [Tree('DT', ['a']), Tree('JJ', ['total']), Tree('NN', ['stranger'])]), Tree('NP', [Tree('DT', ['a']), Tree('NN', ['date'])])]
Continuing to narrow down the subtrees, we can use this result to look for 'NN' words:
>>> NNs_inside_NPs = [list(np_tree.subtrees(filter=lambda x: x.label() == 'NN')) for np_tree in NPs]
>>> NNs_inside_NPs
[[], [Tree('NN', ['stranger'])], [Tree('NN', ['date'])]]
So this is a list of lists of all the 'NN's inside each of the 'NP' phrases. In this case there happens to be only zero or one noun in each phrase.
Now we just need to go through the 'NP's and get all the leaves of the individual nouns (which really means we just want to access the 'stranger' part of Tree('NN', ['stranger'])).
>>> [noun.leaves()[0] for nouns in NNs_inside_NPs for noun in nouns]
['stranger', 'date']
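Putting the steps together, the whole lookup might be wrapped in one function. This is a rough sketch under the same assumptions (NLTK 3, Python 3); words_in_phrases and its parameter names are hypothetical, chosen only for this example.

```python
from nltk import Tree

def words_in_phrases(tree, phrase="NP", tag="NN"):
    """Collect words tagged `tag` that occur inside any `phrase` subtree.

    Hypothetical helper for illustration; not part of NLTK.
    """
    found = []
    # Outer loop: every subtree labelled with the phrase type (e.g. NP).
    for phrase_tree in tree.subtrees(filter=lambda t: t.label() == phrase):
        # Inner loop: every preterminal with the wanted POS tag inside it.
        for tagged in phrase_tree.subtrees(filter=lambda t: t.label() == tag):
            # A preterminal such as Tree('NN', ['stranger']) has one leaf.
            found.append(tagged.leaves()[0])
    return found

test = Tree.fromstring(
    "(ROOT(SBARQ(WHADVP(WRB How))(SQ(VBP do)(NP (PRP you))"
    "(VP(VB ask)(NP(DT a)(JJ total)(NN stranger))(PRT (RP out))"
    "(PP (IN on)(NP (DT a)(NN date)))))))"
)
print(words_in_phrases(test))
```

One caveat with this approach: subtrees() also visits phrases nested inside other phrases, so a noun sitting inside an NP that is itself inside a larger NP is reported once per enclosing NP. The example sentence has no nested NPs, so it is not affected.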