
Extracting specific leaf value from nltk tree structure with Python

Tags: python, tree, nltk

I have some questions about NLTK's tree functions. I am trying to extract a certain word from the tree structure like the one given below.

from nltk import Tree

test = Tree.parse('(ROOT(SBARQ(WHADVP(WRB How))(SQ(VBP do)(NP (PRP you))(VP(VB ask)(NP(DT a)(JJ total)(NN stranger))(PRT (RP out))(PP (IN on)(NP (DT a)(NN date)))))))')

print "Input tree: ", test
print test.leaves()

Input tree:  (ROOT
  (SBARQ
    (WHADVP (WRB How))
    (SQ
      (VBP do)
      (NP (PRP you))
      (VP
        (VB ask)
        (NP (DT a) (JJ total) (NN stranger))
        (PRT (RP out))
        (PP (IN on) (NP (DT a) (NN date)))))))

['How', 'do', 'you', 'ask', 'a', 'total', 'stranger', 'out', 'on', 'a', 'date']

I can get a list of all the words using the leaves() function. Is there a way to get a specific leaf only? For example, I would like to get the first and last noun from the NP phrases only. The answer would be 'stranger' for the first noun and 'date' for the last noun.

Cryssie asked May 06 '13 21:05



1 Answer

Although noun phrases can be nested inside other types of phrases, I believe most grammars always have nouns in noun phrases. So your question can probably be rephrased as: How do you find the first and last nouns?

You can get all (word, POS tag) tuples with pos() and filter them like this:

>>> [word for word,pos in test.pos() if pos=='NN']
['stranger', 'date']

In this case there are only two nouns, so you're done. If there were more, you would index the list at [0] and [-1] to get the first and last.
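For readers on NLTK 3 and Python 3, the same filter can be sketched as follows. This assumes Tree.fromstring, which replaced Tree.parse in NLTK 3; the names tree, nouns, first_noun and last_noun are only illustrative and not part of the original question.

```python
from nltk.tree import Tree

# Same bracketed parse as in the question; Tree.fromstring is the NLTK 3 parser.
tree = Tree.fromstring(
    '(ROOT(SBARQ(WHADVP(WRB How))(SQ(VBP do)(NP (PRP you))'
    '(VP(VB ask)(NP(DT a)(JJ total)(NN stranger))(PRT (RP out))'
    '(PP (IN on)(NP (DT a)(NN date)))))))'
)

# pos() yields (word, tag) pairs in sentence order; keep only the 'NN' words.
nouns = [word for word, tag in tree.pos() if tag == 'NN']

# Guard against a sentence with no nouns before indexing (illustrative choice).
first_noun = nouns[0] if nouns else None
last_noun = nouns[-1] if nouns else None

print(nouns)                  # ['stranger', 'date']
print(first_noun, last_noun)  # stranger date
```

Note that this matches the tag 'NN' exactly, so plural or proper nouns tagged 'NNS' or 'NNP' would need tag.startswith('NN') instead.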


If you were looking for another POS that can appear in different kinds of phrases but you only wanted its occurrences inside a particular one, or if you had a strange grammar that allowed nouns outside of NPs, you can do the following.

First, find the 'NP' subtrees:

>>> NPs = list(test.subtrees(filter=lambda x: x.node=='NP'))
>>> NPs
[Tree('NP', [Tree('PRP', ['you'])]), Tree('NP', [Tree('DT', ['a']), Tree('JJ', ['total']), Tree('NN', ['stranger'])]), Tree('NP', [Tree('DT', ['a']), Tree('NN', ['date'])])]

Narrowing down further, we can search each of these subtrees for 'NN' nodes:

>>> NNs_inside_NPs = map(lambda x: list(x.subtrees(filter=lambda x: x.node=='NN')), NPs)
>>> NNs_inside_NPs
[[], [Tree('NN', ['stranger'])], [Tree('NN', ['date'])]]

This is a list of lists holding the 'NN' subtrees found inside each 'NP' phrase. In this case each phrase happens to contain zero or one noun.

Now we just need to go through the per-'NP' lists and get the leaves of the individual nouns (which really means we just want to access the 'stranger' part of Tree('NN', ['stranger'])).

>>> [noun.leaves()[0] for nouns in NNs_inside_NPs for noun in nouns]
['stranger', 'date']
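
The subtree steps above can also be put together in one piece for NLTK 3, where the node attribute was replaced by the label() method and Tree.parse by Tree.fromstring. This is a minimal sketch; the names tree, noun_phrases and nouns_in_nps are illustrative.

```python
from nltk.tree import Tree

tree = Tree.fromstring(
    '(ROOT(SBARQ(WHADVP(WRB How))(SQ(VBP do)(NP (PRP you))'
    '(VP(VB ask)(NP(DT a)(JJ total)(NN stranger))(PRT (RP out))'
    '(PP (IN on)(NP (DT a)(NN date)))))))'
)

# Step 1: collect every subtree labelled 'NP'.
noun_phrases = list(tree.subtrees(filter=lambda t: t.label() == 'NP'))

# Step 2: inside each NP, find the 'NN' subtrees and take their words.
nouns_in_nps = [
    leaf
    for np in noun_phrases
    for nn in np.subtrees(filter=lambda t: t.label() == 'NN')
    for leaf in nn.leaves()
]

print(len(noun_phrases))  # 3
print(nouns_in_nps)       # ['stranger', 'date']
```

One caveat with this sketch: subtrees() also visits nested subtrees, so if one NP were nested inside another, a noun in the inner NP would be listed twice. That does not happen in this sentence.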
Jared answered Sep 19 '22 01:09