Finding the head of a noun phrase in NLTK and Stanford parse output according to the rules for finding the head of an NP

Generally, the head of a noun phrase is the rightmost noun of the NP. In the tree shown below, tree is the head of the parent NP.

            ROOT                             
             |                                
             S                               
          ___|________________________        
         NP                           |      
      ___|_____________               |       
     |                 PP             VP     
     |             ____|____      ____|___    
     NP           |         NP   |       PRT 
  ___|_______     |         |    |        |   
 DT  JJ  NN  NN   IN       NNP  VBD       RP 
 |   |   |   |    |         |    |        |   
The old oak tree from     India fell     down

Out[40]: Tree('S', [Tree('NP', [Tree('NP', [Tree('DT', ['The']), Tree('JJ', ['old']), Tree('NN', ['oak']), Tree('NN', ['tree'])]), Tree('PP', [Tree('IN', ['from']), Tree('NP', [Tree('NNP', ['India'])])])]), Tree('VP', [Tree('VBD', ['fell']), Tree('PRT', [Tree('RP', ['down'])])])])

The following code, based on a Java implementation, uses a simplistic rule to find the head of the NP, but I need it to be based on the head-finding rules:

from nltk.tree import Tree

parsestr = '(ROOT (S (NP (NP (DT The) (JJ old) (NN oak) (NN tree)) (PP (IN from) (NP (NNP India)))) (VP (VBD fell) (PRT (RP down)))))'

def traverse(t):
    # Leaves are plain strings and have no label(); stop descending there.
    try:
        t.label()
    except AttributeError:
        return
    if t.label() == 'NP':
        print('NP:' + str(t.leaves()))
        # Simplistic rule: take the last leaf of the NP as its head.
        print('NPhead:' + str(t.leaves()[-1]))
    # Visit the children of every node, NP or not.
    for child in t:
        traverse(child)


tree = Tree.fromstring(parsestr)
traverse(tree)

The above code gives this output:

NP:['The', 'old', 'oak', 'tree', 'from', 'India']
NPhead:India
NP:['The', 'old', 'oak', 'tree']
NPhead:tree
NP:['India']
NPhead:India

Although this now gives the correct output for the given sentence, I need to add a condition so that only the rightmost noun is extracted as the head. Currently the code does not check whether the last leaf is a noun (NN):

print('NPhead:' + str(t.leaves()[-1]))

So I need something like the following in the NP head condition of the code above:

t.leaves().getrightmostnoun() 

Michael Collins' dissertation (Appendix A) includes head-finding rules for the Penn Treebank, so the head is not necessarily the rightmost noun. The condition above should therefore account for such cases.

For the following example as given in one of the answers:

(NP (NP the person) that gave (NP the talk)) went home

The head noun of the subject is person, but the last leaf node of the NP "the person that gave the talk" is talk.

asked Sep 18 '15 14:09 by stackit


1 Answer

NLTK has a built-in method for converting a bracketed string into a Tree object, Tree.fromstring (http://www.nltk.org/_modules/nltk/tree.html); see https://github.com/nltk/nltk/blob/develop/nltk/tree.py#L541.

>>> from nltk.tree import Tree
>>> parsestr='(ROOT (S (NP (NP (DT The) (JJ old) (NN oak) (NN tree)) (PP (IN from) (NP (NNP India)))) (VP (VBD fell) (PRT (RP down)))))'
>>> for i in Tree.fromstring(parsestr).subtrees():
...     if i.label() == 'NP':
...         print(i)
... 
(NP
  (NP (DT The) (JJ old) (NN oak) (NN tree))
  (PP (IN from) (NP (NNP India))))
(NP (DT The) (JJ old) (NN oak) (NN tree))
(NP (NNP India))


>>> for i in Tree.fromstring(parsestr).subtrees():
...     if i.label() == 'NP':
...         print(i.leaves())
... 
['The', 'old', 'oak', 'tree', 'from', 'India']
['The', 'old', 'oak', 'tree']
['India']
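
To restrict the candidate to nouns, as the question asks, one option is to walk the POS-tagged leaves from the right. This is only a minimal sketch: `rightmost_noun` is a hypothetical helper name (not part of NLTK), and it assumes Penn Treebank tags, where every noun tag starts with `NN`.

```python
from nltk.tree import Tree


def rightmost_noun(np):
    """Return the last leaf of `np` whose POS tag starts with 'NN', or None.

    Hypothetical helper; assumes Penn Treebank noun tags (NN, NNS, NNP, NNPS).
    """
    # Tree.pos() returns (word, tag) pairs in sentence order.
    for word, tag in reversed(np.pos()):
        if tag.startswith('NN'):
            return word
    return None


parsestr = '(ROOT (S (NP (NP (DT The) (JJ old) (NN oak) (NN tree)) (PP (IN from) (NP (NNP India)))) (VP (VBD fell) (PRT (RP down)))))'
tree = Tree.fromstring(parsestr)

# Collect (leaves, candidate head) for every NP subtree.
heads = [(np.leaves(), rightmost_noun(np))
         for np in tree.subtrees(lambda t: t.label() == 'NP')]
for leaves, head in heads:
    print(leaves, head)
```

For the outermost NP this picks India, because the search looks at all leaves, including those inside the PP. The rightmost-noun heuristic is therefore only a rough approximation.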

Note that the rightmost noun is not always the head noun of an NP, e.g.

>>> s = '(ROOT (S (NP (NN Carnac) (DT the) (NN Magnificent)) (VP (VBD gave) (NP (DT a) (NN talk)))))'
>>> Tree.fromstring(s)
Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NN', ['Carnac']), Tree('DT', ['the']), Tree('NN', ['Magnificent'])]), Tree('VP', [Tree('VBD', ['gave']), Tree('NP', [Tree('DT', ['a']), Tree('NN', ['talk'])])])])])
>>> for i in Tree.fromstring(s).subtrees():
...     if i.label() == 'NP':
...         print(i.leaves()[-1])
... 
Magnificent
talk

Arguably, Magnificent can still be the head noun. Another example is when the NP includes a relative clause:

(NP (NP the person) that gave (NP the talk)) went home

The head noun of the subject is person, but the last leaf node of the NP "the person that gave the talk" is talk.
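
A rule-based alternative to taking the last leaf is to choose a head child by priority, in the spirit of the NP rules in Collins' Appendix A. The sketch below encodes those rules as I recall them, so check the label lists against the dissertation before relying on it. The function names and the fully bracketed relative-clause tree are my own assumptions, and non-NP phrasal heads simply fall back to their last leaf.

```python
from nltk.tree import Tree

# (search right-to-left?, labels), tried in order. Assumed from memory of
# Collins' NP rules; verify against the original before relying on them.
NP_RULES = [
    (True, {'NN', 'NNP', 'NNPS', 'NNS', 'NX', 'POS', 'JJR'}),
    (False, {'NP'}),
    (True, {'$', 'ADJP', 'PRN'}),
    (True, {'CD'}),
    (True, {'JJ', 'JJS', 'RB', 'QP'}),
]


def np_head_child(np):
    """Return the child subtree chosen as head of `np`, or None if it has none."""
    kids = [c for c in np if isinstance(c, Tree)]
    if not kids:
        return None
    # A trailing possessive marker is taken as the head.
    if kids[-1].label() == 'POS':
        return kids[-1]
    for right_to_left, labels in NP_RULES:
        for kid in (reversed(kids) if right_to_left else kids):
            if kid.label() in labels:
                return kid
    # No rule matched: default to the last child.
    return kids[-1]


def np_head_word(np):
    """Follow head children down through nested NPs and return a head word."""
    child = np_head_child(np)
    if child is None:
        return np.leaves()[-1]
    if child.label() == 'NP':
        return np_head_word(child)
    # Preterminal: its only leaf. Other phrases: crude last-leaf fallback.
    return child.leaves()[-1]


oak = Tree.fromstring(
    '(NP (NP (DT The) (JJ old) (NN oak) (NN tree)) (PP (IN from) (NP (NNP India))))')
# Assumed bracketing of "the person that gave the talk".
person = Tree.fromstring(
    '(NP (NP (DT the) (NN person)) '
    '(SBAR (WHNP (WDT that)) (S (VP (VBD gave) (NP (DT the) (NN talk))))))')

print(np_head_word(oak))
print(np_head_word(person))
```

Under these assumptions the outer NP of the first tree resolves to tree rather than India, and the relative-clause NP resolves to person, because the search inspects only direct children and descends through the leftmost NP child.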

answered Sep 25 '22 16:09 by alvas