Absolute position of leaves in NLTK tree

I am trying to find the span (start index, end index) of each noun phrase in a given sentence. The following code extracts the noun phrases:

import nltk

# `a` holds the input sentence (the example is given below)
sent = nltk.word_tokenize(a)
sent_pos = nltk.pos_tag(sent)
grammar = r"""
    NBAR:
        {<NN.*|JJ>*<NN.*>}  # Nouns and Adjectives, terminated with Nouns

    NP:
        {<NBAR>}
        {<NBAR><IN><NBAR>}  # Above, connected with in/of/etc...
    VP:
        {<VBD><PP>?}
        {<VBZ><PP>?}
        {<VB><PP>?}
        {<VBN><PP>?}
        {<VBG><PP>?}
        {<VBP><PP>?}
"""

cp = nltk.RegexpParser(grammar)
result = cp.parse(sent_pos)
nounPhrases = []
for subtree in result.subtrees(filter=lambda t: t.label() == 'NP'):
    # Each leaf is a (word, tag) pair; join the words into one string
    nounPhrases.append(' '.join(word for word, tag in subtree.leaves()))

For a = "The American Civil War, also known as the War between the States or simply the Civil War, was a civil war fought from 1861 to 1865 in the United States after several Southern slave states declared their secession and formed the Confederate States of America.", the extracted noun phrases are:

['American Civil War', 'War', 'States', 'Civil War', 'civil war fought', 'United States', 'several Southern', 'states', 'secession', 'Confederate States', 'America'].

Now I need to find the span (start and end token positions) of each noun phrase. For example, the spans of the noun phrases above would be:

[(1, 3), (9, 9), (12, 12), (16, 17), (21, 23), ...].

I'm fairly new to NLTK and I've looked at http://www.nltk.org/_modules/nltk/tree.html. I tried Tree.treepositions(), but I couldn't work out how to get absolute positions from the indices it returns. Any help would be greatly appreciated. Thank you!

Corleone asked Apr 25 '16 02:04


1 Answer

NLTK has no built-in function that returns the offsets of strings/tokens, as noted in https://github.com/nltk/nltk/issues/1214

However, you can use the ngram searcher that the RIBES score implementation uses, from https://github.com/nltk/nltk/blob/develop/nltk/translate/ribes_score.py#L123

>>> from nltk import word_tokenize
>>> from nltk.translate.ribes_score import position_of_ngram
>>> s = word_tokenize("The American Civil War, also known as the War between the States or simply the Civil War, was a civil war fought from 1861 to 1865 in the United States after several Southern slave states declared their secession and formed the Confederate States of America.")
>>> position_of_ngram(tuple('American Civil War'.split()), s)
1
>>> position_of_ngram(tuple('Confederate States of America'.split()), s)
43

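(It returns the starting token position of the query ngram. The end position is then the start plus the ngram length minus one, e.g. 1 + 3 - 1 = 3 for 'American Civil War'.)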
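One caveat: as far as I can tell, position_of_ngram stops at the first match, so a phrase that occurs more than once in the sentence (such as 'War' or 'Civil War' here) would be reported at its first occurrence. A minimal pure-Python sketch that works around this, assuming the noun phrases are listed in sentence order (as the loop in the question produces them), is to resume each search right after the previous match. The helper name phrase_spans is hypothetical and not part of NLTK, and the token list below is hand-written to mirror the first part of the tokenized sentence:

```python
def phrase_spans(tokens, phrases):
    """Return one (start, end) token span per phrase, end inclusive.

    Assumption: phrases are listed in the order they occur in the
    sentence, so each search resumes right after the previous match.
    """
    tokens = list(tokens)  # slices must be lists to compare with split()
    spans = []
    cursor = 0  # first token index still eligible for a match
    for phrase in phrases:
        words = phrase.split()
        n = len(words)
        for start in range(cursor, len(tokens) - n + 1):
            if tokens[start:start + n] == words:
                spans.append((start, start + n - 1))
                cursor = start + n
                break
        else:
            spans.append(None)  # phrase not found after the cursor
    return spans


# Hand-written tokens for the first part of the example sentence.
tokens = ["The", "American", "Civil", "War", ",", "also", "known", "as",
          "the", "War", "between", "the", "States", "or", "simply", "the",
          "Civil", "War", ",", "was", "a", "civil", "war", "fought"]
phrases = ["American Civil War", "War", "States", "Civil War",
           "civil war fought"]

print(phrase_spans(tokens, phrases))
# [(1, 3), (9, 9), (12, 12), (16, 17), (21, 23)]
```

The end indices are inclusive, which matches the span format asked for in the question. A phrase that cannot be found after the current cursor yields None instead of a span.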

alvas answered Sep 28 '22 04:09