I am trying to find the span (start index, end index) of each noun phrase in a given sentence. The following code extracts the noun phrases:
import nltk

# `a` is the input sentence (shown further down)
sent = nltk.word_tokenize(a)
sent_pos = nltk.pos_tag(sent)
grammar = r"""
NBAR:
{<NN.*|JJ>*<NN.*>} # Nouns and Adjectives, terminated with Nouns
NP:
{<NBAR>}
{<NBAR><IN><NBAR>} # Above, connected with in/of/etc...
VP:
{<VBD><PP>?}
{<VBZ><PP>?}
{<VB><PP>?}
{<VBN><PP>?}
{<VBG><PP>?}
{<VBP><PP>?}
"""
cp = nltk.RegexpParser(grammar)
result = cp.parse(sent_pos)
nounPhrases = []
for subtree in result.subtrees(filter=lambda t: t.label() == 'NP'):
    # Each leaf is a (word, tag) pair; join the words of the chunk
    np = ''
    for x in subtree.leaves():
        np = np + ' ' + x[0]
    nounPhrases.append(np.strip())
For a = "The American Civil War, also known as the War between the States or simply the Civil War, was a civil war fought from 1861 to 1865 in the United States after several Southern slave states declared their secession and formed the Confederate States of America.", the noun phrases extracted are
['American Civil War', 'War', 'States', 'Civil War', 'civil war fought', 'United States', 'several Southern', 'states', 'secession', 'Confederate States', 'America'].
Now I need to find the span (start and end token index) of each noun phrase. For example, the spans of the noun phrases above would be
[(1,3), (9,9), (12, 12), (16, 17), (21, 23), ....].
I'm fairly new to NLTK and I've looked into http://www.nltk.org/_modules/nltk/tree.html. I tried to use Tree.treepositions() but I couldn't manage to extract absolute positions using these indices. Any help would be greatly appreciated. Thank You!
There isn't a built-in function that returns the offsets of strings/tokens, as discussed in https://github.com/nltk/nltk/issues/1214
However, you can use the n-gram searcher used by the RIBES score implementation: https://github.com/nltk/nltk/blob/develop/nltk/translate/ribes_score.py#L123
>>> from nltk import word_tokenize
>>> from nltk.translate.ribes_score import position_of_ngram
>>> s = word_tokenize("The American Civil War, also known as the War between the States or simply the Civil War, was a civil war fought from 1861 to 1865 in the United States after several Southern slave states declared their secession and formed the Confederate States of America.")
>>> position_of_ngram(tuple('American Civil War'.split()), s)
1
>>> position_of_ngram(tuple('Confederate States of America'.split()), s)
43
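(It returns the starting position of the first occurrence of the query n-gram. The end position is the start plus the n-gram length minus one, so 'American Civil War' spans (1, 3).)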
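Because a first-match search would return the same position for every 'War' or 'States', one way to get a span for each extracted phrase is to search left to right and resume after the previous match. The following is only a rough sketch of that idea: find_spans is a hypothetical helper (not part of NLTK), the regex tokenizer is a stand-in that is assumed to produce the same tokens as word_tokenize for this sentence, and the phrases are assumed to be listed in sentence order, as they are in the question.

```python
import re


def find_spans(tokens, phrases):
    """Return an inclusive (start, end) token span for each phrase.

    Hypothetical helper: phrases must appear in sentence order.
    A phrase that cannot be found yields None.
    """
    spans = []
    cursor = 0  # resume searching after the previous match
    for phrase in phrases:
        words = phrase.split()
        n = len(words)
        start = next(
            (i for i in range(cursor, len(tokens) - n + 1)
             if tokens[i:i + n] == words),
            None,
        )
        if start is None:
            spans.append(None)
            continue
        spans.append((start, start + n - 1))
        cursor = start + n
    return spans


sentence = ("The American Civil War, also known as the War between the States "
            "or simply the Civil War, was a civil war fought from 1861 to 1865 "
            "in the United States after several Southern slave states declared "
            "their secession and formed the Confederate States of America.")

# Stand-in tokenizer (assumption): words and single punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", sentence)

noun_phrases = ['American Civil War', 'War', 'States', 'Civil War',
                'civil war fought', 'United States', 'several Southern',
                'states', 'secession', 'Confederate States', 'America']

spans = find_spans(tokens, noun_phrases)
print(spans)
# [(1, 3), (9, 9), (12, 12), (16, 17), (21, 23), (30, 31), (33, 34),
#  (36, 36), (39, 39), (43, 44), (46, 46)]
```

The first five spans agree with the ones given in the question. If the phrases are not in sentence order, the cursor would need to be dropped and repeated phrases disambiguated some other way.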