Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python and NLTK: How to analyze sentence grammar?

I have this code which should show the syntactic structure of the sentence according to defined grammar. However it is returning an empty []. What am I missing or doing wrong?

import nltk

grammar = nltk.parse_cfg("""
S -> NP VP 
PP -> P NP
NP -> Det N | Det N PP 
VP -> V NP | VP PP
N -> 'Kim' | 'Dana' | 'everyone'
V -> 'arrived' | 'left' |'cheered'
P -> 'or' | 'and'
""")

def main():
    sent = "Kim arrived or Dana left and everyone cheered".split()
    parser = nltk.ChartParser(grammar)
    trees = parser.nbest_parse(sent)
    for tree in trees:
        print tree

if __name__ == '__main__':
    main()
like image 341
Helena Avatar asked Jan 07 '14 22:01

Helena


People also ask

How do you define a grammar in NLTK?

A grammar consists of a start state and a set of productions. The set of terminals and nonterminals is implicitly specified by the productions. If you need efficient key-based access to productions, you can use a subclass to implement it.

How do you parse the sentence?

Traditionally, parsing is done by taking a sentence and breaking it down into different parts of speech. The words are placed into distinct grammatical categories, and then the grammatical relationships between the words are identified, allowing the reader to interpret the sentence.


2 Answers

Let's do some reverse engineering:

>>> import nltk
>>> grammar = nltk.parse_cfg("""
... NP -> Det N | Det N PP
... N -> 'Kim' | 'Dana' | 'everyone'
... """)
>>> sent = "Kim".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[]

Seems like the rules can't recognize even the first work as NP. So let's try injecting NP -> N

>>> import nltk
>>> grammar = nltk.parse_cfg("""
... NP -> Det N | Det N PP | N
... N -> 'Kim' | 'Dana' | 'everyone'
... """)
>>> sent = "Kim".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[Tree('NP', [Tree('N', ['Kim'])])]

So now it's working, let's continue Kim arrived or Dana and:

>>> import nltk
>>> grammar = nltk.parse_cfg("""
... S -> NP VP
... PP -> P NP
... NP -> Det N | Det N PP | N
... VP -> V NP | VP PP
... N -> 'Kim' | 'Dana' | 'everyone'
... V -> 'arrived' | 'left' |'cheered'
... P -> 'or' | 'and'
... """)
>>> sent = "Kim arrived".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[]
>>> 
>>> sent = "Kim arrived or".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[]

Seem like there is no way to get the VP with or without the P, since V requires either an NP after, or it has to go up the tree to be a VP before taking a P, so it's relax the rules and say VP -> V PP instead of VP -> VP PP:

>>> import nltk
>>> grammar = nltk.parse_cfg("""
... S -> NP VP
... PP -> P NP
... NP -> Det N | Det N PP | N
... VP -> V NP | V PP
... N -> 'Kim' | 'Dana' | 'everyone'
... V -> 'arrived' | 'left' |'cheered'
... P -> 'or' | 'and'
... """)
>>> sent = "Kim arrived or Dana".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[Tree('S', [Tree('NP', [Tree('N', ['Kim'])]), Tree('VP', [Tree('V', ['arrived']), Tree('PP', [Tree('P', ['or']), Tree('NP', [Tree('N', ['Dana'])])])])])]

Okay, we are getting closer, but seems like the next word broke the cfg rules again:

>> import nltk
>>> grammar = nltk.parse_cfg("""
... S -> NP VP
... PP -> P NP
... NP -> Det N | Det N PP | N
... VP -> V NP | V PP
... N -> 'Kim' | 'Dana' | 'everyone'
... V -> 'arrived' | 'left' |'cheered'
... P -> 'or' | 'and'
... """)
>>> sent = "Kim arrived or Dana left".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[]
>>> sent = "Kim arrived or Dana left and".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[]
>>> 
>>> sent = "Kim arrived or Dana left and everyone".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[]
>>> 
>>> sent = "Kim arrived or Dana left and everyone cheered".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[]

So I hope the above example shows you that trying to change the rules to incorporate language phenomenon from left to right is hard.

Instead of doing it from left to right, and achieve

[[[[[[[[Kim] arrived] or] Dana] left] and] everyone] cheered]

why don't you try to make more linguistically sound rules to achieve:

  1. [[[Kim arrived] or [Dana left]] and [everyone cheered]]
  2. [[Kim arrived] or [[Dana left] and [everyone cheered]]]

Try this instead:

import nltk
grammar = nltk.parse_cfg("""
S -> CP | VP 
CP -> VP C VP | CP C VP | VP C CP
VP -> NP V 
NP -> 'Kim' | 'Dana' | 'everyone'
V -> 'arrived' | 'left' |'cheered'
C -> 'or' | 'and'
""")

print "======= Kim arrived ========="
sent = "Kim arrived".split()
parser = nltk.ChartParser(grammar)
for t in parser.nbest_parse(sent):
    print t

print "\n======= Kim arrived or Dana left ========="
sent = "Kim arrived or Dana left".split()
parser = nltk.ChartParser(grammar)
for t in parser.nbest_parse(sent):
    print t 

print "\n=== Kim arrived or Dana left and everyone cheered ===="
sent = "Kim arrived or Dana left and everyone cheered".split()
parser = nltk.ChartParser(grammar)
for t in parser.nbest_parse(sent):
    print t

[out]:

======= Kim arrived =========
(S (VP (NP Kim) (V arrived)))

======= Kim arrived or Dana left =========
(S (CP (VP (NP Kim) (V arrived)) (C or) (VP (NP Dana) (V left))))

=== Kim arrived or Dana left and everyone cheered ====
(S
  (CP
    (CP (VP (NP Kim) (V arrived)) (C or) (VP (NP Dana) (V left)))
    (C and)
    (VP (NP everyone) (V cheered))))
(S
  (CP
    (VP (NP Kim) (V arrived))
    (C or)
    (CP
      (VP (NP Dana) (V left))
      (C and)
      (VP (NP everyone) (V cheered)))))

The above solution show how your CFG rules needs to be robust enough to not only capture the full sentence but also part of the sentence too.

like image 184
alvas Avatar answered Nov 15 '22 12:11

alvas


You don't have a Det defined in your grammar, but each NP (and consequently S) has to have one by grammar definition.

Compare with

>>> grammar = nltk.parse_cfg("""
... S -> NP VP
... NP -> Det N | Det N PP
... VP -> V NP | VP PP
... Det -> 'a' | 'the'
... N -> 'Kim' | 'Dana' | 'everyone'
... V -> 'arrived' | 'left' |'cheered'
... """)
>>>
>>> parser = nltk.ChartParser(grammar)
>>> parser.nbest_parse('the Kim left a Dana'.split())
[Tree('S', [Tree('NP', [Tree('Det', ['the']), Tree('N', ['Kim'])]), Tree('VP', [Tree('V', ['left']), Tree('NP', [Tree('Det', ['a']), Tree('N', ['Dana'])])])])]
like image 23
alko Avatar answered Nov 15 '22 10:11

alko