Break/Decompose complex and compound sentences in nltk

Tags: python, nlp, nltk

Is there a way to decompose complex sentences into simple sentences in nltk or other natural language processing libraries?

For example:

The park is so wonderful when the sun is setting and a cool breeze is blowing ==> The sun is setting. A cool breeze is blowing. The park is so wonderful.

Sharmila asked Feb 03 '23 03:02


1 Answer

This is much more complicated than it seems, so you're unlikely to find a perfectly clean method.

However, using the English parser in OpenNLP, I can take your example sentence and get the following grammar tree:

  (S
    (NP (DT The) (NN park))
    (VP
      (VBZ is)
      (ADJP (RB so) (JJ wonderful))
      (SBAR
        (WHADVP (WRB when))
        (S
          (S (NP (DT the) (NN sun)) (VP (VBZ is) (VP (VBG setting))))
          (CC and)
          (S
            (NP (DT a) (JJ cool) (NN breeze))
            (VP (VBZ is) (VP (VBG blowing)))))))
    (. .))

From there, you can pick it apart as you like. The main clause is the top-level (NP *)(VP *) with the (SBAR *) section removed. The conjunction inside the (SBAR *) then gives you the other two statements: one for each (S *) on either side of the (CC and).
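As a rough sketch of that splitting step, the snippet below works on the bracketed tree shown above. It is only an illustration under some assumptions: the helper names (read_tree, subtrees, leaves, without, sentence, simple_sentences) are made up for this example, it uses only the standard library, and it handles just the simple pattern in this sentence (an SBAR whose clause is either a single S or several S nodes joined by a conjunction).

```python
import re

# The parse from above, pasted in as a string (assumed to be well-formed).
PARSE = """
(S
  (NP (DT The) (NN park))
  (VP
    (VBZ is)
    (ADJP (RB so) (JJ wonderful))
    (SBAR
      (WHADVP (WRB when))
      (S
        (S (NP (DT the) (NN sun)) (VP (VBZ is) (VP (VBG setting))))
        (CC and)
        (S
          (NP (DT a) (JJ cool) (NN breeze))
          (VP (VBZ is) (VP (VBG blowing)))))))
  (. .))
"""

def read_tree(text):
    """Parse a bracketed tree into nested lists of the form [label, child, ...]."""
    stack = [[]]  # stack[0] collects the finished top-level tree
    for tok in re.findall(r"\(|\)|[^\s()]+", text):
        if tok == "(":
            stack.append([])
        elif tok == ")":
            node = stack.pop()
            stack[-1].append(node)
        else:
            stack[-1].append(tok)
    return stack[0][0]

def subtrees(node, label):
    """Yield every subtree carrying `label`, outermost first."""
    if node[0] == label:
        yield node
    for child in node[1:]:
        if isinstance(child, list):
            yield from subtrees(child, label)

def leaves(node):
    """Return the words under `node`, left to right."""
    words = []
    for child in node[1:]:
        words.extend(leaves(child) if isinstance(child, list) else [child])
    return words

def without(node, label):
    """Copy `node`, dropping every subtree that carries `label`."""
    return [node[0]] + [
        without(c, label) if isinstance(c, list) else c
        for c in node[1:]
        if not (isinstance(c, list) and c[0] == label)
    ]

def sentence(words):
    """Join words into a capitalised sentence, ignoring punctuation tokens."""
    text = " ".join(w for w in words if any(ch.isalnum() for ch in w))
    return text[0].upper() + text[1:] + "."

def simple_sentences(tree):
    """Embedded clauses first, then the main clause with its SBAR removed."""
    result = []
    for sbar in subtrees(tree, "SBAR"):
        for clause in (c for c in sbar[1:] if isinstance(c, list) and c[0] == "S"):
            # A coordinated clause holds several S children; otherwise keep it whole.
            parts = [c for c in clause[1:] if isinstance(c, list) and c[0] == "S"] or [clause]
            result.extend(sentence(leaves(p)) for p in parts)
    result.append(sentence(leaves(without(tree, "SBAR"))))
    return result

print(simple_sentences(read_tree(PARSE)))
# ['The sun is setting.', 'A cool breeze is blowing.', 'The park is so wonderful.']
```

If you already have the parse as an nltk.Tree, then Tree.fromstring, Tree.subtrees and Tree.leaves cover the same ground as the read_tree, subtrees and leaves helpers here. Nested SBARs, relative clauses and clauses that share a subject would all need extra handling.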

Note that the OpenNLP parser is trained on the Penn Treebank corpus. It parsed your example sentence fairly accurately, but the parser isn't perfect and can be wildly wrong on other sentences. Its tags are the Penn Treebank tags; the documentation for that tag set explains them, though it assumes you already have some basic understanding of linguistics and English grammar.

Edit: Btw, this is how I access OpenNLP from Python. This assumes you have the OpenNLP jar and model files in an opennlp-tools-1.4.3 folder.

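import os
from subprocess import Popen, PIPE
import nltk

# Base path: the folder that holds this script, the OpenNLP jars and the models.
BP = os.path.dirname(os.path.abspath(__file__))
# Java classpath (':' is the Unix separator; use ';' on Windows).
CP = "%(BP)s/opennlp-tools-1.4.3.jar:%(BP)s/opennlp-tools-1.4.3/lib/maxent-2.5.2.jar:%(BP)s/opennlp-tools-1.4.3/lib/jwnl-1.3.3.jar:%(BP)s/opennlp-tools-1.4.3/lib/trove.jar" % dict(BP=BP)
cmd = "java -cp %(CP)s -Xmx1024m opennlp.tools.lang.english.TreebankParser -k 1 -d %(BP)s/opennlp.models/english/parser" % dict(CP=CP, BP=BP)
# text=True (Python 3.7+) makes the pipes exchange str instead of bytes.
p = Popen(cmd, shell=True, stdin=PIPE, stdout=PIPE, stderr=PIPE, close_fds=True, text=True)
text = "This is my sample sentence."
p.stdin.write('%s\n' % text)
p.stdin.flush()  # push the sentence to the parser before waiting for its reply
# The parser answers with one line; field 1 is the score, the rest is the tree.
ret = p.stdout.readline().split(' ')
prob = float(ret[1])
# Tree.fromstring() replaces Tree.parse(), which only exists in old NLTK releases.
tree = nltk.Tree.fromstring(' '.join(ret[2:]))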
Cerin answered Feb 16 '23 11:02