Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to use NLTK to generate sentences from an induced grammar?

Tags:

I have a (large) list of parsed sentences (which were parsed using the Stanford parser), for example, the sentence "Now you can be entertained" has the following tree:

(ROOT
  (S
    (ADVP (RB Now))
    (, ,)
    (NP (PRP you))
    (VP (MD can)
      (VP (VB be)
        (VP (VBN entertained))))
    (. .)))

I am using the set of sentence trees to induce a grammar using nltk:

import nltk

# ... for each sentence tree t, add its production to allProductions
allProductions += t.productions()

# Induce the grammar
S = nltk.Nonterminal('S')
grammar = nltk.induce_pcfg(S, allProductions)

Now I would like to use grammar to generate new, random sentences. My hope is that since the grammar was learned from a specific set of input examples, then the generated sentences will be semantically similar. Can I do this in nltk?

If I can't use nltk to do this, do any other tools exist that can take the (possibly reformatted) grammar and generate sentences?

like image 453
stepthom Avatar asked Feb 21 '13 18:02

stepthom


People also ask

How do you define a grammar in NLTK?

A grammar consists of a start state and a set of productions. The set of terminals and nonterminals is implicitly specified by the productions. If you need efficient key-based access to productions, you can use a subclass to implement it.

What is parsing in NLTK?

NLTK Parsers. Classes and interfaces for producing tree structures that represent the internal organization of a text. This task is known as “parsing” the text, and the resulting tree structures are called the text's “parses”.

Which one of the following algorithms wastes a lot of time considering words and structures that do not correspond to the input sentence?

Recursive descent parsing has three key shortcomings. First, left-recursive productions like NP -> NP PP send it into an infinite loop. Second, the parser wastes a lot of time considering words and structures that do not correspond to the input sentence.


2 Answers

In NLTK 2.0 you can use nltk.parse.generate to generate all possible sentences for a given grammar.

This code defines a function which should generate a single sentence based on the production rules in a (P)CFG.

# This example uses choice to choose from possible expansions
from random import choice
# This function is based on _generate_all() in nltk.parse.generate
# It therefore assumes the same import environment otherwise.
def generate_sample(grammar, items=["S"]):
    frags = []
    if len(items) == 1:
        if isinstance(items[0], Nonterminal):
            for prod in grammar.productions(lhs=items[0]):
                frags.append(generate_sample(grammar, prod.rhs()))
        else:
            frags.append(items[0])
    else:
        # This is where we need to make our changes
        chosen_expansion = choice(items)
        frags.append(generate_sample,chosen_expansion)
    return frags

To make use of the weights in your PCFG, you'll obviously want to use a better sampling method than choice(), which implicitly assumes all expansions of the current node are equiprobable.

like image 99
dmh Avatar answered Oct 04 '22 17:10

dmh


First of all, if you generate random sentences, they may be semantically correct, but they will probably lose their sense.

(It sounds to me a bit like those MIT students did with their SCIgen program which is auto-generating scientific paper. Very interesting btw.)

Anyway, I never did it myself, but it seems possible with nltk.bigrams, you may way to have a look there under Generating Random Text with Bigrams.

You can also generate all subtrees of a current tree, I'm not sure if it is what you want either.

like image 21
ForceMagic Avatar answered Oct 04 '22 17:10

ForceMagic