I have a (large) list of parsed sentences (which were parsed using the Stanford parser), for example, the sentence "Now you can be entertained" has the following tree:
(ROOT
(S
(ADVP (RB Now))
(, ,)
(NP (PRP you))
(VP (MD can)
(VP (VB be)
(VP (VBN entertained))))
(. .)))
I am using the set of sentence trees to induce a grammar using nltk:
import nltk
# ... for each sentence tree t, add its production to allProductions
allProductions += t.productions()
# Induce the grammar
S = nltk.Nonterminal('S')
grammar = nltk.induce_pcfg(S, allProductions)
Now I would like to use grammar
to generate new, random sentences. My hope is that since the grammar was learned from a specific set of input examples, then the generated sentences will be semantically similar. Can I do this in nltk?
If I can't use nltk to do this, do any other tools exist that can take the (possibly reformatted) grammar
and generate sentences?
A grammar consists of a start state and a set of productions. The set of terminals and nonterminals is implicitly specified by the productions. If you need efficient key-based access to productions, you can use a subclass to implement it.
NLTK Parsers. Classes and interfaces for producing tree structures that represent the internal organization of a text. This task is known as “parsing” the text, and the resulting tree structures are called the text's “parses”.
Recursive descent parsing has three key shortcomings. First, left-recursive productions like NP -> NP PP send it into an infinite loop. Second, the parser wastes a lot of time considering words and structures that do not correspond to the input sentence.
In NLTK 2.0 you can use nltk.parse.generate
to generate all possible sentences for a given grammar.
This code defines a function which should generate a single sentence based on the production rules in a (P)CFG.
# This example uses choice to choose from possible expansions
from random import choice
# This function is based on _generate_all() in nltk.parse.generate
# It therefore assumes the same import environment otherwise.
def generate_sample(grammar, items=["S"]):
frags = []
if len(items) == 1:
if isinstance(items[0], Nonterminal):
for prod in grammar.productions(lhs=items[0]):
frags.append(generate_sample(grammar, prod.rhs()))
else:
frags.append(items[0])
else:
# This is where we need to make our changes
chosen_expansion = choice(items)
frags.append(generate_sample,chosen_expansion)
return frags
To make use of the weights in your PCFG, you'll obviously want to use a better sampling method than choice()
, which implicitly assumes all expansions of the current node are equiprobable.
First of all, if you generate random sentences, they may be semantically correct, but they will probably lose their sense.
(It sounds to me a bit like those MIT students did with their SCIgen program which is auto-generating scientific paper. Very interesting btw.)
Anyway, I never did it myself, but it seems possible with nltk.bigrams, you may way to have a look there under Generating Random Text with Bigrams.
You can also generate all subtrees of a current tree, I'm not sure if it is what you want either.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With