
How do I get a set of grammar rules from Penn Treebank using python & NLTK?

I'm fairly new to NLTK and Python. I've been creating sentence parses using the toy grammars given in the examples, but I would like to know whether it's possible to use a grammar learned from a portion of the Penn Treebank, say, instead of writing my own or using the toy grammars. (I'm using Python 2.7 on a Mac.) Many thanks.

Matt Robinson asked Aug 14 '11 13:08



2 Answers

If you want a grammar that precisely captures the Penn Treebank sample that comes with NLTK, you can do the following, assuming you've already downloaded the Treebank data for NLTK (for example with nltk.download('treebank')):

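import nltk
from nltk.corpus import treebank
# NLTK 3 renamed ContextFreeGrammar to CFG; alias it so the old name still works.
from nltk.grammar import CFG as ContextFreeGrammar, Nonterminal

# Collect every distinct production used in the parsed Treebank sample.
tbank_productions = set(production for sent in treebank.parsed_sents()
                        for production in sent.productions())
tbank_grammar = ContextFreeGrammar(Nonterminal('S'), list(tbank_productions))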

This will probably not give you anything useful, however. NLTK's parsers only work with grammars in which every terminal is listed explicitly, so you will only be able to parse sentences made up of words that occur in the Treebank sample.
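
To make the coverage restriction concrete, here is a minimal sketch that needs no corpus download. It assumes NLTK 3 (where the grammar class is called CFG) and uses a hand-written tree as a stand-in for a Treebank parse; the helper name covered is made up for this illustration.

```python
from nltk.grammar import CFG, Nonterminal
from nltk.tree import Tree

# Hand-written stand-in for a single Treebank parse (not real corpus data).
tree = Tree.fromstring("(S (NP (DT the) (NN dog)) (VP (VBD barked)))")
grammar = CFG(Nonterminal("S"), tree.productions())


def covered(tokens):
    """Return True if every token is a terminal of the grammar (hypothetical helper)."""
    try:
        # check_coverage raises ValueError when a word is missing from the grammar.
        grammar.check_coverage(tokens)
        return True
    except ValueError:
        return False


print(covered("the dog barked".split()))
print(covered("the cat barked".split()))
```

The second call should report a failure because 'cat' never appears as a terminal. NLTK's chart parsers run the same check before parsing, which is why unseen words stop them outright.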

Also, because many phrases in the Treebank have a flat structure, this grammar will generalize very poorly to sentences that weren't in the training data. This is why NLP applications that parse the Treebank have not simply read CFG rules off it. The closest technique to that would be Rens Bod's Data-Oriented Parsing approach, but it is much more sophisticated.
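
A small illustration of the flatness problem, again assuming NLTK 3 and a hand-written tree rather than real Treebank data: the only NP rule this grammar learns is NP -> DT JJ NN, so a noun phrase with two adjectives (or with none) cannot be parsed even though every word is known. The helper count_parses is hypothetical.

```python
from nltk.grammar import CFG, Nonterminal
from nltk.parse import ChartParser
from nltk.tree import Tree

# Hand-written tree with flat noun phrases (an assumption, not corpus data).
tree = Tree.fromstring(
    "(S (NP (DT the) (JJ big) (NN dog))"
    " (VP (VBD chased) (NP (DT the) (JJ red) (NN ball))))"
)
# Deduplicate the productions read off the tree before building the grammar.
grammar = CFG(Nonterminal("S"), list(set(tree.productions())))
parser = ChartParser(grammar)


def count_parses(sentence):
    """Count the parse trees the grammar licenses for a sentence (hypothetical helper)."""
    return len(list(parser.parse(sentence.split())))


# Same shape as the training tree, so the grammar can parse it.
print(count_parses("the big dog chased the red ball"))
# Every word is known, but the rules NP -> DT JJ JJ NN and NP -> DT NN were never seen.
print(count_parses("the big red dog chased the ball"))
```

The second sentence should come back with no parses at all, which is the poor generalization described above in miniature.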

Finally, parsing with the full grammar will be so unbelievably slow that it's useless. So if you want to see this approach in action on the grammar from a single sentence, just to prove that it works, try the following code (after the imports above):

mini_grammar = ContextFreeGrammar(Nonterminal('S'),
                                  treebank.parsed_sents()[0].productions())
parser = nltk.parse.EarleyChartParser(mini_grammar)
# In NLTK 3, parse() returns an iterator over trees rather than a single tree.
for tree in parser.parse(treebank.sents()[0]):
    print(tree)

Constantine answered Oct 04 '22 13:10



It is possible to train a chunker on the treebank_chunk or conll2000 corpora. You don't get a grammar out of it, but you do get a pickle-able object that can parse phrase chunks. See How to Train a NLTK Chunker, Chunk Extraction with NLTK, and NLTK Classifier Based Chunker Accuracy.
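
As a rough sketch of what training a chunker looks like, the following adapts the unigram chunker pattern from the NLTK book. It assumes NLTK 3, and to stay self-contained it trains on two hand-written chunk trees rather than on treebank_chunk or conll2000; the class name UnigramChunker and the toy sentences are illustrative only.

```python
from nltk.chunk import ChunkParserI, conlltags2tree, tree2conlltags
from nltk.tag import UnigramTagger
from nltk.tree import Tree


class UnigramChunker(ChunkParserI):
    """Predict an IOB chunk tag for each word from its POS tag alone."""

    def __init__(self, train_sents):
        # Turn each chunk tree into (pos_tag, chunk_tag) training pairs.
        train_data = [
            [(pos, chunk) for _word, pos, chunk in tree2conlltags(sent)]
            for sent in train_sents
        ]
        self.tagger = UnigramTagger(train_data)

    def parse(self, tagged_sent):
        pos_tags = [pos for _word, pos in tagged_sent]
        chunk_tags = [chunk for _pos, chunk in self.tagger.tag(pos_tags)]
        conll = [
            (word, pos, chunk)
            for (word, pos), chunk in zip(tagged_sent, chunk_tags)
        ]
        return conlltags2tree(conll)


# Toy training data (an assumption for illustration, not taken from a corpus).
train_sents = [
    Tree("S", [Tree("NP", [("the", "DT"), ("dog", "NN")]), ("barked", "VBD")]),
    Tree("S", [Tree("NP", [("a", "DT"), ("cat", "NN")]), ("slept", "VBD")]),
]
chunker = UnigramChunker(train_sents)
print(chunker.parse([("the", "DT"), ("bird", "NN"), ("sang", "VBD")]))
```

With real data you would pass chunk trees from a corpus reader as train_sents, for example conll2000.chunked_sents('train.txt') once that corpus is downloaded. Because the chunker keys on POS tags rather than words, it is not limited to the vocabulary seen in training, unlike the grammar-based approach.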

Jacob answered Oct 04 '22 13:10
