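import nltk
from nltk.parse import ViterbiParser

def pcfg_chartparser(grammarfile):
    # Read the grammar file and build a PCFG from its contents
    with open(grammarfile) as f:
        grammar = f.read()
    return nltk.PCFG.fromstring(grammar)

grammarp = pcfg_chartparser("wsjp.cfg")
VP = ViterbiParser(grammarp)
print(VP)
# The two sentences discussed below
sent = ["turn off the lights", "please turn off the lights"]
for w in sent:
    for tree in VP.parse(nltk.word_tokenize(w)):
        print(tree)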
When I run the above code, it produces the following output for the sentence "turn off the lights":
(S (VP (VB turn) (PRT (RP off)) (NP (DT the) (NNS lights)))) (p=2.53851e-14)
However, it raises the following error for the sentence "please turn off the lights":
ValueError: Grammar does not cover some of the input words: u"'please'"
I am building a ViterbiParser by supplying it with a probabilistic context-free grammar. It parses sentences whose words all appear in the grammar rules, but it fails on any sentence containing a word that the grammar rules do not cover. How can I get around this limitation?
I am referring to this assignment.
First, try to use (i) namespaces and (ii) unambiguous variable names, e.g.:
>>> from nltk import PCFG
>>> from nltk.parse import ViterbiParser
>>> import urllib.request
>>> response = urllib.request.urlopen('https://raw.githubusercontent.com/salmanahmad/6.863/master/Labs/Assignment5/Code/wsjp.cfg')
>>> wsjp = response.read().decode('utf8')
>>> grammar = PCFG.fromstring(wsjp)
>>> parser = ViterbiParser(grammar)
>>> list(parser.parse('turn off the lights'.split()))
[ProbabilisticTree('S', [ProbabilisticTree('VP', [ProbabilisticTree('VB', ['turn']) (p=0.002082678), ProbabilisticTree('PRT', [ProbabilisticTree('RP', ['off']) (p=0.1089101771)]) (p=0.10768769667270556), ProbabilisticTree('NP', [ProbabilisticTree('DT', ['the']) (p=0.7396712852), ProbabilisticTree('NNS', ['lights']) (p=4.61672e-05)]) (p=4.4236397464693323e-07)]) (p=1.0999324002161311e-13)]) (p=2.5385077255727538e-14)]
If we check whether the grammar covers the words of the failing sentence:
>>> grammar.check_coverage('please turn off the lights'.split())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.4/dist-packages/nltk/grammar.py", line 631, in check_coverage
    "input words: %r." % missing)
ValueError: Grammar does not cover some of the input words: "'please'".
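To collect the uncovered words programmatically rather than reading them off the exception message, one option is to call check_coverage() on one token at a time. This is a minimal sketch that assumes only the behaviour shown in the traceback above (a ValueError for uncovered input); uncovered_words is a hypothetical helper name, not part of NLTK.

```python
def uncovered_words(grammar, tokens):
    """Return the tokens that `grammar` does not cover, in input order.

    `grammar` is assumed to be any object whose check_coverage(tokens)
    raises ValueError for uncovered words, as the PCFG above does.
    """
    missing = []
    for token in tokens:
        try:
            # Checking one token at a time tells us exactly which word failed
            grammar.check_coverage([token])
        except ValueError:
            missing.append(token)
    return missing
```

Called as uncovered_words(grammar, 'please turn off the lights'.split()), this should report the same word that the traceback complains about.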
There are several options for resolving the unknown-word issue:
- Use wildcard non-terminal nodes to replace the unknown words. Find some way to replace the words that check_coverage() reports as not covered by the grammar with the wildcard, then parse the sentence with the wildcard.
- Go back to the grammar production file that you had before learning the PCFG with learn_pcfg.py and add all possible words to the terminal productions.
- Add the unknown words to your PCFG and then renormalize the weights, giving very small weights to the unknown words (you can also try smarter smoothing/interpolation techniques).
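As a rough illustration of how the wildcard and renormalization ideas can work together, the sketch below edits the grammar text before it is handed to PCFG.fromstring(). It gives selected part-of-speech tags a wildcard terminal with a small probability, rescales each tag's existing rules so they still sum to 1, and replaces unknown tokens in a sentence with that wildcard. The names add_wildcard, known_words and replace_unknown, the 'UNK' token and the mass value are all assumptions made for this example, as is the one-production-per-line layout with fixed-point probabilities; check these against the actual wsjp.cfg before relying on it.

```python
import re

# Assumed layout: one production per line, e.g.  NN -> 'light' [0.25]
RULE = re.compile(r"^\s*(\S+)\s*->\s*(.+?)\s*\[([\d.]+)\]\s*$")

def add_wildcard(grammar_text, tags, wildcard="UNK", mass=0.000001):
    """Let each tag in `tags` also emit `wildcard` with probability `mass`."""
    tags = set(tags)
    seen, out = set(), []
    for line in grammar_text.splitlines():
        match = RULE.match(line)
        if match and match.group(1) in tags:
            lhs, rhs, prob = match.group(1), match.group(2), float(match.group(3))
            seen.add(lhs)
            # Scale existing rules by (1 - mass) so the tag still sums to 1.
            # Fixed-point output avoids exponent forms such as 1e-06, which
            # the grammar reader may not accept.
            line = "%s -> %s [%.12f]" % (lhs, rhs, prob * (1.0 - mass))
        out.append(line)
    # Only tags that already exist in the grammar receive the wildcard rule
    for tag in sorted(seen):
        out.append("%s -> '%s' [%.12f]" % (tag, wildcard, mass))
    return "\n".join(out)

def known_words(grammar_text):
    """Collect terminals from rules whose right-hand side is one quoted word."""
    words = set()
    for line in grammar_text.splitlines():
        match = RULE.match(line)
        if not match:
            continue
        rhs = match.group(2)
        quoted = len(rhs) >= 2 and rhs[0] in "'\"" and rhs[0] == rhs[-1]
        if quoted and rhs[0] not in rhs[1:-1]:
            words.add(rhs[1:-1])
    return words

def replace_unknown(tokens, vocabulary, wildcard="UNK"):
    """Swap every token outside `vocabulary` for the wildcard token."""
    return [tok if tok in vocabulary else wildcard for tok in tokens]
```

Which tags should emit the wildcard is a modelling decision (open-class tags such as nouns and verbs are a common guess). The rewritten text can then be loaded with PCFG.fromstring() and the replaced token list passed to parser.parse(); the resulting tree will show the wildcard in place of the original word, so the original token has to be restored afterwards if it is needed.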
Since this is a homework question I will not give the answer with the full code. But the hints above should be enough to resolve the problem.