
NLTK ViterbiParser fails in parsing words that are not in the PCFG rule

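import nltk
from nltk.parse import ViterbiParser

def pcfg_chartparser(grammarfile):
    # Read the grammar file and build a PCFG from its contents
    with open(grammarfile) as f:
        grammar = f.read()
    return nltk.PCFG.fromstring(grammar)

grammarp = pcfg_chartparser("wsjp.cfg")

# The two sentences discussed below (the original snippet left `sent` undefined)
sent = ["turn off the lights", "please turn off the lights"]

VP = ViterbiParser(grammarp)
print(VP)
for w in sent:
    for tree in VP.parse(nltk.word_tokenize(w)):
        print(tree)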

When I run the above code, it produces the following output for the sentence "turn off the lights":

(S (VP (VB turn) (PRT (RP off)) (NP (DT the) (NNS lights)))) (p=2.53851e-14)

However, it raises the following error for the sentence "please turn off the lights":

ValueError: Grammar does not cover some of the input words: u"'please'"

I am building a ViterbiParser by supplying it with a probabilistic context-free grammar (PCFG). It works well on sentences whose words all appear in the grammar's rules, but it fails on any sentence containing a word that does not appear in them. How can I get around this limitation?
I am referring to this assignment.

Kaushal asked Jan 30 '16 14:01


1 Answer

First, try to use (i) namespaces and (ii) unambiguous variable names, e.g.:

>>> from nltk import PCFG
>>> from nltk.parse import ViterbiParser
>>> import urllib.request
>>> response = urllib.request.urlopen('https://raw.githubusercontent.com/salmanahmad/6.863/master/Labs/Assignment5/Code/wsjp.cfg')
>>> wsjp = response.read().decode('utf8')
>>> grammar = PCFG.fromstring(wsjp)
>>> parser = ViterbiParser(grammar)
>>> list(parser.parse('turn off the lights'.split()))
[ProbabilisticTree('S', [ProbabilisticTree('VP', [ProbabilisticTree('VB', ['turn']) (p=0.002082678), ProbabilisticTree('PRT', [ProbabilisticTree('RP', ['off']) (p=0.1089101771)]) (p=0.10768769667270556), ProbabilisticTree('NP', [ProbabilisticTree('DT', ['the']) (p=0.7396712852), ProbabilisticTree('NNS', ['lights']) (p=4.61672e-05)]) (p=4.4236397464693323e-07)]) (p=1.0999324002161311e-13)]) (p=2.5385077255727538e-14)]

If we check whether the grammar covers the second sentence:

>>> grammar.check_coverage('please turn off the lights'.split())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.4/dist-packages/nltk/grammar.py", line 631, in check_coverage
    "input words: %r." % missing)
ValueError: Grammar does not cover some of the input words: "'please'".
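
If you would rather get the uncovered words back as a list than catch the exception, one possible approach is to collect the grammar's terminals yourself. The following is a minimal sketch, assuming a small toy grammar in place of wsjp.cfg; `toy_grammar` and `uncovered_words` are made-up names for illustration, not part of NLTK.

```python
from nltk import PCFG

# Toy stand-in for wsjp.cfg (an assumption, used only to keep the example small)
toy_grammar = PCFG.fromstring("""
    S -> VP [1.0]
    VP -> VB PRT NP [1.0]
    PRT -> RP [1.0]
    NP -> DT NNS [1.0]
    VB -> 'turn' [1.0]
    RP -> 'off' [1.0]
    DT -> 'the' [1.0]
    NNS -> 'lights' [1.0]
""")

def uncovered_words(grammar, tokens):
    """Return the tokens that never appear as a terminal in the grammar."""
    vocabulary = {
        symbol
        for production in grammar.productions()
        for symbol in production.rhs()
        if isinstance(symbol, str)  # terminals are plain strings in NLTK
    }
    return [token for token in tokens if token not in vocabulary]

missing = uncovered_words(toy_grammar, 'please turn off the lights'.split())
print(missing)
```

The returned list can then feed whichever unknown-word strategy you pick.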

There are several options for resolving the unknown-word issue:

  • Use wildcard non-terminal nodes to replace the unknown words. Find some way to replace the words that check_coverage() reports as not covered by the grammar with the wildcard, then parse the sentence with the wildcard.

    • This will usually decrease the parser's accuracy unless you have specifically trained the PCFG with a grammar that handles unknown words and the wildcard is a superset of the unknown words.
  • Go back to the grammar production file that you had before learning the PCFG with learn_pcfg.py and add all possible words to the terminal productions.

  • Add the unknown words to your PCFG and then renormalize the weights, giving very small weights to the unknown words (you can also try smarter smoothing/interpolation techniques).
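
To make the last option concrete, here is a rough sketch of adding unknown words under one preterminal and rescaling that preterminal's existing rules so its probabilities still sum to 1. It runs on a toy grammar rather than wsjp.cfg, and the helper name `add_unknown_words`, the choice of the `UH` tag and the reserved mass of 0.01 are all assumptions made for illustration. Which tag an unknown word belongs to is the real decision, and the sketch does not try to solve it.

```python
from nltk import PCFG
from nltk.grammar import Nonterminal, ProbabilisticProduction
from nltk.parse import ViterbiParser

# Toy stand-in for wsjp.cfg (an assumption, used only to keep the example small)
toy = PCFG.fromstring("""
    S -> VP [0.6] | UH VP [0.4]
    VP -> VB PRT NP [1.0]
    PRT -> RP [1.0]
    NP -> DT NNS [1.0]
    UH -> 'oh' [1.0]
    VB -> 'turn' [1.0]
    RP -> 'off' [1.0]
    DT -> 'the' [1.0]
    NNS -> 'lights' [1.0]
""")

def add_unknown_words(grammar, words, tag, mass=0.01):
    """Hypothetical helper: give `words` a total probability of `mass` under `tag`.

    Assumes `tag` already has productions in the grammar; their probabilities
    are scaled by (1 - mass) so that the rules for `tag` still sum to 1.
    """
    lhs = Nonterminal(tag)
    productions = []
    for prod in grammar.productions():
        prob = prod.prob()
        if prod.lhs() == lhs:
            prob *= 1.0 - mass  # make room for the new words
        productions.append(ProbabilisticProduction(prod.lhs(), prod.rhs(), prob=prob))
    for word in words:
        # Split the reserved mass evenly among the new words
        productions.append(ProbabilisticProduction(lhs, [word], prob=mass / len(words)))
    return PCFG(grammar.start(), productions)

new_grammar = add_unknown_words(toy, ['please'], 'UH')
trees = list(ViterbiParser(new_grammar).parse('please turn off the lights'.split()))
for tree in trees:
    print(tree)
```

A uniform split of a small fixed mass is the crudest possible choice. The smoothing or interpolation techniques mentioned above would replace the `mass / len(words)` line.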

Since this is a homework question I will not give the answer with the full code. But the hints above should be enough to resolve the problem.

alvas answered Nov 15 '22 14:11