
NLTK Context Free Grammar Generation

I'm working on a non-English parser with Unicode characters. For that, I decided to use NLTK.

But it requires a predefined context-free grammar as below:

  S -> NP VP
  VP -> V NP | V NP PP
  PP -> P NP
  V -> "saw" | "ate" | "walked"
  NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
  Det -> "a" | "an" | "the" | "my"
  N -> "man" | "dog" | "cat" | "telescope" | "park"
  P -> "in" | "on" | "by" | "with" 

In my app, I am supposed to minimize hard-coding by using a rule-based grammar. For example, I can assume that any word ending in -ed or -ing is a verb. That way, it should work for any given context.

How can I feed such grammar rules to NLTK? Or generate them dynamically using a finite-state machine?

ChamingaD asked Jul 17 '13 09:07


3 Answers

If you are creating a parser, then you have to add a POS-tagging step before the actual parsing, because there is no way to reliably determine the POS tag of a word out of context. For example, 'closed' can be an adjective or a verb; a POS tagger will find the correct tag for you from the context of the word. Then you can use the output of the POS tagger to create your CFG.

You can use one of the many existing POS-taggers. In NLTK, you can simply do something like:

import nltk
input_sentence = "Dogs chase cats"
text = nltk.word_tokenize(input_sentence)
list_of_tokens = nltk.pos_tag(text)
print(list_of_tokens)

The output will be:

[('Dogs', 'NN'), ('chase', 'VB'), ('cats', 'NN')]

which you can use to create a grammar string and feed it to nltk.parse_cfg() (called nltk.CFG.fromstring() in NLTK 3).
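As a rough sketch of that last step, the tagged tokens can be turned into lexical rules and appended to a small set of hand-written structural rules. Everything named here is hypothetical: STRUCTURAL_RULES, build_grammar_string and parse_tagged are illustration only, the three structural rules cover just this toy sentence, and the code assumes tags are plain alphanumeric names (a tag such as NN$ would not be a valid nonterminal) and that words contain no double quotes. The parsing helper uses nltk.CFG.fromstring(), the NLTK 3 name for parse_cfg().

```python
from collections import defaultdict

# Hypothetical structural rules written over POS tags instead of words.
# A real grammar would need many more rules than these three.
STRUCTURAL_RULES = """
S -> NP VP
NP -> NN
VP -> VB NP
"""


def build_grammar_string(tagged_tokens, structural_rules=STRUCTURAL_RULES):
    """Append one lexical rule per tag, e.g. NN -> "Dogs" | "cats"."""
    words_by_tag = defaultdict(list)
    for word, tag in tagged_tokens:
        if word not in words_by_tag[tag]:  # skip duplicates, keep order
            words_by_tag[tag].append(word)
    lexical_rules = [
        '{} -> {}'.format(tag, ' | '.join('"{}"'.format(w) for w in words))
        for tag, words in words_by_tag.items()
    ]
    return structural_rules.strip() + "\n" + "\n".join(lexical_rules)


def parse_tagged(tagged_tokens):
    """Build the grammar and parse the sentence (requires NLTK 3)."""
    import nltk  # imported here so the string builder works without NLTK
    grammar = nltk.CFG.fromstring(build_grammar_string(tagged_tokens))
    tokens = [word for word, _ in tagged_tokens]
    return list(nltk.ChartParser(grammar).parse(tokens))


# Tagger output copied from the answer above.
tagged = [('Dogs', 'NN'), ('chase', 'VB'), ('cats', 'NN')]
grammar_text = build_grammar_string(tagged)
print(grammar_text)
```

Calling parse_tagged(tagged) should then return the parse trees for the sentence. The point of the split is that only the structural rules are hand-written; the word lists come from the tagger, so nothing lexical is hard-coded.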

dkar answered Sep 20 '22 18:09


Maybe you're looking for CFG.fromstring() (formerly parse_cfg())?

From Chapter 8 of the NLTK book (updated to NLTK 3.0):

import nltk

grammar = nltk.CFG.fromstring("""
 S -> NP VP
 VP -> V NP | V NP PP
 V -> "saw" | "ate"
 NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
 Det -> "a" | "an" | "the" | "my"
 N -> "dog" | "cat" | "cookie" | "park"
 PP -> P NP
 P -> "in" | "on" | "by" | "with"
 """)

sent = "Mary saw Bob".split()
rd_parser = nltk.RecursiveDescentParser(grammar)
for p in rd_parser.parse(sent):
    print(p)

This prints:

(S (NP Mary) (VP (V saw) (NP Bob)))
arturomp answered Sep 22 '22 18:09


You can use NLTK's RegexpTagger, which assigns a tag to each token based on regular expressions. This is exactly what you need in your case: tokens ending in 'ing' will be tagged as gerunds, and tokens ending in 'ed' will be tagged as past-tense verbs. See the example below.

patterns = [
    (r'.*ing$', 'VBG'), # gerunds
    (r'.*ed$', 'VBD'), # simple past
    (r'.*es$', 'VBZ'), # 3rd singular present
    (r'.*ould$', 'MD'), # modals
    (r'.*\'s$', 'NN$'), # possessive nouns
    (r'.*s$', 'NNS') # plural nouns
 ]

Note that the patterns are processed in order, and the first one that matches is applied. Now we can set up a tagger and use it to tag a sentence. On its own, a tagger like this is correct only about a fifth of the time.

regexp_tagger = nltk.RegexpTagger(patterns)
regexp_tagger.tag(your_sent)  # your_sent must be a list of tokens, not a raw string
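To see why the order of the patterns matters, here is a plain-Python sketch of the first-match-wins rule. It is not NLTK's implementation, only an imitation of the behaviour described above; first_match_tag and the sample sentence are made up for illustration.

```python
import re

# The same ordered patterns as in the answer above.
patterns = [
    (r'.*ing$', 'VBG'),   # gerunds
    (r'.*ed$', 'VBD'),    # simple past
    (r'.*es$', 'VBZ'),    # 3rd singular present
    (r'.*ould$', 'MD'),   # modals
    (r'.*\'s$', 'NN$'),   # possessive nouns
    (r'.*s$', 'NNS'),     # plural nouns
]


def first_match_tag(token, patterns):
    """Return the tag of the first pattern that matches, or None."""
    for regexp, tag in patterns:
        if re.match(regexp, token):
            return tag
    return None  # no pattern matched


tokens = "the dog's owner watches walking cats".split()
tagged = [(token, first_match_tag(token, patterns)) for token in tokens]
print(tagged)
```

Here 'watches' gets VBZ rather than NNS only because the 'es' pattern is listed before the more general 's' pattern, and words such as 'the' that match nothing are left with None.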

You can also combine taggers, using several of them in sequence, so that one tagger backs off to another when it cannot tag a token.
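A minimal sketch of combining taggers, assuming NLTK 3: the regular-expression tagger is tried first, and any token it cannot tag is passed to a DefaultTagger through the backoff argument. The shortened pattern list, the choice of 'NN' as the fallback tag and the sample sentence are assumptions made for this illustration.

```python
import nltk

# Shortened pattern list, for illustration only.
patterns = [
    (r'.*ing$', 'VBG'),  # gerunds
    (r'.*ed$', 'VBD'),   # simple past
    (r'.*s$', 'NNS'),    # plural nouns
]

# Fallback tagger: labels every token it receives as a noun.
default_tagger = nltk.DefaultTagger('NN')

# Tokens that match no pattern are handed to the backoff tagger.
regexp_tagger = nltk.RegexpTagger(patterns, backoff=default_tagger)

tagged = regexp_tagger.tag("the dog chased running cats".split())
print(tagged)
```

With the backoff in place no token is left untagged, although 'the' is of course wrongly tagged as a noun here. A tagger trained on a corpus would normally sit somewhere in this chain as well.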

Sanjiv answered Sep 19 '22 18:09