NLTK Context Free Grammar Generation

Question

I'm working on a parser for a non-English language that uses Unicode characters, and I decided to use NLTK for it.

However, NLTK requires a predefined context-free grammar like the one below:

  S -> NP VP
  VP -> V NP | V NP PP
  PP -> P NP
  V -> "saw" | "ate" | "walked"
  NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
  Det -> "a" | "an" | "the" | "my"
  N -> "man" | "dog" | "cat" | "telescope" | "park"
  P -> "in" | "on" | "by" | "with"

In my app, I need to minimize hard-coding by using a rule-based grammar. For example, I can assume that any word ending in -ed or -ing is a verb, so the grammar should work for any given context.

How can I feed such grammar rules to NLTK? Or can I generate them dynamically using a finite-state machine?

dkar · Accepted Answer

If you are creating a parser, you have to add a POS-tagging step before the actual parsing: there is no way to reliably determine the POS tag of a word out of context. For example, 'closed' can be an adjective or a verb; a POS tagger will pick the correct tag for you from the context of the word. You can then use the output of the POS tagger to create your CFG.

You can use one of the many existing POS-taggers. In NLTK, you can simply do something like:

import nltk
input_sentence = "Dogs chase cats"
text = nltk.word_tokenize(input_sentence)
list_of_tokens = nltk.pos_tag(text)
print(list_of_tokens)

The output will be:

[('Dogs', 'NN'), ('chase', 'VB'), ('cats', 'NN')]

which you can use to create a grammar string and feed it to nltk.parse_cfg() (renamed nltk.CFG.fromstring() in NLTK 3).
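As a rough illustration of that last step, the sketch below turns tagger output into a grammar string. It is only one way to do it: the helper name build_grammar and the three structural rules are hypothetical and chosen to fit the example sentence above, and only the standard library is used so the string-building part can be run without NLTK installed.

```python
from collections import defaultdict

# Hypothetical structural rules written over the tags seen above.
# A real grammar for your language would need its own rules here.
STRUCTURAL_RULES = [
    "S -> NP VP",
    "NP -> NN",
    "VP -> VB NP",
]


def build_grammar(tagged_tokens, structural_rules=STRUCTURAL_RULES):
    """Build a CFG string whose lexical rules come from (word, tag) pairs."""
    lexicon = defaultdict(list)
    for word, tag in tagged_tokens:
        if word not in lexicon[tag]:  # avoid duplicate alternatives
            lexicon[tag].append(word)

    # One lexical rule per tag, e.g.  NN -> "Dogs" | "cats"
    # Assumption: words contain no double quotes and tags are plain
    # alphanumeric names; tags such as NN$ may need renaming first.
    lexical_rules = [
        '{} -> {}'.format(tag, ' | '.join('"{}"'.format(w) for w in words))
        for tag, words in lexicon.items()
    ]
    return "\n".join(list(structural_rules) + lexical_rules)


tagged = [('Dogs', 'NN'), ('chase', 'VB'), ('cats', 'NN')]
grammar_string = build_grammar(tagged)
print(grammar_string)
```

The resulting string can then be passed to the grammar constructor and used with a parser on the tokenized sentence. Only the lexical rules are generated here; the structural rules still have to be written by hand.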

arturomp · Answer

Maybe you're looking for CFG.fromstring() (formerly parse_cfg())?

From Chapter 7 of the NLTK book (updated to NLTK 3.0):

import nltk

grammar = nltk.CFG.fromstring("""
 S -> NP VP
 VP -> V NP | V NP PP
 V -> "saw" | "ate"
 NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
 Det -> "a" | "an" | "the" | "my"
 N -> "dog" | "cat" | "cookie" | "park"
 PP -> P NP
 P -> "in" | "on" | "by" | "with"
 """)

sent = "Mary saw Bob".split()
rd_parser = nltk.RecursiveDescentParser(grammar)
# parse() yields every tree the grammar licenses for the sentence
for p in rd_parser.parse(sent):
    print(p)

This prints:

(S (NP Mary) (VP (V saw) (NP Bob)))

Sanjiv · Answer

You can use NLTK's RegexpTagger, which uses regular expressions to decide the tag of each token. This is exactly what you need in your case: tokens ending in 'ing' will be tagged as gerunds and tokens ending in 'ed' as past-tense verbs. See the example below.

patterns = [
    (r'.*ing$', 'VBG'),   # gerunds
    (r'.*ed$', 'VBD'),    # simple past
    (r'.*es$', 'VBZ'),    # 3rd singular present
    (r'.*ould$', 'MD'),   # modals
    (r'.*\'s$', 'NN$'),   # possessive nouns
    (r'.*s$', 'NNS'),     # plural nouns
]

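Note that these are processed in order, and the first one that matches is applied. Now we can set up a tagger and use it to tag a sentence. A tagger built from patterns like these is correct only about a fifth of the time, so treat it as a starting point. Tokens that match no pattern are tagged None unless you add a catch-all such as (r'.*', 'NN') at the end of the list.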

import nltk
regexp_tagger = nltk.RegexpTagger(patterns)
regexp_tagger.tag(your_sent)  # your_sent must be a list of tokens, not a raw string

You can also combine taggers (see "Combining Taggers" in the NLTK book), chaining several in sequence so that each one falls back to the next when it cannot tag a token.
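To make the "first match wins, then fall back" behaviour concrete, here is a small standard-library sketch of the same idea. It is not NLTK's implementation: tag_token and tag_sentence are hypothetical helpers written with the re module, and the default tag 'NN' is an assumption standing in for a fallback tagger.

```python
import re

# Same patterns as above, tried strictly in this order.
PATTERNS = [
    (r'.*ing$', 'VBG'),   # gerunds
    (r'.*ed$', 'VBD'),    # simple past
    (r'.*es$', 'VBZ'),    # 3rd singular present
    (r'.*ould$', 'MD'),   # modals
    (r".*'s$", 'NN$'),    # possessive nouns
    (r'.*s$', 'NNS'),     # plural nouns
]


def tag_token(token, patterns=PATTERNS, default='NN'):
    """Return the tag of the first matching pattern, else the default tag."""
    for regex, tag in patterns:
        if re.match(regex, token):
            return tag
    return default  # plays the role of a fallback tagger


def tag_sentence(tokens):
    """Tag each token independently; no sentence context is used."""
    return [(token, tag_token(token)) for token in tokens]


tagged = tag_sentence("the dog watches walking cats".split())
print(tagged)
```

Because '.*es$' is listed before '.*s$', "watches" is tagged VBZ rather than NNS, which shows why the order of the patterns matters. In NLTK itself, a comparable fallback can be set up by passing a backoff tagger, for example nltk.RegexpTagger(patterns, backoff=nltk.DefaultTagger('NN')).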

Tags:

python

parsing

nlp

context-free-grammar

nltk

ChamingaD
