 

Combining a Tokenizer into a Grammar and Parser with NLTK

I am making my way through the NLTK book and I can't seem to do something that would appear to be a natural first step for building a decent grammar.

My goal is to build a grammar for a particular text corpus.

(Initial question: Should I even try to start a grammar from scratch or should I start with a predefined grammar? If I should start with another grammar, which is a good one to start with for English?)

Suppose I have the following simple grammar:

simple_grammar = nltk.parse_cfg("""
    S -> NP VP
    PP -> P NP
    NP -> Det N | Det N PP
    VP -> V NP | VP PP
    Det -> 'a' | 'A'
    N -> 'car' | 'door'
    V -> 'has'
    P -> 'in' | 'for'
    """)

This grammar can parse a very simple sentence, such as:

parser = nltk.ChartParser(simple_grammar)
trees = parser.nbest_parse("A car has a door".split())

Now I want to extend this grammar to handle sentences with other nouns and verbs. How do I add those nouns and verbs to my grammar without manually defining them in the grammar?

For example, suppose I want to be able to parse the sentence "A car has wheels". I know that the supplied tokenizers can magically figure out which words are verbs/nouns, etc. How can I use the output of the tokenizer to tell the grammar that "wheels" is a noun?
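For concreteness, here is a rough sketch of the tagging step I have in mind (as far as I can tell it is nltk.pos_tag, a tagger rather than a tokenizer, that assigns the part-of-speech labels; the output shown is only what I would expect):

import nltk

# Illustration only: tokenize, then tag; "wheels" should come out as a plural noun (NNS).
tokens = nltk.word_tokenize("A car has wheels")
print(nltk.pos_tag(tokens))
# roughly: [('A', 'DT'), ('car', 'NN'), ('has', 'VBZ'), ('wheels', 'NNS')]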

asked Feb 01 '11 by speedplane




2 Answers

You could run a POS tagger over your text and then adapt your grammar to work on POS tags instead of words.

> text = nltk.word_tokenize("A car has a door")
['A', 'car', 'has', 'a', 'door']

> tagged_text = nltk.pos_tag(text)
[('A', 'DT'), ('car', 'NN'), ('has', 'VBZ'), ('a', 'DT'), ('door', 'NN')]

> pos_tags = [pos for (token, pos) in nltk.pos_tag(text)]
['DT', 'NN', 'VBZ', 'DT', 'NN']

> simple_grammar = nltk.parse_cfg("""
  S -> NP VP
  PP -> P NP
  NP -> Det N | Det N PP
  VP -> V NP | VP PP
  Det -> 'DT'
  N -> 'NN'
  V -> 'VBZ'
  P -> 'IN'
  """)

> parser = nltk.ChartParser(simple_grammar)
> tree = parser.parse(pos_tags)
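
If you want the original words back in the parse (the tree above has only POS tags at its leaves), one rough sketch, not part of the answer itself and assuming an NLTK version where ChartParser.parse returns a single Tree (and a parse was found), is to overwrite each leaf with the corresponding token, since the leaves come out in input order:

# Sketch only: put the words back onto the tag-only parse tree.
# Assumes text, tagged_text and tree from the snippet above.
for i, (word, tag) in enumerate(tagged_text):
    # the i-th leaf currently holds the tag; replace it with the word
    tree[tree.leaf_treeposition(i)] = word

print(tree)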
answered Sep 22 '22 by Stompchicken


I know this is a year later but I wanted to add some thoughts.

I take a lot of different sentences and tag them with parts of speech for a project I'm working on. From there I was doing as StompChicken suggested, pulling the tags from the (word, tag) tuples and using those tags as the "terminals" (the bottom nodes of the tree, as we create a completely tagged sentence).

Ultimately this doesn't suit my desire to mark head nouns in noun phrases, since I can't pull the head noun "word" into the grammar; the grammar only has the tags.

So what I did instead was use the set of (word, tag) tuples to create a dictionary of tags, with all the words that carry a given tag as the values for that tag. Then I print this dictionary to the screen / to a grammar.cfg (context-free grammar) file.

The format I use for printing works perfectly with setting up a parser by loading a grammar file (parser = nltk.load_parser('grammar.cfg')). One of the lines it generates looks like this:

VBG -> "fencing" | "bonging" | "amounting" | "living" ... over 30 more words...

So now my grammar has the actual words as terminals and assigns the same tags that nltk.pos_tag does.

Hope this helps anyone else wanting to automate tagging a large corpus and still have the actual words as terminals in their grammar.

import nltk
from collections import defaultdict

tag_dict = defaultdict(list)

...
    """ (Looping through sentences) """

    # Tag
    tagged_sent = nltk.pos_tag(tokens)

    # Put tags and words into the dictionary
    for word, tag in tagged_sent:
        if tag not in tag_dict:
            tag_dict[tag].append(word)
        elif word not in tag_dict.get(tag):
            tag_dict[tag].append(word)

# Printing to screen
for tag, words in tag_dict.items():
    print tag, "->",
    first_word = True
    for word in words:
        if first_word:
            print "\"" + word + "\"",
            first_word = False
        else:
            print "| \"" + word + "\"",
    print ''
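
As a variation (my own sketch, not part of the answer's code): instead of writing to a file, you can build the grammar string in memory, gluing hand-written structural rules to lexical rules generated from tag_dict. The tag_to_nonterminal mapping below is only an assumption for illustration; adjust it to whatever tag set and nonterminal names your grammar actually uses.

# Sketch: combine structural rules with generated lexical rules in memory.
structural_rules = """
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP
VP -> V NP | VP PP
"""

# Hypothetical mapping from Penn Treebank tags to the grammar's nonterminals.
tag_to_nonterminal = {'DT': 'Det', 'NN': 'N', 'NNS': 'N', 'VBZ': 'V', 'IN': 'P'}

lexical_rules = []
for tag, words in tag_dict.items():
    if tag in tag_to_nonterminal:
        # e.g.  N -> "car" | "door" | "wheels"
        rhs = " | ".join('"%s"' % w for w in words)
        lexical_rules.append("%s -> %s" % (tag_to_nonterminal[tag], rhs))

grammar = nltk.parse_cfg(structural_rules + "\n".join(lexical_rules))
parser = nltk.ChartParser(grammar)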
answered Sep 26 '22 by John