I am making my way through the NLTK book and I can't seem to do something that would appear to be a natural first step for building a decent grammar.
My goal is to build a grammar for a particular text corpus.
(Initial question: Should I even try to start a grammar from scratch or should I start with a predefined grammar? If I should start with another grammar, which is a good one to start with for English?)
Suppose I have the following simple grammar:
    simple_grammar = nltk.parse_cfg("""
    S -> NP VP
    PP -> P NP
    NP -> Det N | Det N PP
    VP -> V NP | VP PP
    Det -> 'a' | 'A'
    N -> 'car' | 'door'
    V -> 'has'
    P -> 'in' | 'for'
    """)
This grammar can parse a very simple sentence, such as:
    parser = nltk.ChartParser(simple_grammar)
    trees = parser.nbest_parse("A car has a door".split())
Now I want to extend this grammar to handle sentences with other nouns and verbs. How do I add those nouns and verbs to my grammar without manually defining them in the grammar?
For example, suppose I want to be able to parse the sentence "A car has wheels". I know that the supplied part-of-speech taggers can magically figure out which words are verbs, nouns, etc. How can I use the tagger's output to tell the grammar that "wheels" is a noun?
In natural language processing, tokenization is the task of chopping a character sequence into pieces, called tokens, perhaps throwing away certain characters such as punctuation along the way; it turns human-readable text into machine-readable components. The most obvious way to tokenize a text is to split it into words.

NLTK's tokenize module covers the two common cases: word_tokenize() splits a sentence into tokens (words), and sent_tokenize() splits a document or paragraph into sentences. Unlike str.split(), word_tokenize() separates punctuation into its own tokens and never returns empty strings, whereas split() keeps an empty string when a delimiter appears twice in succession (and only accepts a literal delimiter, so regular-expression splitting needs re.split()). Note that tokenization only splits the text; labelling each token as a noun, verb, etc. is a separate step, part-of-speech tagging.
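For example, a minimal sketch of both tokenizers (assuming the relevant NLTK data, such as the punkt models, has been downloaded):

    import nltk

    text = "A car has a door. A car has wheels."
    sentences = nltk.sent_tokenize(text)
    # sentences == ['A car has a door.', 'A car has wheels.']
    tokens = nltk.word_tokenize(sentences[1])
    # tokens == ['A', 'car', 'has', 'wheels', '.']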
You could run a POS tagger over your text and then adapt your grammar to work on POS tags instead of words.
    >>> text = nltk.word_tokenize("A car has a door")
    >>> text
    ['A', 'car', 'has', 'a', 'door']
    >>> tagged_text = nltk.pos_tag(text)
    >>> tagged_text
    [('A', 'DT'), ('car', 'NN'), ('has', 'VBZ'), ('a', 'DT'), ('door', 'NN')]
    >>> pos_tags = [pos for (token, pos) in tagged_text]
    >>> pos_tags
    ['DT', 'NN', 'VBZ', 'DT', 'NN']
    >>> simple_grammar = nltk.parse_cfg("""
    ... S -> NP VP
    ... PP -> P NP
    ... NP -> Det N | Det N PP
    ... VP -> V NP | VP PP
    ... Det -> 'DT'
    ... N -> 'NN'
    ... V -> 'VBZ'
    ... P -> 'IN'
    ... """)
    >>> parser = nltk.ChartParser(simple_grammar)
    >>> tree = parser.parse(pos_tags)
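If you want the resulting tree to show the original words rather than the POS tags, one rough follow-up (a sketch, not part of the answer above) is to substitute each word back in for its tag leaf after parsing, since the i-th leaf of the tree corresponds to the i-th tagged token:

    # Sketch: replace each POS-tag leaf with the corresponding word,
    # assuming `tree` is a single nltk.Tree whose leaves line up with tagged_text.
    if tree is not None:
        for i, (word, pos) in enumerate(tagged_text):
            tree[tree.leaf_treeposition(i)] = word
    # tree now reads (S (NP (Det A) (N car)) (VP (V has) (NP (Det a) (N door))))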
I know this is a year later but I wanted to add some thoughts.
For a project I'm working on, I take a lot of different sentences and tag them with parts of speech. From there I did as StompChicken suggested, pulling the tags out of the (word, tag) tuples and using those tags as the "terminals" (the bottom nodes of the tree, once the sentence is fully tagged).
Ultimately this doesn't suit my goal of marking head nouns within noun phrases, since I can't pull the head noun "word" into the grammar; the grammar only knows about the tags.
So what I did instead was use the set of (word, tag) tuples to build a dictionary keyed by tag, with all the words bearing that tag as the values. Then I print this dictionary to the screen and to a grammar.cfg (context-free grammar) file.
The format I print it in works perfectly for setting up a parser by loading the grammar file (parser = nltk.load_parser('grammar.cfg')). One of the lines it generates looks like this:
VBG -> "fencing" | "bonging" | "amounting" | "living" ... over 30 more words...
So now my grammar has the actual words as terminals and assigns the same tags as nltk.pos_tag does.
Hope this helps anyone else wanting to automate tagging a large corpus and still have the actual words as terminals in their grammar.
    import nltk
    from collections import defaultdict

    tag_dict = defaultdict(list)

    ...
    """ (Looping through sentences) """

    # Tag
    tagged_sent = nltk.pos_tag(tokens)

    # Put tags and words into the dictionary
    for word, tag in tagged_sent:
        if word not in tag_dict[tag]:
            tag_dict[tag].append(word)

    # Printing to screen
    for tag, words in tag_dict.items():
        print tag, "->",
        first_word = True
        for word in words:
            if first_word:
                print "\"" + word + "\"",
                first_word = False
            else:
                print "| \"" + word + "\"",
        print ''
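For completeness, here is a rough sketch of the surrounding steps (the hand-written phrase-level rules below are assumptions you would adapt to your own corpus and tag set, not part of the code above): write those rules plus the generated terminal lines to grammar.cfg, then load it with nltk.load_parser as described earlier.

    # Sketch: the phrase-level rules are hypothetical examples over Penn Treebank
    # tags; the terminal rules come from the tag_dict built by the code above.
    with open('grammar.cfg', 'w') as f:
        f.write('S -> NP VP\n')
        f.write('PP -> IN NP\n')
        f.write('NP -> DT NN | DT NN PP\n')
        f.write('VP -> VBZ NP | VP PP\n')
        for tag, words in tag_dict.items():
            f.write(tag + ' -> ' + ' | '.join('"' + word + '"' for word in words) + '\n')

    parser = nltk.load_parser('grammar.cfg')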