 

Combining a Tokenizer into a Grammar and Parser with NLTK

I am making my way through the NLTK book and I can't seem to do something that would appear to be a natural first step for building a decent grammar.

My goal is to build a grammar for a particular text corpus.

(Initial question: Should I even try to start a grammar from scratch or should I start with a predefined grammar? If I should start with another grammar, which is a good one to start with for English?)

Suppose I have the following simple grammar:

simple_grammar = nltk.parse_cfg("""
    S -> NP VP
    PP -> P NP
    NP -> Det N | Det N PP
    VP -> V NP | VP PP
    Det -> 'a' | 'A'
    N -> 'car' | 'door'
    V -> 'has'
    P -> 'in' | 'for'
    """)

This grammar can parse a very simple sentence, such as:

parser = nltk.ChartParser(simple_grammar)
trees = parser.nbest_parse("A car has a door".split())

Now I want to extend this grammar to handle sentences with other nouns and verbs. How do I add those nouns and verbs to my grammar without manually defining them in the grammar?

For example, suppose I want to be able to parse the sentence "A car has wheels". I know that the supplied tokenizers can magically figure out which words are verbs/nouns, etc. How can I use the output of the tokenizer to tell the grammar that "wheels" is a noun?
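For concreteness, here is a rough sketch of the tagging step I have in mind (as far as I can tell it is nltk.pos_tag, a tagger rather than a tokenizer, that assigns the part-of-speech labels; the output shown is only what I would expect):

import nltk

# Illustration only: tokenize, then tag; "wheels" should come out as a plural noun (NNS).
tokens = nltk.word_tokenize("A car has wheels")
print(nltk.pos_tag(tokens))
# roughly: [('A', 'DT'), ('car', 'NN'), ('has', 'VBZ'), ('wheels', 'NNS')]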

asked Feb 01 '11 by speedplane




2 Answers

You could run a POS tagger over your text and then adapt your grammar to work on POS tags instead of words.

> text = nltk.word_tokenize("A car has a door")
['A', 'car', 'has', 'a', 'door']

> tagged_text = nltk.pos_tag(text)
[('A', 'DT'), ('car', 'NN'), ('has', 'VBZ'), ('a', 'DT'), ('door', 'NN')]

> pos_tags = [pos for (token, pos) in nltk.pos_tag(text)]
['DT', 'NN', 'VBZ', 'DT', 'NN']

> simple_grammar = nltk.parse_cfg("""
  S -> NP VP
  PP -> P NP
  NP -> Det N | Det N PP
  VP -> V NP | VP PP
  Det -> 'DT'
  N -> 'NN'
  V -> 'VBZ'
  P -> 'IN'
  """)

> parser = nltk.ChartParser(simple_grammar)
> tree = parser.parse(pos_tags)
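
If you want the original words back in the parse (the tree above has only POS tags at its leaves), one rough sketch, not part of the answer itself and assuming an NLTK version where ChartParser.parse returns a single Tree (and a parse was found), is to overwrite each leaf with the corresponding token, since the leaves come out in input order:

# Sketch only: put the words back onto the tag-only parse tree.
# Assumes text, tagged_text and tree from the snippet above.
for i, (word, tag) in enumerate(tagged_text):
    # the i-th leaf currently holds the tag; replace it with the word
    tree[tree.leaf_treeposition(i)] = word

print(tree)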
answered Sep 22 '22 by Stompchicken


I know this is a year later but I wanted to add some thoughts.

I take a lot of different sentences and tag them with parts of speech for a project I'm working on. From there I was doing as StompChicken suggested, pulling the tags from the (word, tag) tuples and using those tags as the "terminals" (the bottom nodes of the tree, as we create a completely tagged sentence).

Ultimately this doesn't suit my desire to mark head nouns in noun phrases, since I can't pull the head noun "word" into the grammar; the grammar only has the tags.

So what I did instead was use the set of (word, tag) tuples to create a dictionary of tags, with all the words that carry a given tag as the values for that tag. Then I print this dictionary to the screen / to a grammar.cfg (context-free grammar) file.

The format I use for printing works perfectly with setting up a parser by loading a grammar file (parser = nltk.load_parser('grammar.cfg')). One of the lines it generates looks like this:

VBG -> "fencing" | "bonging" | "amounting" | "living" ... over 30 more words...

So now my grammar has the actual words as terminals and assigns the same tags that nltk.pos_tag does.

Hope this helps anyone else wanting to automate tagging a large corpus and still have the actual words as terminals in their grammar.

import nltk
from collections import defaultdict

tag_dict = defaultdict(list)

...
    """ (Looping through sentences) """

    # Tag
    tagged_sent = nltk.pos_tag(tokens)

    # Put tags and words into the dictionary
    for word, tag in tagged_sent:
        if tag not in tag_dict:
            tag_dict[tag].append(word)
        elif word not in tag_dict.get(tag):
            tag_dict[tag].append(word)

# Printing to screen
for tag, words in tag_dict.items():
    print tag, "->",
    first_word = True
    for word in words:
        if first_word:
            print "\"" + word + "\"",
            first_word = False
        else:
            print "| \"" + word + "\"",
    print ''
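
As a variation (my own sketch, not part of the answer's code): instead of writing to a file, you can build the grammar string in memory, gluing hand-written structural rules to lexical rules generated from tag_dict. The tag_to_nonterminal mapping below is only an assumption for illustration; adjust it to whatever tag set and nonterminal names your grammar actually uses.

# Sketch: combine structural rules with generated lexical rules in memory.
structural_rules = """
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP
VP -> V NP | VP PP
"""

# Hypothetical mapping from Penn Treebank tags to the grammar's nonterminals.
tag_to_nonterminal = {'DT': 'Det', 'NN': 'N', 'NNS': 'N', 'VBZ': 'V', 'IN': 'P'}

lexical_rules = []
for tag, words in tag_dict.items():
    if tag in tag_to_nonterminal:
        # e.g.  N -> "car" | "door" | "wheels"
        rhs = " | ".join('"%s"' % w for w in words)
        lexical_rules.append("%s -> %s" % (tag_to_nonterminal[tag], rhs))

grammar = nltk.parse_cfg(structural_rules + "\n".join(lexical_rules))
parser = nltk.ChartParser(grammar)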
answered Sep 26 '22 by John