 

custom tagging with nltk

Tags:

python

nltk

I'm trying to create a small English-like language for specifying tasks. The basic idea is to split a statement into verbs and noun phrases that those verbs should apply to. I'm working with NLTK but not getting the results I'd hoped for, e.g.:

>>> nltk.pos_tag(nltk.word_tokenize("select the files and copy to harddrive'"))
[('select', 'NN'), ('the', 'DT'), ('files', 'NNS'), ('and', 'CC'), ('copy', 'VB'), ('to', 'TO'), ("harddrive'", 'NNP')]
>>> nltk.pos_tag(nltk.word_tokenize("move the files to harddrive'"))
[('move', 'NN'), ('the', 'DT'), ('files', 'NNS'), ('to', 'TO'), ("harddrive'", 'NNP')]
>>> nltk.pos_tag(nltk.word_tokenize("copy the files to harddrive'"))
[('copy', 'NN'), ('the', 'DT'), ('files', 'NNS'), ('to', 'TO'), ("harddrive'", 'NNP')]

In each case it has failed to realise that the first word (select, move and copy) was intended as a verb. I know I can create custom taggers and grammars to work around this, but at the same time I'm hesitant to go reinventing the wheel when a lot of this stuff is out of my league. I would particularly prefer a solution that could handle non-English languages as well.

So anyway, my question is one of the following: Is there a better tagger for this type of grammar? Is there a way to weight an existing tagger towards using the verb form more frequently than the noun form? Is there a way to train a tagger? Is there a better way altogether?

asked May 07 '11 by SpliFF


People also ask

What is tagging in NLTK?

POS tagging in NLTK is the process of marking up the words in a text as a particular part of speech, based on their definition and context. Some NLTK POS tag examples are: CC, CD, EX, JJ, MD, NNP, PDT, PRP$, TO, etc. A POS tagger is used to assign grammatical information to each word of a sentence.

How do you make a POS tagger?

You will need a lot of samples already labeled with POS tags. Then you can use the samples to train an RNN. The x input to the RNN will be the sequence of tokens (words) and the y output will be the POS tags. The RNN, once trained, can be used as a POS tagger.

What is JJ in POS-tagging?

IN: preposition/subordinating conjunction
JJ: adjective ('big')
JJR: adjective, comparative ('bigger')
JJS: adjective, superlative ('biggest')

Why should we use backoff options when tagging with NLTK?

Backoff tagging is one of the core features of SequentialBackoffTagger. It allows you to chain taggers together so that if one tagger doesn't know how to tag a word, it can pass the word on to the next backoff tagger.
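For illustration, here's a minimal sketch of such a chain, assuming the treebank corpus has been downloaded via nltk.download('treebank'); the training slice and the DefaultTagger fallback tag are arbitrary choices:

from nltk.corpus import treebank
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger

train_sents = treebank.tagged_sents()[:3000]

t0 = DefaultTagger('NN')                     # last resort: tag everything as a noun
t1 = UnigramTagger(train_sents, backoff=t0)  # per-word most frequent tag, else ask t0
t2 = BigramTagger(train_sents, backoff=t1)   # uses the previous tag for context, else asks t1

print(t2.tag("select the files".split()))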


2 Answers

One solution is to create a manual UnigramTagger that backs off to the NLTK tagger. Something like this:

>>> import nltk.tag, nltk.data
>>> default_tagger = nltk.data.load(nltk.tag._POS_TAGGER)
>>> model = {'select': 'VB'}
>>> tagger = nltk.tag.UnigramTagger(model=model, backoff=default_tagger)

Then you get

>>> tagger.tag(['select', 'the', 'files'])
[('select', 'VB'), ('the', 'DT'), ('files', 'NNS')]

This same method can work for non-English languages, as long as you have an appropriate default tagger. You can train your own taggers using train_tagger.py from nltk-trainer and an appropriate corpus.
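A caveat: nltk.tag._POS_TAGGER is an internal constant that was dropped from later NLTK releases, so the nltk.data.load call above may fail on a recent install. One workaround, sketched here on the assumption that you have the treebank corpus available (nltk.download('treebank')), is to use a corpus-trained UnigramTagger as the backoff instead:

from nltk.corpus import treebank
from nltk.tag import UnigramTagger

# Backoff tagger trained on a standard tagged corpus.
corpus_tagger = UnigramTagger(treebank.tagged_sents())

# Words from the mini-language that should always be treated as verbs.
model = {'select': 'VB', 'move': 'VB', 'copy': 'VB'}

tagger = UnigramTagger(model=model, backoff=corpus_tagger)

# 'select' is now forced to VB; the remaining words fall back to the corpus tagger.
print(tagger.tag("select the files".split()))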

answered Sep 17 '22 by Jacob


Jacob's answer is spot on. However, to expand upon it, you may find you need more than just unigrams.

For example, consider the three sentences:

select the files
use the select function on the sockets
the select was good

Here, the word "select" is being used as a verb, adjective, and noun respectively. A unigram tagger won't be able to model this. Even a bigram tagger can't handle it, because two of the cases share the same preceding word (i.e. "the"). You'd need a trigram tagger to handle this case correctly.

import nltk.tag, nltk.data
from nltk import word_tokenize

default_tagger = nltk.data.load(nltk.tag._POS_TAGGER)

def evaluate(tagger, sentences):
    good, total = 0, 0.0
    for sentence, func in sentences:
        tags = tagger.tag(word_tokenize(sentence))
        print(tags)
        good += func(tags)
        total += 1
    print('Accuracy:', good / total)

sentences = [
    ('select the files', lambda tags: ('select', 'VB') in tags),
    ('use the select function on the sockets', lambda tags: ('select', 'JJ') in tags and ('use', 'VB') in tags),
    ('the select was good', lambda tags: ('select', 'NN') in tags),
]

train_sents = [
    [('select', 'VB'), ('the', 'DT'), ('files', 'NNS')],
    [('use', 'VB'), ('the', 'DT'), ('select', 'JJ'), ('function', 'NN'), ('on', 'IN'), ('the', 'DT'), ('sockets', 'NNS')],
    [('the', 'DT'), ('select', 'NN'), ('files', 'NNS')],
]

tagger = nltk.TrigramTagger(train_sents, backoff=default_tagger)
evaluate(tagger, sentences)
# model = tagger._context_to_tag

Note that you can use NLTK's NgramTagger to train a tagger using an arbitrarily high number of n-grams, but typically you don't get much of a performance increase beyond trigrams.
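For what it's worth, a minimal sketch of that, reusing train_sents and default_tagger from the snippet above (n=4 is chosen arbitrarily):

from nltk.tag import NgramTagger

# n=4: each word's tag is conditioned on the tags of the three preceding words.
quadgram_tagger = NgramTagger(4, train=train_sents, backoff=default_tagger)
print(quadgram_tagger.tag(word_tokenize('use the select function on the sockets')))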

answered Sep 20 '22 by Cerin