I'm trying to create a small English-like language for specifying tasks. The basic idea is to split a statement into verbs and the noun phrases those verbs should apply to. I'm working with NLTK but am not getting the results I'd hoped for, e.g.:
>>> nltk.pos_tag(nltk.word_tokenize("select the files and copy to harddrive'"))
[('select', 'NN'), ('the', 'DT'), ('files', 'NNS'), ('and', 'CC'), ('copy', 'VB'), ('to', 'TO'), ("harddrive'", 'NNP')]
>>> nltk.pos_tag(nltk.word_tokenize("move the files to harddrive'"))
[('move', 'NN'), ('the', 'DT'), ('files', 'NNS'), ('to', 'TO'), ("harddrive'", 'NNP')]
>>> nltk.pos_tag(nltk.word_tokenize("copy the files to harddrive'"))
[('copy', 'NN'), ('the', 'DT'), ('files', 'NNS'), ('to', 'TO'), ("harddrive'", 'NNP')]
In each case it has failed to realise that the first word (select, move and copy respectively) was intended as a verb. I know I can create custom taggers and grammars to work around this, but at the same time I'm hesitant to go reinventing the wheel when a lot of this stuff is out of my league. I would particularly prefer a solution that can handle non-English languages as well.
So anyway, my question is one of:
- Is there a better tagger for this type of grammar?
- Is there a way to weight an existing tagger towards using the verb form more frequently than the noun form?
- Is there a way to train a tagger?
- Is there a better way altogether?
Summary: POS tagging in NLTK is the process of marking up each word in a text with its part of speech, based on the word's definition and its context. Examples of the tags used are CC, CD, EX, JJ, MD, NNP, PDT, PRP$ and TO; a POS tagger assigns this grammatical information to each word of the sentence.
You will need a lot of samples already labelled with POS tags. Then you can use those samples to train an RNN: the x input to the RNN is the sequence of tokens (words) and the y output is the corresponding sequence of POS tags. Once trained, the RNN can be used as a POS tagger.
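As a minimal sketch of that idea (my own example, not from the thread), assuming PyTorch is available and using a tiny hand-made training set; a real tagger would need a large labelled corpus such as the Penn Treebank:

# Toy RNN POS tagger: x = word-id sequence, y = tag-id sequence.
import torch
import torch.nn as nn

# Tiny hypothetical labelled sample; replace with a real tagged corpus.
train_data = [
    (["select", "the", "files"], ["VB", "DT", "NNS"]),
    (["copy", "the", "files", "to", "harddrive"], ["VB", "DT", "NNS", "TO", "NN"]),
]
word_ix = {w: i for i, w in enumerate(sorted({w for s, _ in train_data for w in s}))}
tag_ix = {t: i for i, t in enumerate(sorted({t for _, ts in train_data for t in ts}))}

class LSTMTagger(nn.Module):
    def __init__(self, vocab_size, tagset_size, emb_dim=32, hidden_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, tagset_size)

    def forward(self, token_ids):
        x = self.embed(token_ids).unsqueeze(1)   # (seq_len, batch=1, emb_dim)
        h, _ = self.lstm(x)
        return self.out(h.squeeze(1))            # (seq_len, tagset_size) logits

model = LSTMTagger(len(word_ix), len(tag_ix))
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.Adam(model.parameters(), lr=0.01)

for _ in range(200):                             # toy training loop
    for words, tags in train_data:
        opt.zero_grad()
        xs = torch.tensor([word_ix[w] for w in words])
        ys = torch.tensor([tag_ix[t] for t in tags])
        loss_fn(model(xs), ys).backward()
        opt.step()

# Predict: pick the highest-scoring tag for each token.
with torch.no_grad():
    ids = torch.tensor([word_ix[w] for w in ["select", "the", "files"]])
    print([list(tag_ix)[int(i)] for i in model(ids).argmax(dim=1)])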
For reference, some of the Penn Treebank tags used below:
IN: preposition/subordinating conjunction
JJ: adjective ('big')
JJR: adjective, comparative ('bigger')
JJS: adjective, superlative ('biggest')
Backoff tagging is one of the core features of SequentialBackoffTagger. It allows you to chain taggers together so that if one tagger doesn't know how to tag a word, it can pass the word on to the next backoff tagger.
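A short sketch of that chaining idea (my own example, not from the thread), assuming the treebank corpus has been fetched with nltk.download('treebank'):

from nltk.corpus import treebank
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger

train = treebank.tagged_sents()[:3000]
t0 = DefaultTagger('NN')               # last resort: tag everything as a noun
t1 = UnigramTagger(train, backoff=t0)  # unknown words fall through to t0
t2 = BigramTagger(train, backoff=t1)   # unknown contexts fall through to t1
print(t2.tag('select the files'.split()))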
One solution is to create a manual UnigramTagger that backs off to the NLTK tagger. Something like this:
>>> import nltk.tag, nltk.data
>>> default_tagger = nltk.data.load(nltk.tag._POS_TAGGER)
>>> model = {'select': 'VB'}
>>> tagger = nltk.tag.UnigramTagger(model=model, backoff=default_tagger)
Then you get
>>> tagger.tag(['select', 'the', 'files'])
[('select', 'VB'), ('the', 'DT'), ('files', 'NNS')]
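Note that nltk.tag._POS_TAGGER is a private constant that was removed in later NLTK releases. If that line fails for you, one simple (unofficial) variant of the same idea is to tag with nltk.pos_tag and then override the tags for words in your own dictionary:

import nltk  # requires the 'punkt' tokenizer and 'averaged_perceptron_tagger' data packages

# Hypothetical override table: words this mini-language always treats as verbs.
FORCED_TAGS = {'select': 'VB', 'copy': 'VB', 'move': 'VB'}

def tag_command(sentence):
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    return [(word, FORCED_TAGS.get(word.lower(), tag)) for word, tag in tagged]

print(tag_command("select the files and copy to harddrive"))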
This same method can work for non-English languages, as long as you have an appropriate default tagger. You can train your own taggers using train_tagger.py from nltk-trainer and an appropriate corpus.
Jacob's answer is spot on. However, to expand upon it, you may find you need more than just unigrams.
For example, consider the three sentences:
select the files
use the select function on the sockets
the select was good
Here, the word "select" is being used as a verb, adjective, and noun respectively. A unigram tagger won't be able to model this. Even a bigram tagger can't handle it, because two of the cases share the same preceding word (i.e. "the"). You'd need a trigram tagger to handle this case correctly.
import nltk
import nltk.tag, nltk.data

# Default NLTK tagger used as the backoff. (nltk.tag._POS_TAGGER was removed
# in later NLTK releases; any other SequentialBackoffTagger, e.g.
# nltk.DefaultTagger('NN'), can be substituted here.)
default_tagger = nltk.data.load(nltk.tag._POS_TAGGER)

def evaluate(tagger, sentences):
    """Tag each sentence and count how many pass their check function."""
    good, total = 0, 0.0
    for sentence, func in sentences:
        tags = tagger.tag(nltk.word_tokenize(sentence))
        print(tags)
        good += func(tags)
        total += 1
    print('Accuracy:', good / total)

# Each test sentence is paired with a check that "select" (and "use") got the intended tag.
sentences = [
    ('select the files', lambda tags: ('select', 'VB') in tags),
    ('use the select function on the sockets',
     lambda tags: ('select', 'JJ') in tags and ('use', 'VB') in tags),
    ('the select was good', lambda tags: ('select', 'NN') in tags),
]

# Hand-labelled training sentences covering the three uses of "select".
train_sents = [
    [('select', 'VB'), ('the', 'DT'), ('files', 'NNS')],
    [('use', 'VB'), ('the', 'DT'), ('select', 'JJ'), ('function', 'NN'),
     ('on', 'IN'), ('the', 'DT'), ('sockets', 'NNS')],
    [('the', 'DT'), ('select', 'NN'), ('files', 'NNS')],
]

tagger = nltk.TrigramTagger(train_sents, backoff=default_tagger)
evaluate(tagger, sentences)
#model = tagger._context_to_tag
Note that you can use NLTK's NgramTagger to train a tagger with an arbitrarily large n, but typically you don't get much of a performance increase beyond trigrams.
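For example, a 4-gram tagger can be built the same way; this sketch reuses the train_sents and tagger from the code above:

# NgramTagger(n, ...) generalises Uni/Bi/TrigramTagger to any context size.
quadgram_tagger = nltk.NgramTagger(4, train_sents, backoff=tagger)
print(quadgram_tagger.tag(nltk.word_tokenize('use the select function on the sockets')))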