 

custom tagging with nltk

Tags:

python

nltk

I'm trying to create a small English-like language for specifying tasks. The basic idea is to split a statement into verbs and noun phrases that those verbs should apply to. I'm working with NLTK but not getting the results I'd hoped for, e.g.:

>>> nltk.pos_tag(nltk.word_tokenize("select the files and copy to harddrive'"))
[('select', 'NN'), ('the', 'DT'), ('files', 'NNS'), ('and', 'CC'), ('copy', 'VB'), ('to', 'TO'), ("harddrive'", 'NNP')]
>>> nltk.pos_tag(nltk.word_tokenize("move the files to harddrive'"))
[('move', 'NN'), ('the', 'DT'), ('files', 'NNS'), ('to', 'TO'), ("harddrive'", 'NNP')]
>>> nltk.pos_tag(nltk.word_tokenize("copy the files to harddrive'"))
[('copy', 'NN'), ('the', 'DT'), ('files', 'NNS'), ('to', 'TO'), ("harddrive'", 'NNP')]

In each case it has failed to realise that the first word (select, move and copy) was intended as a verb. I know I can create custom taggers and grammars to work around this, but at the same time I'm hesitant to go reinventing the wheel when a lot of this stuff is out of my league. I would particularly prefer a solution that could handle non-English languages as well.

So anyway, my question is one of the following: Is there a better tagger for this type of grammar? Is there a way to weight an existing tagger towards using the verb form more frequently than the noun form? Is there a way to train a tagger? Is there a better way altogether?

asked May 07 '11 by SpliFF


People also ask

What is tagging in NLTK?

POS tagging in NLTK is the process of marking up the words in a text as a particular part of speech, based on their definition and context. Some NLTK POS tag examples are: CC, CD, EX, JJ, MD, NNP, PDT, PRP$, TO, etc. A POS tagger is used to assign grammatical information to each word of a sentence.

How do you make a POS tagger?

You will need a lot of samples already labeled with POS tags. Then you can use the samples to train an RNN. The x input to the RNN will be the sequence of tokens (words) and the y output will be the POS tags. The RNN, once trained, can be used as a POS tagger.

What is JJ in POS-tagging?

IN: preposition/subordinating conjunction
JJ: adjective ('big')
JJR: adjective, comparative ('bigger')
JJS: adjective, superlative ('biggest')

Why should we use backoff options when tagging with NLTK?

Backoff tagging is one of the core features of SequentialBackoffTagger. It allows you to chain taggers together so that if one tagger doesn't know how to tag a word, it can pass the word on to the next backoff tagger.
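For illustration, here's a minimal sketch of such a chain, assuming the treebank corpus has been downloaded via nltk.download('treebank'); the training slice and the DefaultTagger fallback tag are arbitrary choices:

from nltk.corpus import treebank
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger

train_sents = treebank.tagged_sents()[:3000]

t0 = DefaultTagger('NN')                     # last resort: tag everything as a noun
t1 = UnigramTagger(train_sents, backoff=t0)  # per-word most frequent tag, else ask t0
t2 = BigramTagger(train_sents, backoff=t1)   # uses the previous tag for context, else asks t1

print(t2.tag("select the files".split()))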


2 Answers

One solution is to create a manual UnigramTagger that backs off to the NLTK tagger. Something like this:

>>> import nltk.tag, nltk.data
>>> default_tagger = nltk.data.load(nltk.tag._POS_TAGGER)
>>> model = {'select': 'VB'}
>>> tagger = nltk.tag.UnigramTagger(model=model, backoff=default_tagger)

Then you get

>>> tagger.tag(['select', 'the', 'files'])
[('select', 'VB'), ('the', 'DT'), ('files', 'NNS')]

This same method can work for non-English languages, as long as you have an appropriate default tagger. You can train your own taggers using train_tagger.py from nltk-trainer and an appropriate corpus.
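A caveat: nltk.tag._POS_TAGGER is an internal constant that was dropped from later NLTK releases, so the nltk.data.load call above may fail on a recent install. One workaround, sketched here on the assumption that you have the treebank corpus available (nltk.download('treebank')), is to use a corpus-trained UnigramTagger as the backoff instead:

from nltk.corpus import treebank
from nltk.tag import UnigramTagger

# Backoff tagger trained on a standard tagged corpus.
corpus_tagger = UnigramTagger(treebank.tagged_sents())

# Words from the mini-language that should always be treated as verbs.
model = {'select': 'VB', 'move': 'VB', 'copy': 'VB'}

tagger = UnigramTagger(model=model, backoff=corpus_tagger)

# 'select' is now forced to VB; the remaining words fall back to the corpus tagger.
print(tagger.tag("select the files".split()))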

answered Sep 17 '22 by Jacob


Jacob's answer is spot on. However, to expand upon it, you may find you need more than just unigrams.

For example, consider the three sentences:

select the files
use the select function on the sockets
the select was good

Here, the word "select" is being used as a verb, adjective, and noun respectively. A unigram tagger won't be able to model this. Even a bigram tagger can't handle it, because two of the cases share the same preceding word (i.e. "the"). You'd need a trigram tagger to handle this case correctly.

import nltk.tag, nltk.data
from nltk import word_tokenize

default_tagger = nltk.data.load(nltk.tag._POS_TAGGER)

def evaluate(tagger, sentences):
    good, total = 0, 0.0
    for sentence, func in sentences:
        tags = tagger.tag(word_tokenize(sentence))
        print(tags)
        good += func(tags)
        total += 1
    print('Accuracy:', good / total)

sentences = [
    ('select the files', lambda tags: ('select', 'VB') in tags),
    ('use the select function on the sockets', lambda tags: ('select', 'JJ') in tags and ('use', 'VB') in tags),
    ('the select was good', lambda tags: ('select', 'NN') in tags),
]

train_sents = [
    [('select', 'VB'), ('the', 'DT'), ('files', 'NNS')],
    [('use', 'VB'), ('the', 'DT'), ('select', 'JJ'), ('function', 'NN'), ('on', 'IN'), ('the', 'DT'), ('sockets', 'NNS')],
    [('the', 'DT'), ('select', 'NN'), ('files', 'NNS')],
]

tagger = nltk.TrigramTagger(train_sents, backoff=default_tagger)
evaluate(tagger, sentences)
# model = tagger._context_to_tag

Note that you can use NLTK's NgramTagger to train a tagger using an arbitrarily high number of n-grams, but typically you don't get much of a performance increase beyond trigrams.
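For what it's worth, a minimal sketch of that, reusing train_sents and default_tagger from the snippet above (n=4 is chosen arbitrarily):

from nltk.tag import NgramTagger

# n=4: each word's tag is conditioned on the tags of the three preceding words.
quadgram_tagger = NgramTagger(4, train=train_sents, backoff=default_tagger)
print(quadgram_tagger.tag(word_tokenize('use the select function on the sockets')))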

answered Sep 20 '22 by Cerin