I want to use the WordNet lemmatizer in Python. I have learnt that the default POS tag is NOUN, and that it does not output the correct lemma for a verb unless the POS tag is explicitly specified as VERB.
My question is: what is the best way to perform this lemmatization accurately?
I did the POS tagging using nltk.pos_tag, and I am lost on how to convert the Treebank POS tags into WordNet-compatible POS tags. Please help.
    import nltk
    from nltk.stem.wordnet import WordNetLemmatizer

    lmtzr = WordNetLemmatizer()
    tagged = nltk.pos_tag(tokens)
I get output tags such as NN, JJ, VB and RB. How do I change these to WordNet-compatible tags?
Also, do I have to train nltk.pos_tag() with a tagged corpus, or can I use it directly on my data to evaluate?
WordNet is a large, freely available lexical database for the English language that aims to establish structured semantic relationships between words. It also offers lemmatization and is one of the earliest and most commonly used lemmatizers.
Lemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item. It is similar to stemming, but it takes context into account, such as the word's part of speech, so that inflected forms are mapped to one meaningful base form (the lemma).
Part-of-speech (POS) tagging assigns a part of speech to each individual word in a sentence. Unlike phrase matching, which is performed at the sentence or multi-word level, POS tagging is performed at the token level.
First of all, you can use nltk.pos_tag() directly without training it. The function loads a pretrained tagger from a file. In older NLTK releases you can see the file name with nltk.tag._POS_TAGGER (newer releases ship a different default tagger and no longer expose this attribute):

    >>> nltk.tag._POS_TAGGER
    'taggers/maxent_treebank_pos_tagger/english.pickle'

As it was trained with the Treebank corpus, it also uses the Treebank tag set.
The following function would map the treebank tags to WordNet part of speech names:

    from nltk.corpus import wordnet

    def get_wordnet_pos(treebank_tag):
        if treebank_tag.startswith('J'):
            return wordnet.ADJ
        elif treebank_tag.startswith('V'):
            return wordnet.VERB
        elif treebank_tag.startswith('N'):
            return wordnet.NOUN
        elif treebank_tag.startswith('R'):
            return wordnet.ADV
        else:
            return ''

You can then use the return value with the lemmatizer:

    >>> from nltk.stem.wordnet import WordNetLemmatizer
    >>> lemmatizer = WordNetLemmatizer()
    >>> lemmatizer.lemmatize('going', wordnet.VERB)
    'go'

Check the return value before passing it to the lemmatizer, because an empty string would give a KeyError.
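One way to avoid the empty-string case is to fall back to the noun tag, which is the lemmatizer's own default. The following is only a sketch of that idea, not part of NLTK: the names to_wordnet_pos and lemmatize_tagged are hypothetical, the POS letters are written out literally so the snippet runs without NLTK data, and the lemmatizer is passed in as a callable.

```python
# WordNet POS constants, written out literally; they match the values
# defined in nltk.corpus.reader.wordnet ('a', 'v', 'n', 'r').
ADJ, VERB, NOUN, ADV = "a", "v", "n", "r"

# First letter of a Penn Treebank tag -> WordNet POS constant.
_TAG_PREFIX_TO_WORDNET = {"J": ADJ, "V": VERB, "N": NOUN, "R": ADV}


def to_wordnet_pos(treebank_tag, default=NOUN):
    """Map a Treebank tag to a WordNet POS, using `default` for anything else.

    Hypothetical helper: falling back to NOUN is an assumption that mirrors
    the lemmatizer's default, so the caller never passes an empty string.
    """
    return _TAG_PREFIX_TO_WORDNET.get(treebank_tag[:1], default)


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize a list of (word, treebank_tag) pairs.

    `lemmatize` is any callable taking (word, pos), for example
    WordNetLemmatizer().lemmatize from nltk.stem.wordnet.
    """
    return [lemmatize(word, to_wordnet_pos(tag)) for word, tag in tagged_tokens]
```

With NLTK installed and its tagger and WordNet data available, this could be called as lemmatize_tagged(nltk.pos_tag(tokens), WordNetLemmatizer().lemmatize). Whether noun is the right fallback for tags such as DT or IN depends on your data; skipping lemmatization for those tokens is another reasonable choice.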
For reference, the part-of-speech constants are defined in the source code of nltk.corpus.reader.wordnet (http://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html):

    #{ Part-of-speech constants
    ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
    #}

    POS_LIST = [NOUN, VERB, ADJ, ADV]