WordNet lemmatization and POS tagging in Python

I want to use the WordNet lemmatizer in Python. I have learnt that the default POS tag is NOUN, and that it does not output the correct lemma for a verb unless the POS tag is explicitly specified as VERB.

My question is: what is the best way to perform this lemmatization accurately?

I did the POS tagging using nltk.pos_tag, and I am lost on how to convert the Treebank POS tags to WordNet-compatible POS tags. Please help.

    import nltk
    from nltk.stem.wordnet import WordNetLemmatizer

    lmtzr = WordNetLemmatizer()
    tagged = nltk.pos_tag(tokens)

The output tags look like NN, JJ, VB, RB. How do I change these to WordNet-compatible tags?

Also, do I have to train nltk.pos_tag() with a tagged corpus, or can I use it directly on my data to evaluate?

user1946217 asked Mar 23 '13 12:03



People also ask

Is WordNet used in lemmatization?

WordNet is a large, freely and publicly available lexical database for the English language that records structured semantic relationships between words. It also offers lemmatization capabilities and is one of the earliest and most commonly used lemmatizers.

What is Lemmatization in Python?

Lemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item, the lemma. It is similar to stemming, but it takes the word's part of speech into account instead of only stripping suffixes, so inflected forms map to a dictionary form (for example, 'going' tagged as a verb maps to 'go').

What is POS Tagging in Python?

Part-of-speech (POS) tagging assigns a part of speech to each individual word in a sentence. Unlike phrase matching, which is performed at the sentence or multi-word level, POS tagging is performed at the token level.


2 Answers

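First of all, you can use nltk.pos_tag() directly without training it. The function loads a pretrained tagger from a file. In the NLTK release this answer was written against, you can see the file name with nltk.tag._POS_TAGGER (a private attribute that newer releases may not provide):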

    >>> nltk.tag._POS_TAGGER
    'taggers/maxent_treebank_pos_tagger/english.pickle'

As it was trained with the Treebank corpus, it also uses the Treebank tag set.

The following function would map the treebank tags to WordNet part of speech names:

    from nltk.corpus import wordnet

    def get_wordnet_pos(treebank_tag):
        if treebank_tag.startswith('J'):
            return wordnet.ADJ
        elif treebank_tag.startswith('V'):
            return wordnet.VERB
        elif treebank_tag.startswith('N'):
            return wordnet.NOUN
        elif treebank_tag.startswith('R'):
            return wordnet.ADV
        else:
            return ''

You can then use the return value with the lemmatizer:

    >>> from nltk.stem.wordnet import WordNetLemmatizer
    >>> lemmatizer = WordNetLemmatizer()
    >>> lemmatizer.lemmatize('going', wordnet.VERB)
    'go'

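Check the return value before passing it to the lemmatizer, because an empty string would raise a KeyError. For tags that have no WordNet equivalent, you can either skip the POS argument (the lemmatizer then uses its default, NOUN) or leave the word unchanged.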
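
The steps in this answer (tag, map the tag, lemmatize) can be combined into one helper. The following is only a minimal sketch, not code from the answer: `lemmatize_tokens`, the `default` parameter, and the injectable `tagger` and `lemmatize` arguments are hypothetical names introduced for illustration, and falling back to NOUN for unmapped tags is an assumption that mirrors the lemmatizer's own default. The single-letter constants are the same values as wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV.

```python
# WordNet POS constants as plain strings (same values as
# nltk.corpus.wordnet.ADJ / VERB / NOUN / ADV).
ADJ, VERB, NOUN, ADV = 'a', 'v', 'n', 'r'

# First letter of a Treebank tag -> WordNet POS.
_PREFIX_TO_WORDNET = {'J': ADJ, 'V': VERB, 'N': NOUN, 'R': ADV}


def get_wordnet_pos(treebank_tag, default=NOUN):
    """Map a Treebank tag to a WordNet POS, or `default` if none applies.

    Returning a valid POS instead of '' avoids the KeyError mentioned above.
    The NOUN fallback is an assumption, chosen to match the lemmatizer default.
    """
    return _PREFIX_TO_WORDNET.get(treebank_tag[:1], default)


def lemmatize_tokens(tokens, tagger=None, lemmatize=None):
    """Tag `tokens`, convert each tag, and lemmatize each word.

    `tagger` and `lemmatize` are hypothetical hooks; when omitted, NLTK's
    pretrained tagger and WordNetLemmatizer are used (both need NLTK and
    its data packages to be installed).
    """
    if tagger is None:
        import nltk
        tagger = nltk.pos_tag
    if lemmatize is None:
        from nltk.stem.wordnet import WordNetLemmatizer
        lemmatize = WordNetLemmatizer().lemmatize
    return [lemmatize(word, get_wordnet_pos(tag)) for word, tag in tagger(tokens)]
```

With NLTK and its tagger and WordNet data installed, calling `lemmatize_tokens(tokens)` on a list of word strings returns one lemma per token. Whether NOUN is the right fallback depends on your data; returning the word unchanged for unmapped tags is an equally reasonable choice.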

Suzana answered Sep 21 '22 23:09



The WordNet part-of-speech constants are plain one-letter strings, as defined in the source code of nltk.corpus.reader.wordnet (http://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html):

    #{ Part-of-speech constants
    ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
    #}

    POS_LIST = [NOUN, VERB, ADJ, ADV]
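
Because the constants are ordinary strings, a candidate tag can be checked against POS_LIST before it reaches the lemmatizer. The sketch below illustrates that check under stated assumptions: `safe_lemmatize` is a hypothetical helper, not part of NLTK, and it assumes the `lemmatize` callable accepts the POS as an optional second argument, as WordNetLemmatizer.lemmatize does. The constants are copied from the snippet above so the block runs without NLTK.

```python
# Copied from the nltk.corpus.reader.wordnet snippet above.
ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
POS_LIST = [NOUN, VERB, ADJ, ADV]


def safe_lemmatize(lemmatize, word, pos):
    """Pass `pos` through only when it is one of the POS_LIST values.

    For anything else (for example an empty string), call `lemmatize`
    without a POS so that it falls back to its own default.
    """
    if pos in POS_LIST:
        return lemmatize(word, pos)
    return lemmatize(word)
```

A typical call would be `safe_lemmatize(WordNetLemmatizer().lemmatize, word, pos)`, where `pos` comes from whatever tag-mapping function you use.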
pg2455 answered Sep 21 '22 23:09
