I have POS tagged some words with nltk.pos_tag(), so they are given treebank tags. I would like to lemmatize these words using the known POS tags, but I am not sure how. I was looking at Wordnet lemmatizer, but I am not sure how to convert the treebank POS tags to tags accepted by the lemmatizer. How can I perform this conversion simply, or is there a lemmatizer that uses treebank tags?
The WordNet lemmatizer only knows four parts of speech (ADJ, ADV, NOUN, and VERB), and only the NOUN and VERB rules do anything especially interesting. In the Treebank tagset, the noun tags all start with NN, the verb tags with VB, the adjective tags with JJ, and the adverb tags with RB. So converting from one set of labels to the other is pretty easy, something like:
from nltk.corpus import wordnet
# penn_tag is a Treebank tag such as 'VBD'; tags outside these four groups raise KeyError
morphy_tag = {'NN':wordnet.NOUN,'JJ':wordnet.ADJ,'VB':wordnet.VERB,'RB':wordnet.ADV}[penn_tag[:2]]
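As a slightly fuller sketch of the same idea, the prefix lookup can be wrapped in a function that falls back to a default instead of raising KeyError on tags such as DT or IN. The names penn_to_wordnet and lemmatize_tagged are hypothetical, and the noun fallback is an assumption (it mirrors the lemmatizer's own default pos). The single-letter strings are the values of wordnet.NOUN, wordnet.VERB, wordnet.ADJ and wordnet.ADV.

```python
# WordNet POS codes; these equal wordnet.NOUN, wordnet.VERB, wordnet.ADJ, wordnet.ADV in NLTK.
PREFIX_TO_WORDNET = {"NN": "n", "VB": "v", "JJ": "a", "RB": "r"}


def penn_to_wordnet(penn_tag, default="n"):
    """Map a Treebank tag to a WordNet POS code via its first two characters."""
    # Assumption: anything that is not a noun, verb, adjective or adverb gets `default`.
    return PREFIX_TO_WORDNET.get(penn_tag[:2], default)


def lemmatize_tagged(tagged_tokens, lemmatizer=None):
    """Lemmatize (word, treebank_tag) pairs such as those returned by nltk.pos_tag()."""
    if lemmatizer is None:
        # Imported lazily; needs NLTK and the WordNet corpus (nltk.download('wordnet')).
        from nltk.stem import WordNetLemmatizer
        lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag)) for word, tag in tagged_tokens]
```

With NLTK and its data installed, lemmatize_tagged(nltk.pos_tag(tokens)) should give one lemma per token. Passing default=None to penn_to_wordnet lets you detect unmapped tags and skip them yourself.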
As @engineercoding pointed out in the comments to @rmalouf's answer, the Treebank tagset has many more tags than WordNet has parts of speech.
The following mapping covers as many cases as possible and explicitly marks the Treebank tags that have no match in WordNet with None:
# Create a map between Treebank and WordNet
from nltk.corpus import wordnet as wn
# WordNet POS tags are: NOUN = 'n', ADJ = 'a', VERB = 'v', ADV = 'r', ADJ_SAT = 's'
# Descriptions (c) https://web.stanford.edu/~jurafsky/slp3/10.pdf
tag_map = {
'CC':None, # coordin. conjunction (and, but, or)
'CD':wn.NOUN, # cardinal number (one, two)
'DT':None, # determiner (a, the)
'EX':wn.ADV, # existential ‘there’ (there)
'FW':None, # foreign word (mea culpa)
'IN':wn.ADV, # preposition/sub-conj (of, in, by)
'JJ':[wn.ADJ, wn.ADJ_SAT], # adjective (yellow)
'JJR':[wn.ADJ, wn.ADJ_SAT], # adj., comparative (bigger)
'JJS':[wn.ADJ, wn.ADJ_SAT], # adj., superlative (wildest)
'LS':None, # list item marker (1, 2, One)
'MD':None, # modal (can, should)
'NN':wn.NOUN, # noun, sing. or mass (llama)
'NNS':wn.NOUN, # noun, plural (llamas)
'NNP':wn.NOUN, # proper noun, sing. (IBM)
'NNPS':wn.NOUN, # proper noun, plural (Carolinas)
'PDT':[wn.ADJ, wn.ADJ_SAT], # predeterminer (all, both)
'POS':None, # possessive ending (’s )
'PRP':None, # personal pronoun (I, you, he)
'PRP$':None, # possessive pronoun (your, one’s)
'RB':wn.ADV, # adverb (quickly, never)
'RBR':wn.ADV, # adverb, comparative (faster)
'RBS':wn.ADV, # adverb, superlative (fastest)
'RP':[wn.ADJ, wn.ADJ_SAT], # particle (up, off)
'SYM':None, # symbol (+,%, &)
'TO':None, # “to” (to)
'UH':None, # interjection (ah, oops)
'VB':wn.VERB, # verb base form (eat)
'VBD':wn.VERB, # verb past tense (ate)
'VBG':wn.VERB, # verb gerund (eating)
'VBN':wn.VERB, # verb past participle (eaten)
'VBP':wn.VERB, # verb non-3sg pres (eat)
'VBZ':wn.VERB, # verb 3sg pres (eats)
'WDT':None, # wh-determiner (which, that)
'WP':None, # wh-pronoun (what, who)
'WP$':None, # possessive wh- (whose)
'WRB':None, # wh-adverb (how, where)
'$':None, # dollar sign ($)
'#':None, # pound sign (#)
'``':None, # left quote (‘ or “)
"''":None, # right quote (’ or ”)
'(':None, # left parenthesis ([, (, {, <)
')':None, # right parenthesis (], ), }, >)
',':None, # comma (,)
'.':None, # sentence-final punc (. ! ?)
':':None # mid-sentence punc (: ; ... – -)
}
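The values in tag_map come in three shapes (None, a single POS, or a list of POS tags), so using it takes a little normalization. The following is a minimal sketch, not part of the original answer: candidate_pos and lemmatize_with_map are hypothetical names, demo_map is a small excerpt written with the string values of the wn constants, and lemmatizer is assumed to be any object with a lemmatize(word, pos=...) method, such as nltk.stem.WordNetLemmatizer. Leaving unmapped tokens unchanged is also an assumption.

```python
def candidate_pos(penn_tag, tag_map):
    """Return the WordNet POS codes for a Treebank tag as a (possibly empty) list."""
    value = tag_map.get(penn_tag)  # unknown tags behave like tags mapped to None
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


def lemmatize_with_map(word, penn_tag, tag_map, lemmatizer):
    """Try each candidate POS in order; return the first lemma that differs from the word."""
    for pos in candidate_pos(penn_tag, tag_map):
        lemma = lemmatizer.lemmatize(word, pos=pos)
        if lemma != word:
            return lemma
    # No WordNet counterpart, or no POS changed the form: keep the token as it is.
    return word


# Small excerpt of the mapping; 'a', 's', 'n', 'v' are the values of
# wn.ADJ, wn.ADJ_SAT, wn.NOUN and wn.VERB.
demo_map = {"DT": None, "JJR": ["a", "s"], "NNS": "n", "VBD": "v"}
```

For example, lemmatize_with_map(word, tag, tag_map, WordNetLemmatizer()) could be applied to each (word, tag) pair returned by nltk.pos_tag().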