I need to lemmatize text using nltk. In order to do this, I apply nltk.pos_tag
to each sentence and then convert the resulting Penn Treebank tags (http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) to WordNet tags. I need to do this because WordNetLemmatizer.lemmatize()
expects both the word and its correct pos_tag as arguments, otherwise it will just assume everything is a verb.
I just found that there are five different tags defined in WordNet:
However, every example I found on the internet just ignores wn.ADJ_SAT when converting Treebank tags to WordNet tags. They are all just mapping Penn tags to WordNet tags like this:
So wn.ADJ_SAT is never used.
My question now is if there are cases where the lemmatizer returns a different result for ADJ_SAT than for ADJ. What are examples for words that are satellite adjectives (ADJ_SAT) and no normal adjectives (ADJ)?
Wordnet Lemmatizer (with POS tag) In the above approach, we observed that Wordnet results were not up to the mark. Words like ‘sitting’, ‘flying’ etc remained the same after lemmatization. This is because these words are treated as a noun in the given sentence rather than a verb. To overcome come this, we use POS (Part of Speech) tags.
The following are 30 code examples of nltk.stem.WordNetLemmatizer () . You can vote up the ones you like or vote down the ones you don't like, and go to the original project or source file by following the links above each example. You may also want to check out all available functions/classes of the module nltk.stem , or try the search function .
The wordnet lemmatizer only knows four parts of speech (ADJ, ADV, NOUN, and VERB) and only the NOUN and VERB rules do anything especially interesting. The noun parts of speech in the treebank tagset all start with NN, the verb tags all start with VB, the adjective tags start with JJ, and the adverb tags start with RB.
Wordnet is a publicly available lexical database of over 200 languages that provides semantic relationships between its words. It is one of the earliest and most commonly used lemmatizer technique.
The WordNetLemmatizer
in NLTK
does not differentiate satellite adjectives from normal adjectives.
nltk.stem.WordNetLemmatizer.lemmatize
is uses "WordNet’s built-in morphy function. Returns the input word unchanged if it cannot be found in WordNet."
In WordNet, a satellite adjective--more broadly referred to as a satellite synset--is more of a semantic label used elsewhere in WordNet than a special part-of-speech in nltk.
From the wordnet glossary:
Satellite Synset: Synset in an adjective cluster representing a concept that is similar in meaning to the concept represented by its head synset .
User tripleee
points out in this question the following:
adjectives are subcategorized into 'head' and 'satellite' synsets within an adjective clutser
Also, the nltk
documentation for nltk.stem.WordNetLemmatizer.lemmatize
assumes the default part of speech to be a noun instead of a verb, unless otherwise specified.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With