 

Stemmers vs Lemmatizers

Natural Language Processing (NLP), especially for English, has evolved to the point where stemming would become an archaic technology if "perfect" lemmatizers existed, because stemmers change the surface form of a word/token into meaningless stems.

Then again, the definition of a "perfect" lemmatizer is questionable, because different NLP tasks require different levels of lemmatization, e.g. converting words between verb/noun/adjective forms.

Stemmers

[in]: having [out]: hav 

Lemmatizers

[in]: having [out]: have 
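The contrast can be sketched in a few lines (a hypothetical suffix-stripping stemmer versus a lookup-based lemmatizer; real tools like the Porter stemmer or WordNet's lemmatizer are far more elaborate, and the toy lookup table below is invented for illustration):

```python
# Toy illustration: a stemmer blindly strips suffixes, while a
# lemmatizer maps a surface form to an actual dictionary word.

def toy_stem(word):
    # Naive suffix stripping -- may leave a non-word "stem".
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# A lemmatizer needs lexical knowledge; here, a tiny lookup table.
TOY_LEMMAS = {"having": "have", "has": "have", "had": "have",
              "driving": "drive", "drove": "drive"}

def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)

print(toy_stem("having"))       # -> hav  (not an English word)
print(toy_lemmatize("having"))  # -> have (a dictionary form)
```

Note that the irregular form `drove` defeats the suffix stripper entirely, but a lookup-based lemmatizer handles it for free.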
  • So the question is: are English stemmers useful at all today, given that we have a plethora of lemmatization tools for English?

  • If not, then how should we move on to build robust lemmatizers that can take on nounify, verbify, adjectify and adverbify preprocesses?

  • How could the lemmatization task be easily scaled to other languages that have similar morphological structures as English?

asked Jun 26 '13 10:06 by alvas



1 Answer

Q1: "[..] are English stemmers useful at all today, given that we have a plethora of lemmatization tools for English?"

Yes. Stemmers are much simpler, smaller, and usually faster than lemmatizers, and for many applications their results are good enough; using a lemmatizer there is a waste of resources. Consider, for example, dimensionality reduction in Information Retrieval: you replace all occurrences of drive/driving with driv in both the indexed documents and the query. You do not care whether it is drive or driv or x17a$, as long as it clusters inflectionally related words together.
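A sketch of why the actual stem string is irrelevant for retrieval — the conflation class matters, not its label (the `crude_stem` rule and the three documents below are made up for illustration):

```python
from collections import defaultdict

def crude_stem(word):
    # Made-up rule: chop a final "ing" or "e".
    # It maps both "drive" and "driving" to the non-word "driv".
    if word.endswith("ing"):
        return word[:-3]
    if word.endswith("e"):
        return word[:-1]
    return word

docs = {1: "driving a car", 2: "the drive home", 3: "green tea"}

# Inverted index keyed on stems: drive/driving collapse to one key.
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.split():
        index[crude_stem(token)].add(doc_id)

# A query for "drive" is stemmed the same way, so it matches
# documents containing either inflected form.
print(sorted(index[crude_stem("drive")]))  # -> [1, 2]
```

Whether the shared key is `driv` or any other opaque string makes no difference to which documents are retrieved.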

Q2: "[..] how should we move on to build robust lemmatizers that can take on nounify, verbify, adjectify and adverbify preprocesses?"

What is your definition of a lemma? Does it include derivation (drive - driver) or only inflection (drive - drives - drove)? Does it take semantics into account?

If you want to include derivation (which most people would say includes verbing nouns etc.), then keep in mind that derivation is far more irregular than inflection. There are many idiosyncrasies, gaps, etc. Do you really want change (as in change trains) and change (as in coins) to have the same lemma? If not, where do you draw the boundary? What about nerve - unnerve, or earth - unearth - earthling, ...? It really depends on the application.

If you take semantics into account (bank would be labeled as bank-money or bank-river depending on context), how deep do you go (do you distinguish bank-institution from bank-building)? Some applications may not care about this at all, some might want to distinguish basic semantics, and some might want it fine-grained.

Q3: "How could the lemmatization task be easily scaled to other languages that have similar morphological structures as English?"

What do you mean by "similar morphological structures as English"? English has very little inflectional morphology. There are good lemmatizers for languages of other morphological types (truly inflectional, agglutinative, templatic, ...).

With the possible exception of agglutinative languages, I would argue that a lookup table (say, a compressed trie) is the best solution, possibly with some backup rules for unknown words such as proper names. The lookup is followed by some kind of disambiguation, ranging from trivial (take the first analysis, or take the first one consistent with the word's POS tag) to much more sophisticated. The more sophisticated disambiguators are usually supervised stochastic algorithms (e.g. TreeTagger or Faster), although combinations of machine learning and manually created rules have been done too (see e.g. this).
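A minimal sketch of that lookup-plus-disambiguation pipeline (a plain dict stands in for the compressed trie; the entries and POS tags below are invented for illustration):

```python
# Lookup table: surface form -> list of (lemma, POS) analyses.
# A compressed trie would store the same mapping more compactly.
LOOKUP = {
    "saw":   [("see", "VERB"), ("saw", "NOUN")],
    "left":  [("leave", "VERB"), ("left", "ADJ")],
    "banks": [("bank", "NOUN"), ("bank", "VERB")],
}

def lemmatize(form, pos_tag=None):
    analyses = LOOKUP.get(form)
    if analyses is None:
        # Backup rule for unknown words (e.g. proper names): identity.
        return form
    if pos_tag is not None:
        # Trivial disambiguation: first analysis consistent with the tag.
        for lemma, pos in analyses:
            if pos == pos_tag:
                return lemma
    # Fall back to the first-listed (e.g. most frequent) analysis.
    return analyses[0][0]

print(lemmatize("saw", "VERB"))  # -> see
print(lemmatize("saw", "NOUN"))  # -> saw
print(lemmatize("Jirka"))        # -> Jirka (unknown word, identity)
```

The sophisticated variants differ only in the last step: instead of "first consistent analysis", a trained model scores the candidates in context.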

Obviously, for most languages you do not want to create the lookup table by hand, but instead generate it from a description of the morphology of that language. For inflectional languages, you can go the engineering way of Hajic for Czech or Mikheev for Russian, or, if you are daring, use two-level morphology. Or you can do something in between, such as Hana (myself). (Note that these are all full morphological analyzers that include lemmatization.) Or you can learn the lemmatizer in an unsupervised manner a la Yarowsky and Wicentowski, possibly with manual post-processing correcting the most frequent words.
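The "generate the table from a description" idea, in miniature (the paradigm class and lexicon entries are invented; real systems use far richer paradigms, stem alternations, and exception lists):

```python
# A paradigm describes which suffixes a lexical class takes.
PARADIGMS = {
    "regular_verb": ["", "s", "ed", "ing"],
}

# The lexicon lists each lemma with its paradigm class and stem.
LEXICON = [
    ("walk", "regular_verb", "walk"),
    ("play", "regular_verb", "play"),
]

def build_table(lexicon, paradigms):
    # Expand the compact description into a surface-form -> lemma table.
    table = {}
    for lemma, pclass, stem in lexicon:
        for suffix in paradigms[pclass]:
            table[stem + suffix] = lemma
    return table

TABLE = build_table(LEXICON, PARADIGMS)
print(TABLE["walking"])  # -> walk
print(TABLE["played"])   # -> play
```

Two short descriptions (one paradigm, two lexicon lines) expand into eight table entries; for a real language the same expansion yields the full lookup table without anyone typing it by hand.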

There are way too many options, and it really all depends on what you want to do with the results.

answered Oct 02 '22 22:10 by Jirka