Simple question: When do we stem or lemmatize the words? Is stemming helpful for all nlp processes or are there applications where using full form of words might result in better accuracy or precision?
In the context of machine learning based NLP, stemming makes your training data more dense. It reduces the size of the dictionary (number of words used in the corpus) two or three-fold (of even more for languages with many flections like French, where a single stem can generate dozens of words in case of verbs for instance).
Having the same corpus, but less input dimensions, ML will work better. Recall should really be better.
The downside is, if in some cases the actual word (as opposed to its stem) makes a difference, then your system won't be able to leverage it. So you might lose some precision.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With