I need to take a paragraph of text and extract a list of "tags" from it. Most of this is straightforward, but I now need help stemming the resulting word list to avoid duplicates. For example: Community / Communities.
I've used an implementation of the Porter Stemmer algorithm (I'm writing in PHP, by the way):
http://tartarus.org/~martin/PorterStemmer/php.txt
This works, up to a point, but doesn't return "real" words. The example above is stemmed to "commun".
I've tried "Snowball" (suggested in another Stack Overflow thread):
http://snowball.tartarus.org/demo.php
For my example (community / communities), Snowball stems to "communiti".
Question
Are there any other stemming algorithms that will do this? Has anyone else solved this problem?
My current thinking is that I could use a stemming algorithm to avoid duplicates and then pick the shortest word I encounter as the actual word to display; a rough sketch of that idea follows.
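In code, I'm imagining something like the sketch below. It assumes the PorterStemmer class (with its static Stem() method) from the php.txt linked above; the function name and sample words are just for illustration:

    <?php
    // Group words by stem and keep the shortest surface form per stem.
    require_once 'PorterStemmer.php'; // the tartarus.org php.txt, saved locally

    function tagsFromWords(array $words): array
    {
        $shortest = [];
        foreach ($words as $word) {
            $stem = PorterStemmer::Stem(strtolower($word));
            // The first word seen for this stem, or a shorter one, wins.
            if (!isset($shortest[$stem]) || strlen($word) < strlen($shortest[$stem])) {
                $shortest[$stem] = $word;
            }
        }
        return array_values($shortest);
    }

    // "community" and "communities" share a stem, so only one tag survives.
    print_r(tagsFromWords(['community', 'communities', 'tags', 'tag']));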
Some background: stemming algorithms can be classified into three groups: truncating methods, statistical methods, and mixed methods, each with its own typical way of finding the stems of word variants. One study evaluated three classic suffix-removal stemmers for English (Lovins, Porter, and Paice/Husk) by the accuracy and strength of each algorithm; affix-removal algorithms simply strip suffixes or prefixes from a term, leaving a stem.
Stemming itself is a technique for extracting the base form of a word by removing its affixes, much like cutting a tree's branches back to the stem; for example, the stem of "eating", "eats", and "eaten" is "eat". Reducing words to a common stem (or to the root form known as the lemma) is an important step in natural language understanding (NLU) and natural language processing (NLP).
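As a quick illustration of what a plain suffix stripper does, here is a minimal demo, again assuming the PorterStemmer class from the php.txt linked in the question:

    <?php
    require_once 'PorterStemmer.php';

    foreach (['eating', 'eats', 'eaten', 'communities'] as $word) {
        echo $word, ' -> ', PorterStemmer::Stem($word), "\n";
    }
    // Regular suffixes come off ("eating" -> "eat"), but the result is a
    // stem, not necessarily a dictionary word, and irregular forms such
    // as "eaten" are not mapped back to "eat".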
If I understand correctly, then what you need is not a stemmer but a lemmatizer. A lemmatizer is a tool with knowledge about endings like -ies, -ed, etc., and exceptional word forms like "written". A lemmatizer maps the input word form to its lemma, which is guaranteed to be a "real" word.
There are many lemmatizers for English; I've only used morpha, though. Morpha is just a big lex file that you can compile into an executable. Usage example:
    $ cat test.txt
    Community
    Communities
    $ cat test.txt | ./morpha -uc
    Community
    Community
You can get morpha from http://www.informatics.sussex.ac.uk/research/groups/nlp/carroll/morph.html
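If you need to drive it from PHP, piping through the binary is one option. A minimal sketch, assuming a compiled ./morpha binary in the working directory and using the -uc flags from the transcript above (the lemmatize() helper is made up for illustration):

    <?php
    // Pipe a newline-separated word list through morpha and read back
    // the lemmatized forms.
    function lemmatize(array $words): array
    {
        $input  = escapeshellarg(implode("\n", $words));
        $output = shell_exec("printf '%s' $input | ./morpha -uc");
        return is_string($output) ? preg_split('/\s+/', trim($output)) : [];
    }

    print_r(lemmatize(['Community', 'Communities']));
    // Expected, per the transcript above: both come back as "Community".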
The core issue here is that stemming algorithms operate purely mechanically, based on the language's spelling rules, with no actual understanding of the language they're working with. To produce real words, you'll probably have to merge the stemmer's output with some form of lookup function to convert the stems back to real words. I can basically see two potential ways to do this:
1. Locate or build a large dictionary which maps each possible stem back to an actual word (e.g., communiti -> community).
2. Compare each stem against the other words that were reduced to that stem and pick whichever form seems most representative (e.g., choosing between "community" and "communities" for the stem "communiti").
Personally, I think the way I would do it would be a dynamic form of #1: build up a custom dictionary database by recording every word examined along with what it stemmed to, and then assume that the most common word is the one that should be used. (E.g., if my body of source text uses "communities" more often than "community", then map communiti -> communities.) A dictionary-based approach will be more accurate in general, and building it from the stemmer's input will give results customized to your texts; the primary drawback is the space required, which is generally not an issue these days.
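A minimal sketch of that dynamic dictionary, assuming the same PorterStemmer class linked in the question; the sample word list is illustrative:

    <?php
    require_once 'PorterStemmer.php';

    // Your extracted words; in practice, taken from the source paragraph.
    $words = ['community', 'communities', 'communities', 'tags', 'tag'];

    // Count every surface form seen for each stem.
    $counts = [];
    foreach ($words as $word) {
        $stem = PorterStemmer::Stem(strtolower($word));
        $counts[$stem][$word] = ($counts[$stem][$word] ?? 0) + 1;
    }

    // Map each stem to its most frequent surface form.
    $dictionary = [];
    foreach ($counts as $stem => $forms) {
        arsort($forms);                      // highest count first
        $dictionary[$stem] = array_key_first($forms);
    }

    // With the sample data, "communities" outnumbers "community",
    // so its stem maps back to "communities".
    print_r($dictionary);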