I'm Working on document classification tasks in java.
Both algorithms came highly recommended, what are the benefits and disadvantages of each and which is more commonly used in the literature for Natural Language Processing tasks?
At the very basics of it, the major difference between the porter and lancaster stemming algorithms is that the lancaster stemmer is significantly more aggressive than the porter stemmer.
Lancaster Stemmer is the most aggressive stemming algorithm. It has an edge over other stemming techniques because it offers us the functionality to add our own custom rules in this algorithm when we implement this using the NLTK package. This sometimes results in abrupt results.
Difference Between Porter Stemmer and Snowball Stemmer: There is only a little difference in the working of these two. Words like 'fairly' and 'sportingly' were stemmed to 'fair' and 'sport' in the snowball stemmer but when you use the porter stemmer they are stemmed to 'fairli' and 'sportingli'.
The Porter stemming algorithm (or 'Porter stemmer') is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalisation process that is usually done when setting up Information Retrieval systems.
At the very basics of it, the major difference between the porter and lancaster stemming algorithms is that the lancaster stemmer is significantly more aggressive than the porter stemmer. The three major stemming algorithms in use today are Porter, Snowball(Porter2), and Lancaster (Paice-Husk), with the aggressiveness continuum basically following along those same lines. Porter is the least aggressive algorithm, with the specifics of each algorithm actually being fairly lengthy and technical. Here is a break down for you though:
Porter: Most commonly used stemmer without a doubt, also one of the most gentle stemmers. One of the few stemmers that actually has Java support which is a plus, though it is also the most computationally intensive of the algorithms(Granted not by a very significant margin). It is also the oldest stemming algorithm by a large margin.
Porter2: Nearly universally regarded as an improvement over porter, and for good reason. Porter himself in fact admits that it is better than his original algorithm. Slightly faster computation time than porter, with a fairly large community around it.
Lancaster: Very aggressive stemming algorithm, sometimes to a fault. With porter and snowball, the stemmed representations are usually fairly intuitive to a reader, not so with Lancaster, as many shorter words will become totally obfuscated. The fastest algorithm here, and will reduce your working set of words hugely, but if you want more distinction, not the tool you would want.
Honestly, I feel that Snowball is usually the way to go. There are certain circumstances in which Lancaster will hugely trim down your working set, which can be very useful, however the marginal speed increase over snowball in my opinion is not worth the lack of precision. Porter has the most implementations though and so is usually the default go-to algorithm, but if you can, use snowball.
Snowball is a small string processing language designed for creating stemming algorithms for use in Information Retrieval.
The Snowball compiler translates a Snowball script into another language - currently ISO C, C#, Go, Java, Javascript, Object Pascal, Python and Rust are supported.
Since it effectively provides a ‘suffix STRIPPER GRAMmar’, I had toyed with the idea of calling it ‘strippergram’, but good sense has prevailed, and so it is ‘Snowball’ named as a tribute to SNOBOL, the excellent string handling language of Messrs Farber, Griswold, Poage and Polonsky from the 1960s.
---Martin Porter
Stemmers implemented in the Snowball language are sometimes simply referred to as Snowball stemmers. For example, see the Natural Language Toolkit: nltk.stem.snowball.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With