What are the major differences and benefits of Porter and Lancaster Stemming algorithms? [closed]

I'm working on document classification tasks in Java.

Both algorithms came highly recommended. What are the benefits and disadvantages of each, and which is more commonly used in the literature for Natural Language Processing tasks?

Adam Hess asked May 11 '12 15:05



1 Answer

At its most basic, the major difference between the Porter and Lancaster stemming algorithms is that the Lancaster stemmer is significantly more aggressive than the Porter stemmer. The three major stemming algorithms in use today are Porter, Snowball (Porter2), and Lancaster (Paice-Husk), and their aggressiveness increases roughly in that order, with Porter the least aggressive. The specifics of each algorithm are fairly lengthy and technical, but here is a breakdown:

Porter: Without a doubt the most commonly used stemmer, and also one of the gentlest. It is one of the few stemmers that actually has Java support, which is a plus, though it is also the most computationally intensive of the three (granted, not by a very significant margin). It is also the oldest of the three, dating from 1980.

Porter2: Nearly universally regarded as an improvement over Porter, and for good reason. Porter himself in fact admits that it is better than his original algorithm. It has a slightly faster computation time than Porter and a fairly large community around it.

Lancaster: A very aggressive stemming algorithm, sometimes to a fault. With Porter and Snowball, the stemmed representations are usually fairly intuitive to a reader; not so with Lancaster, where many shorter words become totally obfuscated. It is the fastest algorithm here and will reduce your working set of words hugely, but if you want more distinction between words, it is not the tool you want.

Honestly, I feel that Snowball is usually the way to go. There are certain circumstances in which Lancaster will hugely trim down your working set, which can be very useful; however, the marginal speed increase over Snowball is, in my opinion, not worth the loss of precision. Porter has the most implementations and so is usually the default go-to algorithm, but if you can, use Snowball.
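
If you want to see how much a given stemmer shrinks your working set before committing to one, a rough sketch like the following may help. It counts distinct words before and after stemming for any stemming function you pass in. The names `vocabulary_reduction`, `make_suffix_stripper`, `gentle_toy_stem` and `aggressive_toy_stem` are made up for this illustration, and the two toy stemmers are plain suffix strippers, not Porter or Lancaster; they only stand in for a "gentle" and an "aggressive" stemmer.

```python
from typing import Callable, Iterable, Sequence, Tuple


def vocabulary_reduction(words: Iterable[str], stem: Callable[[str], str]) -> Tuple[int, int]:
    """Return (number of distinct words, number of distinct stems)."""
    vocabulary = {w.lower() for w in words}
    stems = {stem(w) for w in vocabulary}
    return len(vocabulary), len(stems)


def make_suffix_stripper(suffixes: Sequence[str]) -> Callable[[str], str]:
    """Build a toy stemmer that strips the first matching suffix.

    Illustration only: this is NOT the Porter or Lancaster algorithm.
    """
    def stem(word: str) -> str:
        for suffix in suffixes:
            # Keep at least three characters so short words survive.
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[: -len(suffix)]
        return word
    return stem


# Hypothetical stand-ins: a short suffix list versus a longer one.
gentle_toy_stem = make_suffix_stripper(("ing", "ed", "s"))
aggressive_toy_stem = make_suffix_stripper(("ions", "ion", "ing", "ive", "ed", "or", "s"))

WORDS = [
    "connect", "connects", "connected", "connecting",
    "connection", "connections", "connective", "connector",
]

print(vocabulary_reduction(WORDS, gentle_toy_stem))      # (8, 4)
print(vocabulary_reduction(WORDS, aggressive_toy_stem))  # (8, 1)
```

The more aggressive stripper collapses all eight words into one stem, which is the kind of trade-off described above: a smaller working set, but less distinction between words. To compare real stemmers, pass their stemming method as `stem` and run it over your own corpus.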

Snowball - Additional info

Snowball is a small string processing language designed for creating stemming algorithms for use in Information Retrieval.

The Snowball compiler translates a Snowball script into another language - currently ISO C, C#, Go, Java, Javascript, Object Pascal, Python and Rust are supported.

History of the name

Since it effectively provides a ‘suffix STRIPPER GRAMmar’, I had toyed with the idea of calling it ‘strippergram’, but good sense has prevailed, and so it is ‘Snowball’ named as a tribute to SNOBOL, the excellent string handling language of Messrs Farber, Griswold, Poage and Polonsky from the 1960s.
---Martin Porter

Stemmers implemented in the Snowball language are sometimes simply referred to as Snowball stemmers. For example, see the Natural Language Toolkit: nltk.stem.snowball.
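
For a quick side-by-side look at the three stemmers, a minimal sketch using NLTK could look like this. It assumes NLTK is installed; `compare_stemmers` is just an illustrative helper name, not part of NLTK.

```python
def compare_stemmers(words):
    """Return {word: (porter_stem, snowball_stem, lancaster_stem)}."""
    # Imported lazily so this module still loads where NLTK is missing.
    from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

    porter = PorterStemmer()
    snowball = SnowballStemmer("english")  # the Porter2 algorithm
    lancaster = LancasterStemmer()
    return {
        w: (porter.stem(w), snowball.stem(w), lancaster.stem(w))
        for w in words
    }
```

Calling `compare_stemmers(["fairly", "sportingly", "running"])` returns one tuple of stems per word, which makes it easy to eyeball where the three algorithms disagree on your own vocabulary.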

Slater Victoroff answered Sep 19 '22 12:09