Stemming algorithm that produces real words

Tags: php, nlp, stemming, snowball, porter-stemmer

I need to take a paragraph of text and extract from it a list of "tags". Most of this is quite straightforward. However, I now need some help stemming the resulting word list to avoid duplicates. Example: Community / Communities

I've used an implementation of the Porter Stemmer algorithm (I'm writing in PHP, by the way):

http://tartarus.org/~martin/PorterStemmer/php.txt

This works, up to a point, but doesn't return "real" words. The example above is stemmed to "commun".

I've tried "Snowball" (suggested within another Stack Overflow thread).

http://snowball.tartarus.org/demo.php

For my example (community / communities), Snowball stems to "communiti".

Question

Are there any other stemming algorithms that will do this? Has anyone else solved this problem?

My current thinking is that I could use a stemming algorithm to avoid duplicates and then pick the shortest word I encounter to be the actual word to display.

684

asked Oct 10 '08 10:10

Dave

2 Answers

If I understand correctly, what you need is not a stemmer but a lemmatizer. A lemmatizer is a tool with knowledge about endings like -ies, -ed, etc., and about exceptional word forms like written. It maps the input word form to its lemma, which is guaranteed to be a "real" word.

There are many lemmatizers for English, though I've only used morpha. Morpha is just a big lex file which you can compile into an executable. Usage example:

    $ cat test.txt
    Community
    Communities
    $ cat test.txt | ./morpha -uc
    Community
    Community

You can get morpha from http://www.informatics.sussex.ac.uk/research/groups/nlp/carroll/morph.html

113

answered Sep 22 '22 22:09

Kaarel

The core issue here is that stemming algorithms operate ~~on a phonetic basis~~ purely based on the language's spelling rules with no actual understanding of the language they're working with. To produce real words, you'll probably have to merge the stemmer's output with some form of lookup function to convert the stems back to real words. I can basically see two potential ways to do this:

1. Locate or create a large dictionary which maps each possible stem back to an actual word. (e.g., communiti -> community)
2. Create a function which compares each stem to a list of the words that were reduced to that stem and attempts to determine which is most similar. (e.g., comparing "communiti" against "community" and "communities" in such a way that "community" will be recognized as the more similar option)
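
As a rough illustration of the second option, here is a minimal PHP sketch that uses edit distance as the similarity measure. The function name `closestWordToStem` is made up for this example, and edit distance is only one possible notion of "most similar", so treat it as a starting point rather than a finished solution.

```php
<?php
// Hypothetical helper: from the words that were reduced to $stem,
// return the one with the smallest edit distance to the stem itself.
function closestWordToStem(string $stem, array $candidates): ?string
{
    $best = null;
    $bestDistance = PHP_INT_MAX;

    foreach ($candidates as $word) {
        // levenshtein() is PHP's built-in edit-distance function.
        $distance = levenshtein($stem, strtolower($word));
        if ($distance < $bestDistance) {
            $best = $word;
            $bestDistance = $distance;
        }
    }

    return $best; // null when $candidates is empty
}

// "community" is one edit away from "communiti"; "communities" is two.
echo closestWordToStem('communiti', ['communities', 'community']), "\n"; // community
```

On ties, the first candidate in the list wins, so the order in which words are collected affects the result.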

Personally, I would use a dynamic form of #1: build up a custom dictionary database by recording every word examined along with what it stemmed to, then assume that the most common word is the one that should be used (e.g., if my body of source text uses "communities" more often than "community", then map communiti -> communities). A dictionary-based approach will be more accurate in general, and building it from the stemmer's input will provide results customized to your texts. The primary drawback is the space required, which is generally not an issue these days.
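
A minimal in-memory sketch of that frequency-based dictionary might look like the following. The names `buildStemDictionary` and `$toyStem` are invented for illustration, and the toy stemmer is only a stand-in so the example runs on its own; in practice you would pass in whatever Porter or Snowball implementation you already use, and probably persist the counts in a database.

```php
<?php
// Build a stem => display-word map: count how often each surface form
// occurs per stem, then keep the most frequent form for that stem.
// $stem is any callable that maps a word to its stem (assumption: you
// plug in your real stemmer here).
function buildStemDictionary(array $words, callable $stem): array
{
    $counts = [];
    foreach ($words as $word) {
        $word = strtolower($word);
        $key = $stem($word);
        $counts[$key][$word] = ($counts[$key][$word] ?? 0) + 1;
    }

    $dictionary = [];
    foreach ($counts as $key => $forms) {
        // array_search() returns the first form with the highest count,
        // so ties go to the form that was seen first.
        $dictionary[$key] = array_search(max($forms), $forms);
    }

    return $dictionary;
}

// Toy stand-in for a real stemmer, for demonstration only.
$toyStem = function (string $word): string {
    return preg_replace('/(ies|y)$/', 'i', $word);
};

$words = ['Communities', 'community', 'communities', 'tags'];
$dictionary = buildStemDictionary($words, $toyStem);

// "communities" occurs twice and "community" once,
// so the stem "communiti" maps to "communities".
print_r($dictionary);
```

Picking the shortest form instead of the most frequent one, as suggested in the question, would only require changing the selection step in the second loop.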

answered Sep 23 '22 22:09

Dave Sherohman
