I am looking for a class or method that takes a long string of many hundreds of words, tokenizes it, removes the stop words, and stems the remaining words for use in an IR system.

For example:

"The big fat cat, said 'your funniest guy i know' to the kangaroo..."

The tokenizer would remove the punctuation and return an ArrayList of words. The stop word remover would remove words like "the", "to", etc. The stemmer would reduce each word to its 'root'; for example, 'funniest' would become 'funny'.

Many thanks in advance.


Tokenizer, Stop Word Removal, Stemming in Java

3 Answers

AFAIK Lucene can do what you want. With StandardAnalyzer and StopAnalyzer you can do the stop word removal. In combination with the Lucene contrib-snowball project (which includes work from Snowball), you can do the stemming too.

But for stemming, also consider this answer to "Stemming algorithm that produces real words".
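To see what these analysis steps amount to before pulling in Lucene, here is a minimal plain-Java sketch. It is not Lucene's API: the class name `ToyAnalyzer`, the tiny stop list, and the suffix rules are all assumptions made up for illustration, and the "stemmer" is only a crude stand-in for a real Porter or Snowball stemmer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class ToyAnalyzer {
    // Hypothetical, deliberately tiny stop list; real lists are much longer.
    private static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("a", "an", "and", "i", "of", "the", "to"));

    // Lower-case the text and split on runs of non-letter, non-digit characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Crude suffix stripping. This is NOT the Porter or Snowball algorithm
    // and will over- or under-stem many words.
    static String stem(String w) {
        if (w.length() > 5 && w.endsWith("iest")) {
            return w.substring(0, w.length() - 4) + "y";
        }
        if (w.length() > 4 && w.endsWith("ies")) {
            return w.substring(0, w.length() - 3) + "y";
        }
        if (w.length() > 3 && w.endsWith("s") && !w.endsWith("ss")) {
            return w.substring(0, w.length() - 1);
        }
        return w;
    }

    // Tokenize, drop stop words, then stem whatever is left.
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String t : tokenize(text)) {
            if (!STOP_WORDS.contains(t)) {
                out.add(stem(t));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze(
                "The big fat cat, said 'your funniest guy i know' to the kangaroo..."));
    }
}
```

On the sentence from the question this prints `[big, fat, cat, said, your, funny, guy, know, kangaroo]`. A real analyzer chain does the same three things with far better rules, which is the reason to reach for a library rather than extend a toy like this.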


answered Nov 02 '22 11:11

jitter

These are standard requirements in Natural Language Processing, so I would look in such toolkits. Since you require Java, I'd start with OpenNLP: http://opennlp.sourceforge.net/

If you can look at other languages, there is also NLTK (Python).

Note that "your funniest guy i know" is not standard syntax and this makes it harder to process than "You're the funniest guy I know". Not impossible, but much harder. I don't know of any system that would equate "your" to "you are".
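To make the difficulty concrete, here is a minimal sketch of dictionary-based normalization. The class `ContractionExpander` and its three-entry table are hypothetical, not taken from OpenNLP or NLTK. A lookup can expand an unambiguous contraction such as "you're", but it has no way to tell whether "your" is a misspelled "you're" or a genuine possessive, so it leaves it untouched.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.StringJoiner;

class ContractionExpander {
    // Hypothetical lookup table; a real one would need many more entries.
    private static final Map<String, String> EXPANSIONS = new HashMap<>();
    static {
        EXPANSIONS.put("you're", "you are");
        EXPANSIONS.put("i'm", "i am");
        EXPANSIONS.put("don't", "do not");
    }

    // Lower-case the text, split on whitespace, and replace known contractions.
    static String expand(String text) {
        StringJoiner out = new StringJoiner(" ");
        for (String word : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            out.add(EXPANSIONS.getOrDefault(word, word));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The contraction is in the table, so it is expanded.
        System.out.println(expand("You're the funniest guy I know"));
        // "your" is a valid word in its own right, so nothing changes.
        System.out.println(expand("your funniest guy i know"));
    }
}
```

Resolving the second case would need context, for example a part-of-speech tagger or a language model, which is well beyond a lookup table.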

answered Nov 02 '22 11:11

peter.murray.rust

I have dealt with this issue on a number of tasks I have worked on, so let me give a tokenizer suggestion. As I do not see it given directly as an answer: I often use edu.northwestern.at.utils.corpuslinguistics.tokenizer.* as my family of tokenizers, and in a number of cases I have used the PennTreebankTokenizer class. Here is how you use it:

    import java.util.List;
    import edu.northwestern.at.utils.corpuslinguistics.tokenizer.*;

    WordTokenizer wordTokenizer = new PennTreebankTokenizer();
    List<String> words = wordTokenizer.extractWords(text);

The link to this work is here. Just a disclaimer, I have no affiliation with Northwestern, the group, or the work they do. I am just someone who uses the code occasionally.
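If adding a third-party tokenizer is not an option, a rough substitute can be sketched with `java.text.BreakIterator` from the JDK. This is not the Penn Treebank tokenization that `PennTreebankTokenizer` performs; the class name `SimpleWordTokenizer` is hypothetical, and the method name `extractWords` is reused only to mirror the call above.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SimpleWordTokenizer {
    // Walk the word boundaries reported by the JDK and keep the segments
    // that contain at least one letter or digit.
    static List<String> extractWords(String text) {
        List<String> words = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            // Whitespace and punctuation segments contain no letters or digits.
            if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                words.add(piece);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(extractWords("The big fat cat, said hello."));
    }
}
```

Unlike a Treebank-style tokenizer, this sketch makes no attempt to split clitics or normalize quotes, so treat it as a starting point only.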

answered Nov 02 '22 09:11

demongolem

Phil