
Remove stop words from the parsed content using OpenNLP

I parsed a document using the OpenNLP parser code provided in this link and got the following output:

(TOP (S (NP (NN Programcreek)) (VP (VBZ is) (NP (DT a) (ADJP (RB very) (JJ huge) (CC and) (JJ useful)) (NN website)))))

From this I want to extract only the meaningful words. That is, I want to remove all stop words, because I plan to run classification on the words that remain. Can you suggest how to remove stop words from the parsed output?

In the end, I want output like the following:

   (TOP (S (NP (NN Programcreek)) (JJ useful)) (NN website)))))

Please help me with this. If it is not possible with OpenNLP, please suggest another Java library for natural language processing, since my main aim is to parse the document and keep only the meaningful words.

user2598214 asked Jul 19 '13 05:07



1 Answer

It seems that OpenNLP doesn't support this feature. You will have to do as Olena Vikariy suggests and implement it yourself, or use a different Java NLP library such as Mallet.

The following Java snippet removes stop words with a regular expression (the stop-word list doesn't need to be sorted):

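import java.util.regex.Matcher;
import java.util.regex.Pattern;

String testText = "This is a text you want to test";
// Sample stop words only; replace with a full list as needed.
String[] stopWords = new String[]{"a", "able", "about", "above", "according", "accordingly", "across", "actually", "after", "afterwards", "again", "against", "all"};
// Build one alternation: "a|able|about|..."
String stopWordsPattern = String.join("|", stopWords);
// Match whole words only (\b), ignoring case, plus any trailing whitespace.
Pattern pattern = Pattern.compile("\\b(?:" + stopWordsPattern + ")\\b\\s*", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(testText);
testText = matcher.replaceAll("");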

You can use this list of English stop words.
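The same idea can be applied directly to the bracketed parse output from the question. The following is only a sketch under some assumptions: it treats every "(TAG word)" pair as a leaf, returns the surviving words as a flat list rather than a pruned tree, and uses a tiny made-up stop-word set. The class name ParseStopWordFilter and the method contentWords are hypothetical and are not part of OpenNLP.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not an OpenNLP class.
public class ParseStopWordFilter {

    // A leaf looks like "(TAG word)": a tag, one space, and a token,
    // neither of which contains parentheses or whitespace.
    private static final Pattern LEAF =
            Pattern.compile("\\(([^()\\s]+) ([^()\\s]+)\\)");

    /** Returns the leaf words of a bracketed parse, in order, minus the stop words. */
    public static List<String> contentWords(String parse, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        Matcher m = LEAF.matcher(parse);
        while (m.find()) {
            String word = m.group(2);
            // Compare in lower case so "A" and "a" are treated alike;
            // the stop-word set is assumed to be lower case.
            if (!stopWords.contains(word.toLowerCase(Locale.ROOT))) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String parse = "(TOP (S (NP (NN Programcreek)) (VP (VBZ is) (NP (DT a) "
                + "(ADJP (RB very) (JJ huge) (CC and) (JJ useful)) (NN website)))))";
        // Assumed sample set; use a full stop-word list in practice.
        Set<String> stopWords = Set.of("is", "a", "very", "and");
        System.out.println(contentWords(parse, stopWords));
        // prints [Programcreek, huge, useful, website]
    }
}
```

Note that "huge" survives here because it is not in the sample set. Filtering on the part-of-speech tag in group 1 (for example, keeping only NN and JJ leaves) would be another option if a word list alone turns out to be too coarse.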

Alternatively, with Mallet, you can follow the tutorial here. Stop-word removal is handled by a Pipe added to the pipeline for this purpose:

pipeList.add(new TokenSequenceRemoveStopwords(false, false));

Mallet includes a list of stop words, so you don't need to define them, and the list can be extended if needed.

Hope this helps.

c-chavez answered Sep 22 '22 07:09
