Remove stop words from the parsed content using OpenNLP

Question

I parsed a document using the OpenNLP parser code provided in this link and got the following output:

(TOP (S (NP (NN Programcreek)) (VP (VBZ is) (NP (DT a) (ADJP (RB very) (JJ huge) (CC and) (JJ useful)) (NN website)))))

From this I want to extract only the meaningful words, that is, remove all stop words, because I want to run classification on the remaining words. Can you suggest how to remove stop words from the parsed output?

In the end I want to get the following output:

   (TOP (S (NP (NN Programcreek)) (JJ useful)) (NN website)))))

Please help me with this. If it is not possible with OpenNLP, please suggest another Java library for natural language processing, because my main aim is to parse the document and keep only the meaningful words.

c-chavez · Accepted Answer

It seems that OpenNLP doesn't support this feature. You will have to do as Olena Vikariy suggests and implement it yourself, or use a different NLP library in Java like Mallet.

The following Java implementation removes stop words (the list doesn't need to be sorted):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String testText = "This is a text you want to test";
String[] stopWords = new String[]{"a", "able", "about", "above", "according", "accordingly", "across", "actually", "after", "afterwards", "again", "against", "all"};
// Join the stop words into one alternation; \\b anchors on word boundaries and \\s* also consumes trailing whitespace.
String stopWordsPattern = String.join("|", stopWords);
Pattern pattern = Pattern.compile("\\b(?:" + stopWordsPattern + ")\\b\\s*", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(testText);
testText = matcher.replaceAll("");

You can use this list of English stop words.
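To apply the same idea to the bracketed parse string from the question, here is a minimal sketch that uses only the Java standard library rather than any OpenNLP API. It assumes each leaf of the parse has the form "(TAG word)", pulls the words out with a regex, and keeps those not found in a stop word set. The class name ParseStopWordFilter, the method contentWords, and the tiny stop word set are made up for illustration; a real list would be much longer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of OpenNLP.
class ParseStopWordFilter {

    // Matches a leaf such as "(NN website)": group 1 is the POS tag, group 2 is the word.
    private static final Pattern LEAF = Pattern.compile("\\(([^\\s()]+)\\s+([^\\s()]+)\\)");

    /** Returns the leaf words of a bracketed parse that are not in stopWords (compared in lower case). */
    static List<String> contentWords(String parse, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        Matcher m = LEAF.matcher(parse);
        while (m.find()) {
            String word = m.group(2);
            if (!stopWords.contains(word.toLowerCase(Locale.ROOT))) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String parse = "(TOP (S (NP (NN Programcreek)) (VP (VBZ is) (NP (DT a) "
                + "(ADJP (RB very) (JJ huge) (CC and) (JJ useful)) (NN website)))))";
        // Illustrative stop word set only (assumption); replace with a full list.
        Set<String> stopWords = new HashSet<>(Arrays.asList("is", "a", "very", "and"));
        System.out.println(contentWords(parse, stopWords)); // prints [Programcreek, huge, useful, website]
    }
}
```

This returns a flat list of words rather than a pruned tree. If the POS tags are needed for classification, group 1 of the same match holds the tag for each kept word.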

Alternatively, with Mallet you can follow the tutorial here. Stop word removal is handled by a dedicated Pipe:

pipeList.add(new TokenSequenceRemoveStopwords(false, false));

Mallet includes a list of stop words so you don't need to define them, but it can also be extended if needed.

Hope this helps.

Tags:

java

nlp

stop-words

opennlp

user2598214