Removing stopwords from a String in Java

I have a string with lots of words, and a text file containing stopwords that I need to remove from that string. Let's say I have this String:

s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs."

After removing stopwords, string should be like :

"love phone, super fast much cool jelly bean....but recently bugs."

I have been able to achieve this, but the problem I am facing is that whenever there are adjacent stopwords in the String, only the first one is removed, and I get this result:

"love phone, super fast there's much and cool with jelly bean....but recently seen bugs"  

Here's my stopwordslist.txt file : Stopwords

How can I solve this problem? Here's what I have done so far:

    int k=0,i,j;
    ArrayList<String> wordsList = new ArrayList<String>();
    String sCurrentLine;
    String[] stopwords = new String[2000];
    try{
        FileReader fr=new FileReader("F:\\stopwordslist.txt");
        BufferedReader br= new BufferedReader(fr);
        while ((sCurrentLine = br.readLine()) != null){
            stopwords[k]=sCurrentLine;
            k++;
        }
        String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
        StringBuilder builder = new StringBuilder(s);
        String[] words = builder.toString().split("\\s");
        for (String word : words){
            wordsList.add(word);
        }
        for(int ii = 0; ii < wordsList.size(); ii++){
            for(int jj = 0; jj < k; jj++){
                if(stopwords[jj].contains(wordsList.get(ii).toLowerCase())){
                    wordsList.remove(ii);
                    break;
                }
            }
        }
        for (String str : wordsList){
            System.out.print(str+" ");
        }
    }catch(Exception ex){
        System.out.println(ex);
    }
asked Dec 29 '14 08:12 by JavaLearner



2 Answers

This is a much more elegant solution (IMHO), using only regular expressions:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // instead of the ".....", add all your stopwords, separated by "|"
    // "\\b" is to account for word boundaries, i.e. not replace "his" in "this"
    // the "\\s?" is to suppress optional trailing white space
    Pattern p = Pattern.compile("\\b(I|this|its.....)\\b\\s?");
    Matcher m = p.matcher("I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.");
    String s = m.replaceAll("");
    System.out.println(s);
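
If the stopwords come from a list rather than being typed into the pattern by hand, the alternation can be assembled programmatically. The following is only a sketch of that idea, not part of the original answer: the class name `StopwordRegex` and the helpers `compile` and `strip` are hypothetical, and the case-insensitive flag and longest-first ordering are assumptions added here (so that, for example, "I've" is tried before "I").

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class StopwordRegex {

    // Builds one pattern that matches any stopword as a whole word,
    // followed by at most one whitespace character.
    static Pattern compile(List<String> stopwords) {
        String alternation = stopwords.stream()
                // longest first, so "I've" wins over "I" (assumption of this sketch)
                .sorted(Comparator.<String>comparingInt(String::length).reversed())
                // quote each word so characters like "." are taken literally
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\b(?:" + alternation + ")\\b\\s?",
                Pattern.CASE_INSENSITIVE);
    }

    // Removes all stopword matches and trims leftover edge whitespace.
    static String strip(String text, List<String> stopwords) {
        return compile(stopwords).matcher(text).replaceAll("").trim();
    }

    public static void main(String[] args) {
        List<String> stopwords = Arrays.asList("I", "this", "its", "and", "so");
        // prints: love phone, super fast cool
        System.out.println(strip("I love this phone, its super fast and so cool", stopwords));
    }
}
```

Because `\\b` treats an apostrophe as a word boundary, a stopword such as "there" would still match inside "there's" and leave "'s" behind, so contractions need to appear in the list themselves.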
answered Nov 04 '22 11:11 by geert3


Try the program below.

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.Set;

    String s="I love this phone, its super fast and there's so" +
            " much new and cool things with jelly bean....but of recently I've seen some bugs.";
    String[] words = s.split(" ");
    ArrayList<String> wordsList = new ArrayList<String>();
    // stopwords are stored in upper case so the lookup can ignore case
    Set<String> stopWordsSet = new HashSet<String>();
    stopWordsSet.add("I");
    stopWordsSet.add("THIS");
    stopWordsSet.add("AND");
    stopWordsSet.add("THERE'S");

    // keep only the words that are not in the stopword set
    for(String word : words)
    {
        String wordCompare = word.toUpperCase();
        if(!stopWordsSet.contains(wordCompare))
        {
            wordsList.add(word);
        }
    }

    for (String str : wordsList){
        System.out.print(str+" ");
    }

OUTPUT: love phone, its super fast so much new cool things with jelly bean....but of recently I've seen some bugs.
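
For context on why the loop in the question skips adjacent stopwords: after `wordsList.remove(ii)` the next word shifts into index `ii`, but the outer loop still increments `ii`, so that word is never checked. In addition, `stopwords[jj].contains(word)` is a substring test on the stopword, so it can match words that are not stopwords at all. The sketch below shows one way to avoid both issues with a set lookup and `removeIf`. The class and method names are hypothetical, and it assumes the stopwords are stored in lower case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopwordFilter {

    // Drops every whitespace-separated word that equals a stopword,
    // ignoring case. The set is assumed to hold lower-case entries.
    static String removeStopwords(String text, Set<String> lowerCaseStopwords) {
        List<String> words = new ArrayList<>(Arrays.asList(text.split("\\s+")));
        // removeIf tests every element, so adjacent stopwords are not skipped
        words.removeIf(w -> lowerCaseStopwords.contains(w.toLowerCase(Locale.ROOT)));
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        Set<String> stopwords = new HashSet<>(
                Arrays.asList("i", "this", "its", "and", "there's", "so"));
        // prints: love phone, super fast much
        System.out.println(removeStopwords(
                "I love this phone, its super fast and there's so much", stopwords));
    }
}
```

The set could be filled from stopwordslist.txt instead of a literal list, for example with `Files.readAllLines`, assuming the file holds one stopword per line. Words with attached punctuation such as "phone," are compared as-is, so they only match if the stopword list contains them in that form.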

answered Nov 04 '22 09:11 by robin