
Word count with Java 8

I am trying to implement a word count program in Java 8, but I am unable to make it work. The method must take a String as a parameter and return a Map<String, Integer>.

When I do it the old Java way, everything works fine. But when I try to do it in Java 8 style, it returns a map whose keys are empty or blank instead of the words.

Here is my code in Java 8 style:

public Map<String, Integer> countJava8(String input) {
    return Pattern.compile("(\\w+)")
                  .splitAsStream(input)
                  .collect(Collectors.groupingBy(e -> e.toLowerCase(),
                                                 Collectors.reducing(0, e -> 1, Integer::sum)));
}

Here is the code I would normally use:

public Map<String, Integer> count(String input) {
    Map<String, Integer> wordcount = new HashMap<>();
    Pattern compile = Pattern.compile("(\\w+)");
    Matcher matcher = compile.matcher(input);

    while (matcher.find()) {
        String word = matcher.group().toLowerCase();
        if (wordcount.containsKey(word)) {
            Integer count = wordcount.get(word);
            wordcount.put(word, ++count);
        } else {
            wordcount.put(word.toLowerCase(), 1);
        }
    }
    return wordcount;
}

The main program:

public static void main(String[] args) {
    WordCount wordCount = new WordCount();
    Map<String, Integer> phrase = wordCount.countJava8("one fish two fish red fish blue fish");
    Map<String, Integer> count = wordCount.count("one fish two fish red fish blue fish");

    System.out.println(phrase);
    System.out.println();
    System.out.println(count);
}

When I run this program, I get the following output:

{ =7, =1}
{red=1, blue=1, one=1, fish=4, two=1}

I thought that splitAsStream would stream the elements that match the regex. How can I correct that?

Dimitri, asked Jul 02 '15 17:07



1 Answer

The problem is that you are in fact splitting by words: splitAsStream treats the pattern as a delimiter, so you are streaming over everything that is not a word, i.e. whatever lies in between the words. Unfortunately, Java 8 seems to have no equivalent method for streaming the actual match results (hard to believe, but I did not find any; feel free to comment if you know one).
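To make the difference visible, here is a small sketch that prints what splitAsStream actually hands to the collector for both patterns. The class name SplitDemo, the helper pieces and the shortened input are made up for illustration and are not part of the question's code.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical demo class, only meant to show what splitAsStream emits.
class SplitDemo {

    // Collects the pieces that remain after splitting the input around the regex.
    static List<String> pieces(String regex, String input) {
        return Pattern.compile(regex)
                      .splitAsStream(input)
                      .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String input = "one fish two";
        // Splitting ON words: only the text between the words is left,
        // plus an empty string in front of the first word.
        System.out.println(pieces("\\w+", input));
        // Splitting on non-words: the words themselves are left.
        System.out.println(pieces("\\W+", input));
    }
}
```

With \w+ as the delimiter, the stream contains one empty string (the text before the first word) and one blank per gap between words, which matches the two odd keys in the question's output.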

Instead, you could just split by non-words, using \W instead of \w. Also, as noted in the comments, you can make it a bit more readable by using String::toLowerCase instead of a lambda, and Collectors.summingInt instead of Collectors.reducing.

public static Map<String, Integer> countJava8(String input) {
    return Pattern.compile("\\W+")
                  .splitAsStream(input)
                  .collect(Collectors.groupingBy(String::toLowerCase,
                                                 Collectors.summingInt(s -> 1)));
}
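One caveat with the split-based version: if the input starts with a non-word character (a leading space, say), the split produces an empty leading piece, and an entirely empty input yields a single empty string, so the map can end up with an empty key. A possible variation, assuming you want to drop those, is to filter out empty strings first. The class name WordCountSplit and method name countWords are placeholders for this sketch.

```java
import java.util.Map;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical class name; a sketch of the split-based count with empty pieces removed.
class WordCountSplit {

    static Map<String, Integer> countWords(String input) {
        return Pattern.compile("\\W+")
                      .splitAsStream(input)
                      // Drop the empty piece that appears when the input
                      // starts with a non-word character or is empty.
                      .filter(s -> !s.isEmpty())
                      .collect(Collectors.groupingBy(String::toLowerCase,
                                                     Collectors.summingInt(s -> 1)));
    }

    public static void main(String[] args) {
        System.out.println(countWords("  One fish, one FISH!"));
    }
}
```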

But IMHO this is still hard to comprehend, not least because of the "inverse" lookup, and it is also difficult to generalize to other, more complex patterns. Personally, I would just go with the "old school" solution, maybe making it a bit more compact using the new Map.getOrDefault.

public static Map<String, Integer> countOldschool(String input) {
    Map<String, Integer> wordcount = new HashMap<>();
    Matcher matcher = Pattern.compile("\\w+").matcher(input);
    while (matcher.find()) {
        String word = matcher.group().toLowerCase();
        wordcount.put(word, wordcount.getOrDefault(word, 0) + 1);
    }
    return wordcount;
}

For the sample input, both versions produce the same result.
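Regarding the missing stream of match results mentioned above: this only applies to Java 8. Assuming you can use Java 9 or later, Matcher.results() returns a Stream<MatchResult>, so the matching pattern \w+ can be kept as it is. A minimal sketch (the class name WordCountResults and method name countWords are placeholders, and this will not compile on Java 8):

```java
import java.util.Map;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical class name; requires Java 9+ because of Matcher.results().
class WordCountResults {

    static Map<String, Integer> countWords(String input) {
        return Pattern.compile("\\w+")
                      .matcher(input)
                      .results()                 // one MatchResult per word found
                      .map(MatchResult::group)   // the matched text itself
                      .collect(Collectors.groupingBy(String::toLowerCase,
                                                     Collectors.summingInt(s -> 1)));
    }

    public static void main(String[] args) {
        System.out.println(countWords("one fish two fish red fish blue fish"));
    }
}
```

Since this streams the matches rather than the gaps, there is no "inverse" pattern to reason about and no empty leading piece to filter out.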

tobias_k, answered Sep 19 '22 06:09