Word frequency count Java 8

Tags:

How to count the frequency of words of List in Java 8?

List <String> wordsList = Lists.newArrayList("hello", "bye", "ciao", "bye", "ciao");

The result must be:

{ciao=2, hello=1, bye=2}

268

asked Mar 18 '15 12:03

3 Answers

I want to share the solution I found because at first I expected to use map-and-reduce methods, but it was a bit different.

Map<String,Long> collect = wordsList.stream()
    .collect( Collectors.groupingBy( Function.identity(), Collectors.counting() ));

Or for Integer values:

Map<String,Integer> collect = wordsList.stream()
     .collect( Collectors.groupingBy( Function.identity(), Collectors.summingInt(e -> 1) ));

EDIT

I add how to sort the map by value:

LinkedHashMap<String, Long> countByWordSorted = collect.entrySet()
            .stream()
            .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
            .collect(Collectors.toMap(
                    Map.Entry::getKey,
                    Map.Entry::getValue,
                    (v1, v2) -> {
                        throw new IllegalStateException();
                    },
                    LinkedHashMap::new
            ));

answered Oct 24 '22 00:10

Mouna

(NOTE: See the edits below)

As an alternative to Mounas answer, here is an approach that does the word count in parallel:

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ParallelWordCount
{
    public static void main(String[] args)
    {
        List<String> list = Arrays.asList(
            "hello", "bye", "ciao", "bye", "ciao");
        Map<String, Integer> counts = list.parallelStream().
            collect(Collectors.toConcurrentMap(
                w -> w, w -> 1, Integer::sum));
        System.out.println(counts);
    }
}

EDIT In response to the comment, I ran a small test with JMH, comparing the toConcurrentMap and the groupingByConcurrent approach, with different input list sizes and random words of different lengths. This test suggested that the toConcurrentMap approach was faster. When considering how different these approaches are "under the hood", it's hard to predict something like this.

As a further extension, based on further comments, I extended the test to cover all four combinations of toMap, groupingBy, serial and parallel.

The results are still that the toMap approach is faster, but unexpectedly (at least, for me) the "concurrent" versions in both cases are slower than the serial versions...:

             (method)  (count) (wordLength)  Mode  Cnt     Score    Error  Units
      toConcurrentMap     1000            2  avgt   50   146,636 ±  0,880  us/op
      toConcurrentMap     1000            5  avgt   50   272,762 ±  1,232  us/op
      toConcurrentMap     1000           10  avgt   50   271,121 ±  1,125  us/op
                toMap     1000            2  avgt   50    44,396 ±  0,541  us/op
                toMap     1000            5  avgt   50    46,938 ±  0,872  us/op
                toMap     1000           10  avgt   50    46,180 ±  0,557  us/op
           groupingBy     1000            2  avgt   50    46,797 ±  1,181  us/op
           groupingBy     1000            5  avgt   50    68,992 ±  1,537  us/op
           groupingBy     1000           10  avgt   50    68,636 ±  1,349  us/op
 groupingByConcurrent     1000            2  avgt   50   231,458 ±  0,658  us/op
 groupingByConcurrent     1000            5  avgt   50   438,975 ±  1,591  us/op
 groupingByConcurrent     1000           10  avgt   50   437,765 ±  1,139  us/op
      toConcurrentMap    10000            2  avgt   50   712,113 ±  6,340  us/op
      toConcurrentMap    10000            5  avgt   50  1809,356 ±  9,344  us/op
      toConcurrentMap    10000           10  avgt   50  1813,814 ± 16,190  us/op
                toMap    10000            2  avgt   50   341,004 ± 16,074  us/op
                toMap    10000            5  avgt   50   535,122 ± 24,674  us/op
                toMap    10000           10  avgt   50   511,186 ±  3,444  us/op
           groupingBy    10000            2  avgt   50   340,984 ±  6,235  us/op
           groupingBy    10000            5  avgt   50   708,553 ±  6,369  us/op
           groupingBy    10000           10  avgt   50   712,858 ± 10,248  us/op
 groupingByConcurrent    10000            2  avgt   50   901,842 ±  8,685  us/op
 groupingByConcurrent    10000            5  avgt   50  3762,478 ± 21,408  us/op
 groupingByConcurrent    10000           10  avgt   50  3795,530 ± 32,096  us/op

I'm not so experienced with JMH, maybe I did something wrong here - suggestions and corrections are welcome:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;
import java.util.stream.Collectors;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Thread)
public class ParallelWordCount
{

    @Param({"toConcurrentMap", "toMap", "groupingBy", "groupingByConcurrent"})
    public String method;

    @Param({"2", "5", "10"})
    public int wordLength;

    @Param({"1000", "10000" })
    public int count;

    private List<String> list;

    @Setup
    public void initList()
    {
         list = createRandomStrings(count, wordLength, new Random(0));
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    public void testMethod(Blackhole bh)
    {

        if (method.equals("toMap"))
        {
            Map<String, Integer> counts =
                list.stream().collect(
                    Collectors.toMap(
                        w -> w, w -> 1, Integer::sum));
            bh.consume(counts);
        }
        else if (method.equals("toConcurrentMap"))
        {
            Map<String, Integer> counts =
                list.parallelStream().collect(
                    Collectors.toConcurrentMap(
                        w -> w, w -> 1, Integer::sum));
            bh.consume(counts);
        }
        else if (method.equals("groupingBy"))
        {
            Map<String, Long> counts =
                list.stream().collect(
                    Collectors.groupingBy(
                        Function.identity(), Collectors.<String>counting()));
            bh.consume(counts);
        }
        else if (method.equals("groupingByConcurrent"))
        {
            Map<String, Long> counts =
                list.parallelStream().collect(
                    Collectors.groupingByConcurrent(
                        Function.identity(), Collectors.<String> counting()));
            bh.consume(counts);
        }
    }

    private static String createRandomString(int length, Random random)
    {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < length; i++)
        {
            int c = random.nextInt(26);
            sb.append((char) (c + 'a'));
        }
        return sb.toString();
    }

    private static List<String> createRandomStrings(
        int count, int length, Random random)
    {
        List<String> list = new ArrayList<String>(count);
        for (int i = 0; i < count; i++)
        {
            list.add(createRandomString(length, random));
        }
        return list;
    }
}

The times are only similar for the serial case of a list with 10000 elements, and 2-letter words.

It could be worthwhile to check whether for even larger list sizes, the concurrent versions eventually outperform the serial ones, but currently don't have the time for another detailed benchmark run with all these configurations.

answered Oct 24 '22 02:10

Marco13

Find most frequent item in collection, with generics:

private <V> V findMostFrequentItem(final Collection<V> items)
{
  return items.stream()
      .filter(Objects::nonNull)
      .collect(Collectors.groupingBy(Functions.identity(), Collectors.counting()))
      .entrySet()
      .stream()
      .max(Comparator.comparing(Entry::getValue))
      .map(Entry::getKey)
      .orElse(null);
}

Compute item frequencies:

private <V> Map<V, Long> findFrequencies(final Collection<V> items)
{
  return items.stream()
      .filter(Objects::nonNull)
      .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
}

answered Oct 24 '22 00:10

nejckorasa

Related questions
                            
                                Private enum constructor
                            
                                Converting File to MultiPartFile
                            
                                What's the difference between ".equals" and "=="? [duplicate]
                            
                                Why is UncaughtExceptionHandler not called by ExecutorService?
                            
                                Why do abstract classes in Java have constructors? [duplicate]
                            
                                How to chain multiple different InputStreams into one InputStream
                            
                                Classical set operations for java.util.Collection
                            
                                java.lang.NoClassDefFoundError: org/apache/commons/fileupload/FileItemFactory
                            
                                Compile error: package javax.servlet does not exist
                            
                                How to convert JSON string into List of Java object?
                            
                                Java how to call method by reflection with primitive types as arguments
                            
                                Split string on spaces in Java, except if between quotes (i.e. treat \"hello world\" as one token) [duplicate]
                            
                                Erlang Processes vs Java Threads
                            
                                NoClassDefFoundError: wrong name
                            
                                How could I split a String into an array in Kotlin?
                            
                                Android: how to parse URL String with spaces to URI object?
                            
                                Hibernate Slow to Acquire Postgres Connection
                            
                                Java enum with multiple value types
                            
                                find out the elements of an arraylist which is not present in another arraylist
                            
                                How to uninstall plugins in Eclipse 3.4.X or higher

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Word frequency count Java 8

Tags:

java

java-8

java-stream

word-count

Mouna

People also ask

3 Answers

Mouna

Marco13

nejckorasa

Recent Activity

Donate For Us