 

Java 8 stream and parallelStream

Suppose we have a collection like this:

Set<Set<Integer>> set = Collections.newSetFromMap(new ConcurrentHashMap<>());
for (int i = 0; i < 10; i++) {
    Set<Integer> subSet = Collections.newSetFromMap(new ConcurrentHashMap<>());
    subSet.add(1 + (i * 5));
    subSet.add(2 + (i * 5));
    subSet.add(3 + (i * 5));
    subSet.add(4 + (i * 5));
    subSet.add(5 + (i * 5));
    set.add(subSet);
}

and we want to process it in one of the following ways:

set.stream().forEach(subSet -> subSet.stream().forEach(System.out::println));

or

set.parallelStream().forEach(subSet -> subSet.stream().forEach(System.out::println));

or

set.stream().forEach(subSet -> subSet.parallelStream().forEach(System.out::println));

or

set.parallelStream().forEach(subSet -> subSet.parallelStream().forEach(System.out::println));

So, can someone please explain:

  • What is the difference between them?
  • Which one is better, faster, and safer?
  • Which one is good for huge collections?
  • Which one is good when we want to apply heavy processes to each item?
FaNaJ asked Dec 12 '14 16:12



1 Answer

What is the difference between them?

Think of it as two nested loops.

  • In the first case, there is no parallelism.
  • In the second case, the outer loop/collection is processed in parallel.
  • In the third case, the inner loop/collection is processed in parallel.
  • In the last case, you have a mixture of parallelism, which is likely to be more confusing than useful.

The fourth case isn't clear-cut: in reality there is only one thread pool (by default, parallel streams share the common ForkJoinPool), and if the pool is busy the current thread can be used, i.e. it might not be parallel^2 at all.
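To see which threads actually do the work in each of the four cases, a minimal sketch can record thread names instead of printing items. The class name `NestedParallelDemo` and the helpers `buildSet` and `threadsUsed` are hypothetical names introduced for illustration. The thread names you see depend on your JVM and CPU count, so no particular output is implied.

```java
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class NestedParallelDemo {

    // Builds the same 10 x 5 collection as in the question.
    static Set<Set<Integer>> buildSet() {
        Set<Set<Integer>> set = Collections.newSetFromMap(new ConcurrentHashMap<>());
        for (int i = 0; i < 10; i++) {
            Set<Integer> subSet = Collections.newSetFromMap(new ConcurrentHashMap<>());
            for (int j = 1; j <= 5; j++) {
                subSet.add(j + (i * 5));
            }
            set.add(subSet);
        }
        return set;
    }

    // Runs one of the four variants and returns the names of the threads
    // that processed at least one item.
    static Set<String> threadsUsed(Set<Set<Integer>> set,
                                   boolean outerParallel,
                                   boolean innerParallel) {
        Set<String> names = ConcurrentHashMap.newKeySet();
        (outerParallel ? set.parallelStream() : set.stream()).forEach(subSet ->
            (innerParallel ? subSet.parallelStream() : subSet.stream())
                .forEach(item -> names.add(Thread.currentThread().getName())));
        return names;
    }

    public static void main(String[] args) {
        Set<Set<Integer>> set = buildSet();
        System.out.println("stream / stream:     " + threadsUsed(set, false, false));
        System.out.println("parallel / stream:   " + threadsUsed(set, true, false));
        System.out.println("stream / parallel:   " + threadsUsed(set, false, true));
        System.out.println("parallel / parallel: " + threadsUsed(set, true, true));
    }
}
```

The first variant should report only the calling thread. For the others, the calling thread typically shows up alongside the pool's worker threads, which matches the point that the current thread can be used as well.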

Which one is better, faster, and safer?

The first one. However, using flatMap would be simpler still:

set.stream().flatMap(s -> s.stream()).forEach(System.out::println);

The other versions are more complicated, and since the console, which is the bottleneck, is a shared resource, the multi-threaded versions are likely to be slower.
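As a quick check that the flatMap form visits the same elements as the nested forEach, here is a minimal sketch that collects the items into a sorted list rather than printing them. The names `FlatMapDemo`, `buildSet`, `nested` and `flat` are hypothetical, chosen only for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

final class FlatMapDemo {

    // Builds the same 10 x 5 collection as in the question.
    static Set<Set<Integer>> buildSet() {
        Set<Set<Integer>> set = Collections.newSetFromMap(new ConcurrentHashMap<>());
        for (int i = 0; i < 10; i++) {
            Set<Integer> subSet = Collections.newSetFromMap(new ConcurrentHashMap<>());
            for (int j = 1; j <= 5; j++) {
                subSet.add(j + (i * 5));
            }
            set.add(subSet);
        }
        return set;
    }

    // First variant from the question: two nested sequential streams.
    static List<Integer> nested(Set<Set<Integer>> set) {
        List<Integer> out = new ArrayList<>();
        set.stream().forEach(subSet -> subSet.stream().forEach(out::add));
        Collections.sort(out);
        return out;
    }

    // Flattened variant: one stream over all inner elements.
    static List<Integer> flat(Set<Set<Integer>> set) {
        return set.stream()
                  .flatMap(s -> s.stream())
                  .sorted()
                  .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Set<Set<Integer>> set = buildSet();
        System.out.println(nested(set).equals(flat(set)));
    }
}
```

Sorting is only there to make the two results comparable, since the iteration order of these sets is not specified.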

Which one is good for huge collections?

Assuming your aim is to do something other than print, you want enough tasks to keep all your CPUs busy, but not so many that the task overhead dominates. The second option might be worth considering.

Which one is good when we want to apply heavy processes to each item?

Again, the second example might be best, or possibly the third if you have a small number of outer collections.
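For work other than printing, the second and third options might look roughly like the sketch below. It assumes the per-item work is side-effect free and returns a value that can be summed. `expensive` is a hypothetical stand-in for your real computation, and `HeavyWorkDemo` with its helpers is likewise made up for illustration. Which variant is faster depends on your data and hardware, so measure before choosing.

```java
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class HeavyWorkDemo {

    // Builds the same 10 x 5 collection as in the question.
    static Set<Set<Integer>> buildSet() {
        Set<Set<Integer>> set = Collections.newSetFromMap(new ConcurrentHashMap<>());
        for (int i = 0; i < 10; i++) {
            Set<Integer> subSet = Collections.newSetFromMap(new ConcurrentHashMap<>());
            for (int j = 1; j <= 5; j++) {
                subSet.add(j + (i * 5));
            }
            set.add(subSet);
        }
        return set;
    }

    // Hypothetical stand-in for a heavy, side-effect-free computation.
    static long expensive(int n) {
        long acc = n;
        for (int i = 0; i < 100_000; i++) {
            acc = (acc * 31 + i) % 1_000_003;
        }
        return acc;
    }

    // Baseline: no parallelism at all.
    static long sequential(Set<Set<Integer>> set) {
        return set.stream()
                  .mapToLong(subSet -> subSet.stream().mapToLong(HeavyWorkDemo::expensive).sum())
                  .sum();
    }

    // Second option: outer collection in parallel, inner one sequential.
    static long outerParallel(Set<Set<Integer>> set) {
        return set.parallelStream()
                  .mapToLong(subSet -> subSet.stream().mapToLong(HeavyWorkDemo::expensive).sum())
                  .sum();
    }

    // Third option: outer collection sequential, inner one in parallel.
    static long innerParallel(Set<Set<Integer>> set) {
        return set.stream()
                  .mapToLong(subSet -> subSet.parallelStream().mapToLong(HeavyWorkDemo::expensive).sum())
                  .sum();
    }

    public static void main(String[] args) {
        Set<Set<Integer>> set = buildSet();
        System.out.println(sequential(set) == outerParallel(set));
        System.out.println(sequential(set) == innerParallel(set));
    }
}
```

Returning a value from each task and combining with `sum()` avoids shared mutable state, so the parallel variants give the same result as the sequential one.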

Peter Lawrey answered Oct 25 '22 06:10
