
Java Stream API: why the distinction between sequential and parallel execution mode?

From the Stream javadoc:

Stream pipelines may execute either sequentially or in parallel. This execution mode is a property of the stream. Streams are created with an initial choice of sequential or parallel execution.
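For concreteness, here is a minimal sketch of what "execution mode is a property of the stream" looks like in code. The class name `ExecutionModeDemo` and the helper `sum` are made up for illustration; `stream()`, `parallel()`, `sequential()` and `isParallel()` are the standard `java.util.stream` calls.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

public class ExecutionModeDemo {

    // Hypothetical helper: sums a stream of integers, whatever its mode.
    static int sum(Stream<Integer> stream) {
        return stream.mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5);

        // Collection.stream() starts out in sequential mode.
        System.out.println(numbers.stream().isParallel());                         // false

        // parallel() flips the mode flag on the same pipeline.
        System.out.println(numbers.stream().parallel().isParallel());              // true

        // sequential() flips it back.
        System.out.println(numbers.stream().parallel().sequential().isParallel()); // false

        // For an associative reduction like this sum, both modes give the same result.
        System.out.println(sum(numbers.stream()));            // 15
        System.out.println(sum(numbers.stream().parallel())); // 15
    }
}
```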

My assumptions:

  1. There is no functional difference between sequential and parallel streams. Output is never affected by execution mode.
  2. A parallel stream is always preferable, given an appropriate number of cores and a problem size large enough to justify the overhead, due to the performance gains.
  3. We want to write code once and run anywhere without having to care about the hardware (this is Java, after all).

Assuming these assumptions are valid (nothing wrong with a bit of meta-assumption), what's the value in having the execution mode exposed in the API?

It seems like you should just be able to declare a Stream, and the choice of sequential/parallel execution should be handled automagically in a layer below, either by library code or the JVM itself as a function of the cores available at runtime, the size of the problem, etc.

Sure, assuming parallel streams also work on a single core machine, perhaps just always using a parallel stream achieves this. But this is really ugly - why have explicit references to parallel streams in my code when it's the default option?

Even if there is a scenario where you'd deliberately want to hard code the use of a sequential stream - why is there not just a sub-interface SequentialStream for that purpose, rather than polluting Stream with an execution mode switch?

davnicwil asked Apr 09 '14 00:04


1 Answer

It seems like you should just be able to declare a Stream, and the choice of sequential/parallel execution should be handled automagically in a layer below, either by library code or the JVM itself as a function of the cores available at runtime, the size of the problem, etc.

The reality is that a) streams are a library, and have no special JVM magic, and b) you can't really design a library smart enough to automagically figure out what the right decision is in this particular case. There's no sensible way to estimate how costly a particular function will be without running it -- even if you could introspect its implementation, which you can't -- and now you're introducing a benchmark into every stream operation, trying to figure out if parallelizing it will be worth the cost of the parallelism overhead. That's just not practical, especially given that you don't know in advance how bad the parallelism overhead is, either.

A parallel stream is always preferable, given an appropriate number of cores and a problem size large enough to justify the overhead, due to the performance gains.

Not always, in practice. Some tasks are just so small that they're not worth parallelizing, and parallelism does always have some overhead. (And frankly, most programmers tend to overestimate the usefulness of parallelism, slapping it everywhere when it's really hurting performance.)
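If you want to see this for yourself, here is a rough sketch that runs the same tiny computation in both modes and prints how long each took. The class and method names (`SmallTaskTiming`, `sumOfSquares`, `timeNanos`) are invented for this example, and the timing is deliberately naive: a proper harness such as JMH would be needed for numbers you could trust, and the result will vary by machine.

```java
import java.util.stream.IntStream;

public class SmallTaskTiming {

    // Sums the squares of 1..n, sequentially or in parallel.
    static int sumOfSquares(int n, boolean parallel) {
        IntStream range = IntStream.rangeClosed(1, n);
        if (parallel) {
            range = range.parallel();
        }
        return range.map(i -> i * i).sum();
    }

    // Naive wall-clock timing of a single run, in nanoseconds.
    static long timeNanos(int n, boolean parallel) {
        long start = System.nanoTime();
        sumOfSquares(n, parallel);
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int n = 100; // deliberately tiny workload

        // Crude warm-up so the JIT has a chance to compile both paths.
        for (int i = 0; i < 1_000; i++) {
            sumOfSquares(n, false);
            sumOfSquares(n, true);
        }

        System.out.println("sequential ns: " + timeNanos(n, false));
        System.out.println("parallel ns:   " + timeNanos(n, true));
    }
}
```

Both calls return the same sum; only the cost of getting there differs, and for a workload this small the fork/join bookkeeping can easily outweigh the work itself.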

Basically, it's a hard enough problem that you have to shove it off onto the programmer.
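In practice, "shoving it off onto the programmer" can be as simple as the sketch below. Everything here is hypothetical: `ModeChoice`, `streamFor` and the threshold value are made up for illustration, and the right cutoff (if size is even the right criterion) depends on how expensive your per-element work is, which only you can measure.

```java
import java.util.List;
import java.util.stream.Stream;

public class ModeChoice {

    // Hypothetical cutoff; an assumption for this sketch, not a recommended value.
    static final int PARALLEL_THRESHOLD = 10_000;

    // Picks the execution mode explicitly, based on what the caller knows about the data.
    static <T> Stream<T> streamFor(List<T> items) {
        return items.size() >= PARALLEL_THRESHOLD
                ? items.stream().parallel()
                : items.stream();
    }
}
```

The rest of the pipeline is written once against `Stream<T>` and does not care which mode was chosen.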

Louis Wasserman answered Oct 08 '22 06:10
