Environment: Ubuntu x86_64 (14.10), Oracle JDK 1.8u25
I am trying to use a parallel stream over Files.lines(), but I want to .skip() the first line (it's a CSV file with a header). So I tried this:
    try (
        final Stream<String> stream = Files.lines(thePath, StandardCharsets.UTF_8)
            .skip(1L).parallel();
    ) {
        // etc
    }
But then one column failed to parse to an int...
So I tried some simple code. The file in question is dead simple:
    $ cat info.csv
    startDate;treeDepth;nrMatchers;nrLines;nrChars;nrCodePoints;nrNodes
    1422758875023;34;54;151;4375;4375;27486
    $
And the code is equally simple:
    public static void main(final String... args)
        throws IOException
    {
        final Path path = Paths.get("/home/fge/tmp/dd/info.csv");
        Files.lines(path, StandardCharsets.UTF_8).skip(1L).parallel()
            .forEach(System.out::println);
    }
And I systematically get the following result (OK, I have only run it something around 20 times):
startDate;treeDepth;nrMatchers;nrLines;nrChars;nrCodePoints;nrNodes
What am I missing here?
EDIT It seems like the problem, or misunderstanding, is much more rooted than that (the two examples below were cooked up by a fellow on FreeNode's ##java):
    public static void main(final String... args)
    {
        new BufferedReader(new StringReader("Hello\nWorld")).lines()
            .skip(1L).parallel()
            .forEach(System.out::println);

        final Iterator<String> iter
            = Arrays.asList("Hello", "World").iterator();
        final Spliterator<String> spliterator
            = Spliterators.spliteratorUnknownSize(iter, Spliterator.ORDERED);
        final Stream<String> s
            = StreamSupport.stream(spliterator, true);

        s.skip(1L).forEach(System.out::println);
    }
This prints:
    Hello
    Hello
Uh.
@Holger suggested that this happens for any stream which is ORDERED and not SIZED, with this other sample:
    Stream.of("Hello", "World")
        .filter(x -> true)
        .parallel()
        .skip(1L)
        .forEach(System.out::println);
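To make the "ORDERED and not SIZED" condition concrete, here is a small sketch that prints the two characteristics for the three kinds of source used above. The class name CharacteristicsDemo and the helper describe() are made up for illustration; only the standard Spliterator API is used.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.Spliterator;
import java.util.stream.Stream;

class CharacteristicsDemo {
    // Hypothetical helper: reports whether the stream's spliterator is ORDERED and/or SIZED.
    public static String describe(Stream<String> stream) {
        Spliterator<String> sp = stream.spliterator();
        return "ORDERED=" + sp.hasCharacteristics(Spliterator.ORDERED)
            + ", SIZED=" + sp.hasCharacteristics(Spliterator.SIZED);
    }

    public static void main(String[] args) {
        // Array-backed source: the element count is known up front.
        System.out.println("Stream.of:      "
            + describe(Stream.of("Hello", "World")));
        // filter() cannot know how many elements survive, so SIZED is lost.
        System.out.println("after filter:   "
            + describe(Stream.of("Hello", "World").filter(x -> true)));
        // Line-based sources have no known size either.
        System.out.println("reader.lines(): "
            + describe(new BufferedReader(new StringReader("Hello\nWorld")).lines()));
    }
}
```

The last two cases should report ORDERED=true, SIZED=false, which matches the streams where the surprising output was observed.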
Also, it follows from all the discussion which has already taken place that the problem (if it is one?) is with .forEach() (as @SotiriosDelimanolis first pointed out).
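If the unordered .forEach() is indeed the trigger, one possible way around it is to use a terminal operation that respects encounter order. The sketch below is only an illustration under that assumption: the class SkipHeaderDemo, the helper withoutHeader() and the shortened CSV text are invented for the example, and it uses forEachOrdered() and a non-UNORDERED collector instead of forEach().

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.List;
import java.util.stream.Collectors;

class SkipHeaderDemo {
    // Hypothetical helper: returns every line after the first, in encounter order.
    public static List<String> withoutHeader(String text) {
        return new BufferedReader(new StringReader(text)).lines()
            .skip(1L)
            .parallel()
            // Collectors.toList() is not an UNORDERED collector, so order is kept.
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Shortened stand-in for the CSV file from the question.
        String csv = "startDate;treeDepth\n1422758875023;34\n";

        // forEachOrdered() processes elements in encounter order, even in parallel.
        new BufferedReader(new StringReader(csv)).lines()
            .skip(1L)
            .parallel()
            .forEachOrdered(System.out::println);

        System.out.println(withoutHeader(csv));
    }
}
```

Both variants should drop the header line; the price is that forEachOrdered() gives up some of the parallelism that forEach() allows.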
Similarly, don't use parallel if the stream is ordered and has many more elements than you want to process. Such a stream may run much longer, because the parallel threads may work on plenty of number ranges instead of the crucial one, 0-100.
stream() works in sequence on a single thread with the println() operation. list.parallelStream(), on the other hand, is processed in parallel, taking full advantage of the underlying multicore environment. The interesting aspect is in the output of the preceding program.
Parallel streams provide the capability of parallel processing over collections that are not thread-safe. It is, however, required that the collection is not modified during the parallel processing.
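As a rough illustration of that rule, the sketch below reads from a plain ArrayList (which is not thread-safe) in parallel, but never modifies it; the results go into a new list built by the collector. The class NonInterferenceDemo and the method doubled() are names invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class NonInterferenceDemo {
    // Hypothetical helper: the source list is only read, never modified.
    public static List<Integer> doubled(List<Integer> source) {
        return source.parallelStream()
            .map(n -> n * 2)
            // The collector builds a fresh result list and merges partial results safely.
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> source = new ArrayList<>(Arrays.asList(1, 2, 3, 4, 5));
        System.out.println(doubled(source));
        System.out.println(source); // unchanged
    }
}
```

Adding to or removing from source inside the lambda would break that rule and is the kind of interference to avoid.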
When a stream executes in parallel, the Java runtime partitions the stream into multiple substreams. Aggregate operations iterate over and process these substreams in parallel and then combine the results. When you create a stream, it is always a serial stream unless otherwise specified.
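A quick way to see the "serial unless otherwise specified" part is to ask the stream itself via isParallel(). This is a minimal sketch; the class SerialByDefaultDemo and its method names are made up for the example.

```java
import java.util.Arrays;
import java.util.List;

class SerialByDefaultDemo {
    // A stream obtained from stream() is serial unless parallel() is requested.
    public static boolean isParallelByDefault(List<Integer> list) {
        return list.stream().isParallel();
    }

    public static boolean isParallelAfterRequest(List<Integer> list) {
        return list.stream().parallel().isParallel();
    }

    // The partial sums computed on the substreams are combined into one result.
    public static int parallelSum(List<Integer> list) {
        return list.parallelStream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5);
        System.out.println("stream():            " + isParallelByDefault(numbers));
        System.out.println("stream().parallel(): " + isParallelAfterRequest(numbers));
        System.out.println("parallel sum:        " + parallelSum(numbers));
    }
}
```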
Since the current state of the issue is quite the opposite of the earlier statements made here, it should be noted that there is now an explicit statement by Brian Goetz that the back-propagation of the unordered characteristic past a skip operation is considered a bug. It’s also stated that the intention now is to have no back-propagation of the ordered-ness of a terminal operation at all.
There is also a related bug report, JDK-8129120, whose status is “fixed in Java 9” and which has been backported to Java 8, update 60.
I did some tests with jdk1.8.0_60 and it seems that the implementation now indeed exhibits the more intuitive behavior.
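For anyone who wants to repeat that kind of test on their own JDK, here is a small sketch. It is not the test I ran, just an approximation: the class SkipThenParallelCheck and the method run() are invented names, and the elements seen by forEach() are gathered in a thread-safe queue so they can be inspected afterwards.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

class SkipThenParallelCheck {
    // Hypothetical check: collects whatever forEach() sees after skip(1L).parallel().
    public static List<String> run() {
        // Thread-safe, because forEach() may call the consumer from several threads.
        ConcurrentLinkedQueue<String> seen = new ConcurrentLinkedQueue<>();
        new BufferedReader(new StringReader("Hello\nWorld")).lines()
            .skip(1L)
            .parallel()
            .forEach(seen::add);
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // Print the runtime version next to the result, since the outcome depends on it.
        System.out.println(System.getProperty("java.version") + " -> " + run());
    }
}
```

On a JDK that contains the fix this should report only World; on 1.8u25, as described in the question, Hello may show up instead.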