 

Understanding the main loop in Streams API's ForEachTask

It seems that the centerpiece of Java Streams' parallelization is the ForEachTask. Understanding its logic appears to be essential to acquiring the mental model necessary to anticipate the concurrent behavior of client code written against the Streams API. Yet I find my anticipations contradicted by the actual behavior.

For reference, here is the key compute() method (java/util/stream/ForEachOps.java:253):

public void compute() {
  Spliterator<S> rightSplit = spliterator, leftSplit;
  long sizeEstimate = rightSplit.estimateSize(), sizeThreshold;
  if ((sizeThreshold = targetSize) == 0L)
    targetSize = sizeThreshold = AbstractTask.suggestTargetSize(sizeEstimate);
  boolean isShortCircuit = StreamOpFlag.SHORT_CIRCUIT.isKnown(helper.getStreamAndOpFlags());
  boolean forkRight = false;
  Sink<S> taskSink = sink;
  ForEachTask<S, T> task = this;
  while (!isShortCircuit || !taskSink.cancellationRequested()) {
    if (sizeEstimate <= sizeThreshold ||
        (leftSplit = rightSplit.trySplit()) == null) {
      task.helper.copyInto(taskSink, rightSplit);
      break;
    }
    ForEachTask<S, T> leftTask = new ForEachTask<>(task, leftSplit);
    task.addToPendingCount(1);
    ForEachTask<S, T> taskToFork;
    if (forkRight) {
      forkRight = false;
      rightSplit = leftSplit;
      taskToFork = task;
      task = leftTask;
    }
    else {
      forkRight = true;
      taskToFork = leftTask;
    }
    taskToFork.fork();
    sizeEstimate = rightSplit.estimateSize();
  }
  task.spliterator = null;
  task.propagateCompletion();
}

At a high level, the main loop keeps splitting the spliterator, alternately forking off one chunk and processing the other inline, until the spliterator refuses to split further or the remaining size estimate falls below the computed threshold.
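The alternation can be traced with a small standalone sketch (the class and method names here are hypothetical, not part of the JDK): for a sized source whose trySplit() yields equal halves, it records which half is forked at each iteration and what is finally processed inline, mirroring the forkRight flag in compute().

```java
import java.util.ArrayList;
import java.util.List;

public class SplitTraceSketch {
    // Trace one compute() over a SIZED source that trySplit()s into equal halves.
    // Each iteration forks one half (alternating left/right) and keeps the other,
    // until the retained portion is at or below the target size and runs inline.
    static List<String> trace(long size, long threshold) {
        List<String> events = new ArrayList<>();
        boolean forkRight = false;
        while (size > threshold) {
            long half = size / 2;
            if (forkRight) {
                events.add("fork right " + half); // hand the rest to another worker
                forkRight = false;
            } else {
                events.add("fork left " + half);  // hand the new chunk to another worker
                forkRight = true;
            }
            size = half; // continue on the retained half
        }
        events.add("inline " + size);
        return events;
    }

    public static void main(String[] args) {
        trace(16, 4).forEach(System.out::println);
    }
}
```

Running this for a source of 16 elements with threshold 4 shows one left fork, one right fork, then inline processing of the final chunk.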

Now consider the above algorithm in the case of unsized streams, where the whole is not split into roughly equal halves; instead, chunks of a predetermined size are repeatedly taken from the head of the stream. In this case the estimated size is Long.MAX_VALUE, so the "suggested target size" is astronomically large, which means the chunks are never re-split into smaller ones.
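To see why the threshold dwarfs any fixed batch size, here is a paraphrase of AbstractTask.suggestTargetSize as it appears in the JDK 8 sources (the real code derives the divisor from a LEAF_TARGET constant based on the common pool's parallelism; the standalone method below is a sketch):

```java
public class TargetSizeSketch {
    // Paraphrase of AbstractTask.suggestTargetSize: aim for roughly four leaf
    // tasks per worker, i.e. threshold = estimate / (parallelism * 4), never
    // less than 1.
    static long suggestTargetSize(long sizeEstimate, int parallelism) {
        long est = sizeEstimate / (parallelism << 2);
        return est > 0L ? est : 1L;
    }

    public static void main(String[] args) {
        // A sized source of 1,000,000 elements on 4 cores: threshold 62,500.
        System.out.println(suggestTargetSize(1_000_000L, 4));
        // An unsized source reports Long.MAX_VALUE, so the threshold is so
        // large that a 200-element batch can never fall above it and be re-split.
        System.out.println(suggestTargetSize(Long.MAX_VALUE, 4));
    }
}
```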

The algorithm would therefore appear to alternately fork off one chunk, then process one inline. If each chunk takes the same time to process, this should result in no more than two cores being used. However, the actual behavior is that all four cores on my machine are occupied. Obviously, I am missing an important piece of the puzzle with that algorithm.

What is it that I'm missing?


Appendix: test code

Here is a piece of self-contained code which may be used to test the behavior which is the subject of this question:

package test;

import static java.util.concurrent.TimeUnit.NANOSECONDS;
import static java.util.concurrent.TimeUnit.SECONDS;
import static test.FixedBatchSpliteratorWrapper.withFixedSplits;

import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;

public class Parallelization {
  static final AtomicLong totalTime = new AtomicLong();
  static final ExecutorService pool = Executors.newFixedThreadPool(4);

  public static void main(String[] args) throws IOException {
    final long start = System.nanoTime();
    final Path inputPath = createInput();
    System.out.println("Start processing");
    try (PrintWriter w = new PrintWriter(Files.newBufferedWriter(Paths.get("output.txt")))) {
      withFixedSplits(Files.newBufferedReader(inputPath).lines(), 200).map(Parallelization::processLine)
      .forEach(w::println);
    }
    final double cpuTime = totalTime.get(), realTime = System.nanoTime() - start;
    final int cores = Runtime.getRuntime().availableProcessors();
    System.out.println("          Cores: " + cores);
    System.out.format("       CPU time: %.2f s\n", cpuTime / SECONDS.toNanos(1));
    System.out.format("      Real time: %.2f s\n", realTime / SECONDS.toNanos(1));
    System.out.format("CPU utilization: %.2f%%", 100.0 * cpuTime / realTime / cores);
  }
  private static String processLine(String line) {
    final long localStart = System.nanoTime();
    double ret = 0;
    for (int i = 0; i < line.length(); i++)
      for (int j = 0; j < line.length(); j++)
        ret += Math.pow(line.charAt(i), line.charAt(j) / 32.0);
    final long took = System.nanoTime() - localStart;
    totalTime.getAndAdd(took);
    return NANOSECONDS.toMillis(took) + " " + ret;
  }
  private static Path createInput() throws IOException {
    final Path inputPath = Paths.get("input.txt");
    try (PrintWriter w = new PrintWriter(Files.newBufferedWriter(inputPath))) {
      for (int i = 0; i < 6_000; i++) {
        final String text = String.valueOf(System.nanoTime());
        for (int j = 0; j < 20; j++)
          w.print(text);
        w.println();
      }
    }
    return inputPath;
  }
}

package test;

import static java.util.Spliterators.spliterator;
import static java.util.stream.StreamSupport.stream;

import java.util.Comparator;
import java.util.Spliterator;
import java.util.function.Consumer;
import java.util.stream.Stream;

public class FixedBatchSpliteratorWrapper<T> implements Spliterator<T> {
  private final Spliterator<T> spliterator;
  private final int batchSize;
  private final int characteristics;
  private long est;

  public FixedBatchSpliteratorWrapper(Spliterator<T> toWrap, long est, int batchSize) {
    final int c = toWrap.characteristics();
    this.characteristics = (c & SIZED) != 0 ? c | SUBSIZED : c;
    this.spliterator = toWrap;
    this.batchSize = batchSize;
    this.est = est;
  }
  public FixedBatchSpliteratorWrapper(Spliterator<T> toWrap, int batchSize) {
    this(toWrap, toWrap.estimateSize(), batchSize);
  }

  public static <T> Stream<T> withFixedSplits(Stream<T> in, int batchSize) {
    return stream(new FixedBatchSpliteratorWrapper<>(in.spliterator(), batchSize), true);
  }

  @Override public Spliterator<T> trySplit() {
    final HoldingConsumer<T> holder = new HoldingConsumer<>();
    if (!spliterator.tryAdvance(holder)) return null;
    final Object[] a = new Object[batchSize];
    int j = 0;
    do a[j] = holder.value; while (++j < batchSize && tryAdvance(holder));
    if (est != Long.MAX_VALUE) est -= j;
    return spliterator(a, 0, j, characteristics());
  }
  @Override public boolean tryAdvance(Consumer<? super T> action) {
    return spliterator.tryAdvance(action);
  }
  @Override public void forEachRemaining(Consumer<? super T> action) {
    spliterator.forEachRemaining(action);
  }
  @Override public Comparator<? super T> getComparator() {
    if (hasCharacteristics(SORTED)) return null;
    throw new IllegalStateException();
  }
  @Override public long estimateSize() { return est; }
  @Override public int characteristics() { return characteristics; }

  static final class HoldingConsumer<T> implements Consumer<T> {
    Object value;
    @Override public void accept(T value) { this.value = value; }
  }
}
asked Apr 21 '14 by Marko Topolnik

1 Answer

Ironically, the answer is almost stated in the question: as the "left" and "right" tasks take turns being forked vs. processed inline, half of the time the right task, represented by this (i.e. the complete rest of the stream), is the one forked off. That means the forking off of chunks is merely slowed down a bit (it happens on every other iteration rather than on every one), but it clearly still happens: each forked-off "rest" task runs the same loop on another worker thread, splitting off and forking further chunks of its own, so chunk-processing tasks keep accumulating until all workers are busy.
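The fan-out can be traced with a small sketch (names hypothetical): each "generation" is one compute() invocation over the unsized source, which forks one batch, then on the next iteration forks the rest and processes its own batch inline; the forked rest becomes the next generation on another worker.

```java
import java.util.ArrayList;
import java.util.List;

public class ForkTraceSketch {
    // Trace compute() invocations over an unsized source that trySplit()s off
    // fixed batches. Generation g forks one batch, then forks the rest and
    // processes its second batch inline; the forked rest repeats the pattern
    // concurrently as generation g + 1.
    static List<String> trace(int batchesAvailable) {
        List<String> events = new ArrayList<>();
        int batch = 0, generation = 0;
        while (batch + 1 < batchesAvailable) {
            events.add("gen " + generation + ": fork batch " + batch++);
            events.add("gen " + generation + ": fork rest, inline batch " + batch++);
            generation++; // the forked rest continues in a fresh worker
        }
        return events;
    }

    public static void main(String[] args) {
        trace(6).forEach(System.out::println);
    }
}
```

Since each generation runs concurrently with the ones it spawned, two new batch-processing tasks become available per handoff, which is how more than two cores end up occupied.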

answered Oct 31 '22 by Marko Topolnik