
Java fork/join framework logic

This came up as a "side effect" to an answer on another question today. It's more about curiosity than about an actual problem.

Java SE 7 offers what Oracle calls "the fork/join framework". It is a presumably superior way to schedule work onto multiple processors. While I understand how it's supposed to work, I fail to see what makes it superior, and I don't understand the claims made about work stealing.

Maybe someone else has more insight on why this approach would be desirable (other than because it has a fancy name).

The underlying primitives of fork/join are ForkJoinTasks, which are Futures. The idea is to perform work "immediately" [sic] below a certain threshold (the wording is misleading, as "immediately" implies that it happens synchronously in the main thread; in reality it happens inside a Future), or to divide the work into two tasks recursively until the threshold is reached.

A future encapsulates a task that runs asynchronously into an object, in an opaque, unspecified way. You get a function that lets you check whether a result is available, and a function that lets you (wait for, and) retrieve the result.
Strictly speaking, you do not even know whether a future runs asynchronously; it could execute inside get(). In theory, the implementation could just as well spawn a thread for every future, or use a thread pool.
In practice, Java implements futures as tasks on a task queue, with a thread pool attached (the same is true for the entire fork/join framework).
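In plain Java, that contract looks like the following minimal sketch (the class name is mine; submit(), isDone(), and get() are the standard ExecutorService/Future API):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FutureSketch {
    static int compute() throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);

        // submit() wraps the computation in a Future backed by the pool's task queue.
        Future<Integer> future = pool.submit(() -> 6 * 7);

        boolean ready = future.isDone(); // non-blocking check; may be true or false here
        int result = future.get();       // blocks until the result is available

        pool.shutdown();
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(compute()); // prints 42
    }
}
```

Nothing in the Future interface itself promises where (or even whether) the work runs concurrently; that is entirely up to the executor behind it.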

The fork/join documentation gives this concrete usage example:

protected void compute() {
    if (mLength < sThreshold) {
        computeDirectly();
        return;
    }

    int split = mLength / 2;

    invokeAll(new ForkBlur(mSource, mStart, split, mDestination),
              new ForkBlur(mSource, mStart + split, mLength - split,
                           mDestination));
}

This submits tasks to the underlying thread pool's task queue in a manner identical to how Mergesort would traverse them (thanks to recursion).
Say, for example, that we have an array of 32 "items" to process, a threshold of 4, and an even split; this would produce 8 tasks of 4 "items" each and look like this:

00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
                                               .
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15|16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
                       .                       .                       .
00 01 02 03 04 05 06 07|08 09 10 11 12 13 14 15|16 17 18 19 20 21 22 23|24 25 26 27 28 29 30 31
           .           .           .           .           .           .           .
00 01 02 03|04 05 06 07|08 09 10 11|12 13 14 15|16 17 18 19|20 21 22 23|24 25 26 27|28 29 30 31
------------------------------------------------------------------------------------------------
     1     |     2     |     3     |     4     |     5     |     6     |     7     |     8     | 
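The recursion in that diagram can be reproduced with a self-contained RecursiveAction (a sketch of mine, not the ForkBlur code; the sum and the leaf-task counter exist only to make the 8 leaf tasks observable, and the leaf test uses <= so that 4-item chunks are processed directly):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

public class SplitSum extends RecursiveAction {
    static final int THRESHOLD = 4;
    static final AtomicInteger leafTasks = new AtomicInteger();
    static final AtomicLong total = new AtomicLong();

    final int[] data;
    final int start, length;

    SplitSum(int[] data, int start, int length) {
        this.data = data;
        this.start = start;
        this.length = length;
    }

    @Override
    protected void compute() {
        if (length <= THRESHOLD) {           // below threshold: do the work directly
            leafTasks.incrementAndGet();
            long sum = 0;
            for (int i = start; i < start + length; i++) sum += data[i];
            total.addAndGet(sum);
            return;
        }
        int split = length / 2;              // otherwise: split in two and recurse
        invokeAll(new SplitSum(data, start, split),
                  new SplitSum(data, start + split, length - split));
    }

    static long run(int n) {
        leafTasks.set(0);
        total.set(0);
        int[] items = new int[n];
        for (int i = 0; i < n; i++) items[i] = i;
        new ForkJoinPool().invoke(new SplitSum(items, 0, n));
        return total.get();
    }

    public static void main(String[] args) {
        System.out.println(run(32) + " from " + leafTasks.get() + " leaf tasks");
        // prints "496 from 8 leaf tasks"
    }
}
```

With 32 items and a threshold of 4, the recursion constructs 7 internal tasks before bottoming out in the 8 leaf tasks that actually touch the data.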

On a single-core processor, this will submit/execute (in a very complicated way) task groups 1-2-3-4-5-6-7-8 in order.
On a dual-core processor, this will submit/execute (1,3)-(2,4)-(5,7)-(6,8) [1].
On a quad-core processor, this will submit/execute (1,3,5,7)-(2,4,6,8).

In comparison, a naive implementation without all the superior magic would just submit the tasks 1-2-3-4-5-6-7-8 to the task queue right away. Always.

On a single-core processor, this would submit/execute 1-2-3-4-5-6-7-8.
On a dual-core processor, this would submit/execute (1,2)-(3,4)-(5,6)-(7,8).
On a quad-core processor, this would submit/execute (1,2,3,4)-(5,6,7,8).
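The naive version described above might look like this (again a sketch of mine: the same 32 items, but the 8 fixed-size tasks are submitted to a plain ExecutorService in order, with no recursion):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class NaiveSplit {
    static final int THRESHOLD = 4;

    static long run(int n) throws Exception {
        int[] items = new int[n];
        for (int i = 0; i < n; i++) items[i] = i;

        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());

        // Submit tasks 1..8 to the queue right away, in order. Always.
        // (Assumes n is a multiple of THRESHOLD, as in the example.)
        List<Future<Long>> futures = new ArrayList<>();
        for (int start = 0; start < n; start += THRESHOLD) {
            final int s = start;
            futures.add(pool.submit(() -> {
                long sum = 0;
                for (int i = s; i < s + THRESHOLD; i++) sum += items[i];
                return sum;
            }));
        }

        long total = 0;
        for (Future<Long> f : futures) total += f.get(); // collect in submission order
        pool.shutdown();
        return total;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(32)); // prints 496
    }
}
```

This produces exactly 8 task objects for 8 chunks of work, and the queue order matches the data order.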

Questions:

  1. Instead of simply cramming sThreshold consecutive items into one task and submitting one task after another to the thread pool's task queue, a tree-like recursion hierarchy is generated. This involves constructing, referencing, and destroying N+log2(N) objects for N sub-tasks that factually do nothing. Why is this superior?

  2. No locality of reference is preserved. Neither processor caches nor virtual memory like being treated that way. Why is this superior?

  3. Except on a uni-processor system, tasks are guaranteed not to be scheduled in an order that is anywhere close to their original order. This may be no issue if it really does not matter, but it makes things like e.g. a fence or barrier pretty much infeasible. The only way of having something like a fence is waiting for the root object to complete and only submitting new tasks after that. This is equivalent to a complete pipeline stall (which is exactly what you don't ever want to happen).

  4. The Oracle documentation claims that this approach implements work stealing and is therefore better than a thread pool. I don't see this happening. All I can see is a very complicated way of submitting tasks to a plain normal thread pool. How is this supposed to magically implement work stealing?


[1] Let's not make it too complicated; assume that worker threads don't outrun each other and that all tasks take the same time to process. Otherwise, execution might of course happen in a different order, though submission would be the same.

Damon asked Aug 31 '12 16:08
2 Answers

When you use an ExecutorService, you decide how many threads will be in the thread pool, and there is no distinction between the tasks you schedule and the subtasks those tasks create.
The ForkJoinPool class, by contrast, manages threads based on (1) the available processors and (2) task demand.
In this case, the subtasks created by the active tasks are scheduled by different methods than the external tasks.
We typically have one fork/join pool for an entire application (unlike ExecutorService, where it is typical to have more than one in any non-trivial application), and there is no need for shutdown.
I haven't reviewed the internals to give you a more low-level explanation, but if you look here there is a presentation and a benchmark with measurements displaying the parallelism that is promised.
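Since Java 8, that one application-wide pool is exposed directly as the common pool (a minimal sketch; by default its parallelism is the number of available processors minus one, with a floor of one, unless overridden via a system property):

```java
import java.util.concurrent.ForkJoinPool;

public class CommonPoolSketch {
    public static void main(String[] args) {
        // One shared pool per JVM; no explicit creation or shutdown needed.
        ForkJoinPool pool = ForkJoinPool.commonPool();
        System.out.println("parallelism: " + pool.getParallelism());
        // Sized from available processors unless overridden with
        // -Djava.util.concurrent.ForkJoinPool.common.parallelism=N
    }
}
```

Parallel streams and CompletableFuture default to this same pool, which is why one pool per application is the norm.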

Update:
This framework addresses a specific kind of problem (an ExecutorService works better for tasks that have a mix of CPU and I/O activity).
The basic thinking here is to use a recursive, divide-and-conquer approach to keep the CPUs constantly busy. The idea is to create new tasks (forking) and suspend the current task until the new tasks complete (joining), but without creating new threads and without a shared work queue.
So the fork/join framework is implemented using work stealing, with a limited number of worker threads (as many as there are cores). Each worker thread maintains a private double-ended work queue (deque).
When forking, a worker pushes the new task onto the head of its deque. When waiting or idle, a worker pops a task off the head of its deque and executes it instead of sleeping.
If a worker's deque is empty, it steals a task off the tail of the deque of another, randomly chosen worker.
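Those three rules can be sketched with plain deques (illustrative only and not thread-safe; the real pool uses lock-free array-based deques, and the class name here is made up):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class StealSketch {
    // One deque per worker thread.
    final List<Deque<Runnable>> deques = new ArrayList<>();

    StealSketch(int workers) {
        for (int i = 0; i < workers; i++) deques.add(new ArrayDeque<>());
    }

    // Rule 1: forking pushes the new task onto the head of the worker's own deque.
    void fork(int worker, Runnable task) {
        deques.get(worker).addFirst(task);
    }

    // Rules 2 and 3: pop from your own head; if empty, steal from another's tail.
    Runnable next(int worker) {
        Runnable task = deques.get(worker).pollFirst();   // LIFO on own deque
        if (task != null) return task;
        for (int victim = 0; victim < deques.size(); victim++) {
            if (victim == worker) continue;
            task = deques.get(victim).pollLast();         // FIFO steal from the tail
            if (task != null) return task;
        }
        return null;                                      // nothing left anywhere
    }
}
```

LIFO on the owner's side keeps freshly split, cache-warm subtasks local, while FIFO stealing takes the oldest (and therefore largest) pieces of work, which keeps steals infrequent.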
I would recommend reading Data Parallelism in Java and also doing some benchmarks yourself to be convinced. Theory is good only up to a point; after that, do your own measurements to see whether there is a significant performance edge or not.

Cratylus answered Nov 17 '22 18:11


Let me start with an article [yes, I wrote it] that critiques the framework: A Java Fork-Join Calamity.

Now to your questions:

  1. It's not. The framework wants to process a DAG. That's the design structure.

  2. It's not. As the article mentions, Java applications know nothing about caches, memory, etc., so the assumptions are erroneous.

  3. Yes. That is exactly what happens. Stalls are so common that the framework needs to create "continuation threads" to keep moving. The article references a question here where over 700 continuation threads were needed.

  4. I certainly agree that the code is complex. Scatter-gather works much better than work stealing for applications. As far as documentation goes, what documentation? There are no details from Oracle. It's all a push to use the framework.

There are alternatives.

edharned answered Nov 17 '22 17:11