I'm looking for an algorithm which to optimise the order of evaluations on a DAG, such that least memory is used. It might be a bit harder to explain, so I'll give an example what I mean. Consider that you have a DAG with multiple roots, which represents some form of dependency evaluation order. So each child node can perform its action only after its parents have performed. Additionally, we can release from memory each node which we do not need any more. The task is to find the optimal sequential schedule of evaluation such that least memory is used at any time. For instance consider the graph below:
And the two schedules:
load A - 1 node in memory
load B - 2
eval C - 3
eval D - 4
eval F - 5
unload C - 4
eval H - 5
unload A,F - 3
eval E - 4
eval G - 5
unload D,E - 3
eval I - 4
unload B,G - 2
eval J - 3
unload H,I
Maximum memory trace - 5
And this one:
load A - 1 node in memory
load B - 2
eval C - 3
eval D - 4
eval E - 5
eval F - 6
unload C - 5
eval G - 6
unload D,E - 4
eval H - 5
unload A,F - 3
eval I - 4
unload B,G - 2
eval J - 3
unload H,I - 1
unload C - 4
eval H - 5
unload A,F - 3
eval E - 4
eval G - 5
unload D,E - 3
eval I - 4
unload B,G - 2
eval J - 3
unload H,I
Maximum memory trace - 6
Assuming that all nodes occupy same memory, is there an algorithm which does this optimally? The first one is sort of like a Depth First traversal, while the second one is like Breath First traversal, but I'm not aware if those are optimal and why.
PS:
Just to clarify, as @Evgeny Kluev pointed out in his comment, this is very similar to Register Allocation, which can be solved using heuristic greedy graph colouring algorithms efficiently. However, register allocation is a simpler problem, since it assumes that you know the order of computation and thus can calculate the liveness of each variable. After this you can easily build the Inference graph and do the graph colouring. In our case we want to that as well as optimise the order of computation. This of course requires some assumptions, such as that we have no pointers and only basic data structures (which is what my Nodes represent). Obviously, since graph colouring is NP-complete, than this problem is at least NP-complete. What I'm looking for is for some sort of greedy/heuristic algorithm which to give a good solution on some not too degenerate cases.
Actually, in some sense this problem is easier than the graph coloring problem, because the decision version of this problem is a special case of the decision version of the Combined Register Allocation and Instruction Scheduling Problem (CRISP), which is simpler to solve than the graph coloring problem (at least when the exact solution is not required).
The decision version of this problem can be formulated as Whether a schedule using at most m memory slots exists?
Please note, that below I will use the terminology from the referenced article.
For each node v in your problem let's introduce virtual register rv and instruction xv in CRISP, the instruction writes to register rv and reads each register ru corresponding to parent node u of node v. Also, for each edge (u, v) of the DAG we introduce edge (xu, xv) in the dependency graph of CRISP. The execution time of each instruction equals to 1, the latency of each dependency edge equals to 0, and the spill cost equals to ∞. The number of available physical registers equals to m.
If there is a schedule in CRISP of the time length at most the number of nodes, then there is the corresponding schedule in the original problem that uses at most m memory slots. We are done.
The reduction presented above assumes that the memory used by a parent can be reused by the children when the parent is no longer needed. When it is not allowed, the following modifications are required:
Add one more instruction, yv, for each node v. Now, xv only writes rv, and yv reads each ru corresponding to parent u. Add dependency graph edge (xv, yv). Set the execution time of each instruction to 0.5. That's all.
Note that separation of register writing and reading between different instructions is required in order to prevent reuse of parent registers when it is not allowed.
The article referenced in the beginning describes an efficient algorithm to solve CRISP approximately. It can be used to solve this problem as well — just use the reduction above and try every m starting from 1 until a solution is found.
Notice also, that the algorithm is parameterized by two parameters: α and β, where α is the weight for controlling register pressure and β is the weight for controlling instruction parallelism. For this problem set α to 1 and β to 0, as the instruction parallelism is not needed for this problem.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With