I'm writing a decompiler for a simple scripting language. Here is what I've done:
Basic blocks
Created a collection of basic blocks as described here:
http://www.backerstreet.com/decompiler/basic_blocks.php
Control flow graph, dominator tree and loop sets
From this I could create the control flow graph.
http://www.backerstreet.com/decompiler/control_flow_graph.php
From the CFG I created the dominator tree, which in turn allowed me to find the loop sets in the CFG as described here:
http://www.backerstreet.com/decompiler/loop_analysis.php
Here is an image containing all of my data so far:
Structuring loops
My next step should be:
http://www.backerstreet.com/decompiler/creating_statements.php
And this is where my question comes in because I'm totally stuck. In my given data how would the structuring loops algorithm apply? I don't understand why it starts by trying to structure everything as a do while loop - in my example this means that "temp5_14 = temp5_14 + 16" in block 3 would always get executed at least once which is not what the original code does at all.
How could that work and how would the next pass of converting it from a do-while to a while loop actually work? For loop 3 that ends in block 6 this looks like it should be a while(true) - but how would this work with the algorithm when its head block is an if statement?
TL;DR - someone please explain with examples how the "structuring loops" algorithm actually works.
Structuring is the hardest part of decompiler development (at least for high level languages). That's a fairly simplistic algorithm, so it's a good starting point, but you'll likely want to use a better algorithm or make your own if you're working on a real decompiler.
With that out of the way, the answer to your actual question about how it can use do-while loops instead of while loops is already answered on the page you linked to.
Every loop can be described with a "do-while" statement.
The "while" loop (pre-tested loop) is a special case of a "do-while" loop, where the bottom condition is always true, and the first statement of the loop is an "if" that jumps out of the loop.
Say you had something like
beforeloop
while(foo) {
stmt1
stmt2
}
afterloop
It would be compiled to something along the lines of
beforeloop
LOOPBEGIN:
if !foo goto LOOPEND
stmt1
stmt2
goto LOOPBEGIN
LOOPEND:
afterloop
The decompiler algorithm converts this to
beforeloop
do {
if (!foo) {break}
stmt1
stmt2
} while (true)
afterloop
I hope that cleared that up. If not, feel free to ask about any other questions.
Edit: Example 2, showing how multiple loops with the same entry point can be collapsed.
for(;;) { while(foo) {} while(bar){} }
First off, for(;;)
is equivalent to while(true)
, so I'll use the following (pseudo)code instead
while(true) { while(foo) {stmt1} while(bar){stmt2} }
Let the outer loop be loop A and the inner loops be loop B and C. This compiles to something like the following pseudo assembly.
LOOP_A_BEGIN:
LOOP_B_BEGIN:
if !foo goto LOOP_B_END
stmt1
goto LOOP_B_BEGIN
LOOP_B_END:
LOOP_C_BEGIN:
if !bar goto LOOP_C_END
stmt2
goto LOOP_C_BEGIN
LOOP_C_END:
goto LOOP_A_BEGIN
But of course labels don't take up any space. So with identical labels collapsed, it becomes
POINT1:
if !foo goto POINT2
stmt1
goto POINT1
POINT2:
if !bar goto POINT3
stmt2
goto POINT2
POINT3
goto POINT1
Now, there are two points with backedges - point 1 and point 2. We can create one loop for each node, using labeled breaks for clarity. The transform isn't quite as straightforward since you have to mess with the if statements a bit, but it's still pretty easy.
LOOP1: while(true) {
IF1: if (!foo) {
break IF1;
}
else {
stmt1;
continue LOOP1;
}
LOOP2: while(true) {
if (!bar) {
break LOOP2;
}
else {
stmt2;
continue LOOP2;
}
}
continue LOOP1;
}
Now, the same code with unnecessary labels simplified out
while(true) {
if (!foo) {
}
else {
stmt1;
continue;
}
while(true) {
if (!bar) {
break;
}
else {
stmt2;
}
}
}
Now with if statements simplified
while(true) {
if (foo) {
stmt1;
continue;
}
while(true) {
if (!bar) {
break;
}
stmt2;
}
}
And finally, you can apply the while(true) if(!x)
transform to the inner loop. The outer loop can't be transformed like this, since it's not a simple while(cond) loop due to being the result of merged loops.
while(true) {
if (foo) {
stmt1;
continue;
}
while(bar) {
stmt2;
}
}
So hopefully, this demonstrates how you can always handle the case of multiple loops with the same entry point by merging them into a single loop, at the possible expense of having to rearrange some if statements too.
There's a paper called No More Gotos that suggests an algorithm for pattern-independent control flow structuring. It's a good chunk of work to understand and implement, but I'm using it for my senior project and it looks like it works.
My implementation is available on GitHub at zneak/fcd. The project uses LLVM IR as the intermediate representation.
EDIT At a very high level, this is how the algorithm works:
Just to be sure that these concepts are understood: a node A dominates a node Z if any path from the entry of the graph to Z has to go through A. Similarly, a node Z post-dominates a node A if any path from A to the end of the graph has to go through Z.
The algorithm structures regions. A region is a part of the graph that has a single entry edge and a single exit edge. I have personally extended this definition to basic blocks (regions have a single entry block and a single exit block), and made it so the exit block is excluded from the region.
The definition of a region that I use is:
The definition implies that the entry dominates every node through the exit node, and the exit node post-dominates every node from the entry.
A loop is a region with a "back edge" (a node with an edge going back to an already-visited node if you traversed your graph depth-first).
Make sure that loops are represented as single-entry single-exit regions with a single back-edge. That is, they should have only one entry node (to which the back edge also points) and a single successor. When it's not the case, you can introduce a new entry block and make all the edges point to it, then use a Φ-node to forward execution from there (in other words, introduce a variable that you set at the end of each incoming block, and do if (var == 0) { first entry } else if (var == 1) { second entry }
from the new block).
In my implementation, this happens in the StructurizeCFG pass, in the master branch, as of writing. However, it produces poor results because it works much harder than it needs to. I only need it to structurize loops, but it also structurizes if-else constructs, and while it doesn't break the algorithm, it introduces waaaay to many Φ-nodes to make for a pretty output. As of writing too, there's a branch called seseloop
with a custom pass to ensure that loops are single-entry single-exit. This pass doesn't touch if-else constructs if it doesn't need to.
Traverse the basic block graph in post-order. Identify regions starting at this block. You can use the post-dominator tree to speed this up a bit, as a region has to end with a block's post-dominator (so for each block, only check with the block's post-dominators).
If the entry block has a back-edge pointing to it, structurize it as a loop. If not, structurize it as a region. When a region has been structurized, put it back in your graph as a collapsed single node that can be included in another larger structurized region.
This happens in ast/ast_backend.cpp.
Use depth-first traversal (skipping cycles) on the region to identify the condition leading to the execution of any block. For instance:
Node A has no condition. Node B is reached if the condition at the end of node A is true. Node C is reached if it's false. Node D is reached if it was either true or false.
(A);
if (a_cond) (B);
if (!a_cond) (C);
if (a_cond || !a_cond) (D);
You then have to simplify these conditions, which is unfortunately a NP-complete problem. In general, it shouldn't be too hard to get back to something like A; if (a_cond) B; else C; D;
by collapsing a_cond || !a_cond
and comparing conditional terms in order.
You basically do the same thing as if you were structurizing a region without caring that it's a loop, but after that you add break statements (with conditions when relevant) at the end of blocks that can exit the loop and you wrap the region in a while true
block. Then, the authors have identified 6 patterns that can be replaced with more human-readable patterns (for instance, a while true
starting with an if (cond) break
can be transformed into a while !cond
).
And that's about it.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With