Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Intelligent solution to computing jump addresses in a bytecode compiler?

Let's say I'm implementing a bytecode compiler, similar to Lua's/Python's... so on.

I'm traversing the AST, generating bytecode instructions, and I run into a break inside of an if-else inside of a while loop:

while (cond1)
    if (cond2)
        ...
    else
        break

(I tried writing out equivalent bytecode but it didn't look too helpful.)

The point is, there are at least 4 "jump" instructions in that example, and I want to find an elegant solution to filling in the jump addresses as I compile the AST... I don't know the jump address for the while loop or the break until after I've fully "compiled" the inner statements.

  1. A pseudocode solution would be ideal
  2. A solution shouldn't depend on whether I'm implementing a register- or stack-based bytecode compiler (I'm playing with both)

I'm not reading the dragon book just yet.

If I'm recursively compiling the AST, when I reach a break statement inside of some arbitrary number of loops and if-else blocks, how should the compiler know which empty label to jump to? I'm guessing some type of label-name stack external to the recursive AST walking function.

like image 869
joe Avatar asked Oct 26 '12 20:10

joe


People also ask

What is bytecode compiler?

Bytecode compiler for the BASIC programming language (written in Haskell) Welcome to key, the programming language. An Interpreter written in Go that processes a psuedo JavaScript language. A Python bytecode compiler written in Python.

What are bytecode instructions?

Bytecode instructions are oriented towards the high-level programming language, from which they are compiled. Often this makes the bytecode so similar to the program it originated from, that it might be possible to reconstruct the complete source code from the bytecode.

How is bytecode interpreted in Java?

This approach lets each system interpret the same bytecode files. The bytecode itself is in a binary format that consists of constants, references and numeric codes. The Java virtual machine interprets bytecode and converts it to machine language that is platform-specific.

What happens to the source code when it is compiled?

When source code is compiled, it is eventually converted to IL bytecode rather than to machine instruction assembly code, as an additional step in the code compilation and execution process. The runtime compiler's responsibility is to convert source code from languages such as Java, C#, VB.NET and others to IL bytecode.


1 Answers

The principle you need is called "backpatching": fill in a dummy value for the forward jump, generate the code for the body of the statement, and then go back and replace the dummy value with the real one at the end.

e.g.

# start of codegen for 'while'
L1:
  [evaluate cond1]
  jne L2   # forward reference: use a dummy value for now

# start of codegen for 'if ... else'
L3:
  [evaluate cond2]
  jne L4   # another forward reference: use a dummy value for now
  [code in 'if' branch]
  j L5     # another forward reference: use a dummy value for now
L4:
  [code in 'else' branch before 'break']
  j L2
  [code in 'else' branch after 'break']
L5:   # end of 'if ... else'
  # now go back and fill in references to L4, L5 inside the 'if ... else' block
  # end of codegen for 'if ... else'

  # codegen for 'while' continues...
  j L1   # loop
L2:   # end of 'while' loop
  # now go back and fill in references to L2 inside the 'while' block
  # end of codegen for 'while'

Regarding your edit:

If I'm recursively compiling the AST, when I reach a break statement inside of some arbitrary number of loops and if-else blocks, how should the compiler know which empty label to jump to? I'm guessing some type of label-name stack external to the recursive AST walking function.

The target of the jump which implements the break statement is the end of the innermost enclosing loop; yes, you could track that with an explicit external stack.

But, if you have a recursive AST walking function, you already have an implicit stack - the call frames of the recursive function invocations - so you probably don't need an explicit one as well.

e.g. in

...
while (outer)
    ...
    if (outer_done)
        break
    ...

    while (inner)
        ...
        if (inner_done)
            break
        ...
    [break from inner 'while' ends up here]

    ...
    if (outer_done_2)
        break
    ...
[break from outer 'while' ends up here]
...

the entirety of the code generation for the inner while loop happens within a recursive walk of the AST for the body of the outer while loop. Inside this recursive call, you only need to care about the inner loop, not the outer one.

So, all you need to do is:

  • save any current backpatching state at the start of processing a while
  • start a new empty list for tracking any break that appears within the body of this while
  • handle the body (which may result in a recursive call)
  • apply the backpatches
  • then restore the previous state before exiting.

e.g. something a bit like this:

codegen for while:
    save previous break backpatch list
    initialise break backpatch list as empty
    perform codegen for evaluating condition
    perform codegen for body statements
    apply backpatches
    restore previous break backpatch list

The current break backpatch list must be part of some state which is passed to all the codegen functions (the codegen for break needs to append to it). But the saved list could just be tracked as a local variable of the while codegen function.

like image 161
Matthew Slattery Avatar answered Jan 02 '23 11:01

Matthew Slattery