
Why does emulation have to be done in real time?

I understand what an emulator does, translating one machine language into another, often "just-in-time." Could a program be built that reads in a binary written for one architecture and saves out a new binary for another architecture? After the process completes, it would leave the user with a binary ready for native execution on the target architecture. This would be especially useful for those who have expensive proprietary applications for a legacy architecture.

Is it possible to make such an application? Binary recompilation is not a new concept, but I have yet to find any useful implementations of it.

With the help of some others, I would be thrilled to start coding an open source implementation of such a program, if writing one is possible.

asked Jun 26 '12 by user1483857
1 Answer

Static recompilation is a promising way of translating binaries from one architecture to another. It would be faster than Just-In-Time (JIT) compilation, because it doesn't have to translate code right before running it, and because the extra time available can be spent optimizing the generated code.

However, JIT compilation can use dynamic program analysis, while static recompilation must rely on static program analysis (hence the name).

In static analysis, you don't have runtime information on an execution.

A major problem here is posed by indirect jumps. The term covers code that might be generated from certain switch statements, from the use of function pointers, or from runtime polymorphism (think virtual tables). It all boils down to an instruction of the form:

JMP reg_A 

Let's say you know the start address of your program, and you decide to start recompiling instructions from that point. When you encounter a direct jump, you go to its target address and continue the recompilation from there. When you encounter an indirect jump, though, you're stuck: in the assembly instruction above, the content of reg_A is not known statically, so we don't know the address of the next instruction. Note that dynamic recompilation doesn't have this problem, because it emulates the virtual state of the registers and knows the current content of reg_A.

There's a further asymmetry. In static recompilation, you want to find all possible values of reg_A at this point, because you want all possible paths compiled. In dynamic analysis, you only need the current value to generate the path you're currently executing; should reg_A later take another value, you can still generate the other paths then. In some cases static analysis can find a list of candidates (if the jump comes from a switch, there must be a table of possible offsets somewhere), but in the general case we simply don't know.
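To make the asymmetry concrete, here is a minimal sketch (all names and the toy instruction set are invented for illustration): a dynamic emulator resolves the indirect jump because the register file is concrete at that moment, while a static walk over the same program has to stop.

```python
def run_dynamic(program, start):
    """Emulate instructions one at a time; register contents are concrete."""
    regs = {"A": 0}
    pc = start
    trace = []
    while pc is not None:
        op, arg = program[pc]
        trace.append(pc)
        if op == "LOAD":           # LOAD A, imm
            regs["A"] = arg
            pc += 1
        elif op == "JMP_REG":      # JMP reg_A -- target is regs["A"], known NOW
            pc = regs["A"]
        elif op == "HALT":
            pc = None
    return trace

def scan_static(program, start):
    """Follow control flow without executing; stop at the indirect jump."""
    pc = start
    seen = []
    while True:
        op, arg = program[pc]
        seen.append(pc)
        if op == "JMP_REG":
            # reg_A's value is a runtime quantity; statically we only know
            # "it could be anything", so the walk cannot continue.
            return seen, "stuck: indirect jump, target unknown"
        if op == "HALT":
            return seen, "done"
        pc += 1

program = {
    0: ("LOAD", 3),        # A = 3
    1: ("JMP_REG", None),  # jump to the address held in A
    3: ("HALT", None),
}

print(run_dynamic(program, 0))   # the emulator lands on address 3
print(scan_static(program, 0))   # the static walk stops at address 1
```

The dynamic trace reaches the HALT at address 3; the static scan gives up at address 1, exactly the situation described above.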

Fine, you say, let's recompile all instructions in the binary then!

The problem here is that most binaries contain both code and data. Depending on the architecture, you might not be able to tell which is which.

Worse, some architectures have no alignment constraints and use variable-width instructions, so you may start to disassemble at some point, only to discover that you've begun your recompilation at the wrong offset.

Let's take a simplified instruction set comprising two instructions and a single register A:

41 xx (size 2): Add xx to `A`.
42 (size 1): Increment `A` by one.

Let's take the following binary program:

41 42

Let's say the start point is the first byte 41. You do:

 41 42 (size 2): Add 42 to `A`.

But what if 41 is a piece of data? Then your program becomes:

 42 (size 1): Increment `A` by one.
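The ambiguity above can be reproduced with a tiny decoder for this two-instruction ISA (a sketch for illustration, not a real disassembler): feeding it the same two bytes from different start offsets yields two different programs.

```python
def disassemble(code, start):
    """Linear-sweep decode of the toy ISA: 41 xx = ADD (size 2), 42 = INC (size 1)."""
    out = []
    pc = start
    while pc < len(code):
        op = code[pc]
        if op == 0x41:                  # 41 xx : add xx to A (size 2)
            out.append(f"ADD A, {code[pc + 1]:#04x}")
            pc += 2
        elif op == 0x42:                # 42    : increment A (size 1)
            out.append("INC A")
            pc += 1
        else:
            out.append(f"DB {op:#04x}   ; not an opcode -> data?")
            pc += 1
    return out

code = bytes([0x41, 0x42])
print(disassemble(code, 0))   # ['ADD A, 0x42']  -- byte 41 read as code
print(disassemble(code, 1))   # ['INC A']        -- byte 41 skipped as data
```

Both decodings are internally consistent; nothing in the bytes themselves tells you which one the original programmer meant.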

This problem is magnified in old games, which were often optimised directly in assembly, and where the programmer might intentionally expect some byte to be interpreted as both code and data, depending on the context!

Even worse, the recompiled program could be generating code itself! Imagine recompiling a JIT compiler. The result would still output code for the source architecture and try to jump to it, most likely causing the program to die very soon. Statically recompiling code that is only available at runtime requires infinite trickery!

Static binary analysis is a very active area of research (mainly in the field of security, to look for vulnerabilities in systems whose sources are not available), and in fact I know of an attempt to produce a NES emulator that tries to statically recompile programs. The article is very interesting.

A compromise between JIT and static recompilation would be to statically recompile as much code as possible, keeping only the bits that cannot be translated statically.
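That compromise could be structured roughly like this (a minimal sketch with invented names and addresses, not a real recompiler): blocks whose start address was discovered statically run as precompiled translations, and any jump to an address never reached by the static analysis falls back to an interpreter or JIT.

```python
# Blocks found by static analysis, keyed by guest address. Each block mutates
# the guest state and returns the next guest address (None = halt).
static_translations = {
    0x100: lambda state: state.update(A=state["A"] + 1) or 0x200,
    # 0x200 was only reachable through an indirect jump -> never translated.
}

def interpret_block(addr, state):
    """Stand-in for the interpreter/JIT handling untranslated code."""
    if addr == 0x200:
        state["A"] *= 2
        return None  # halt
    raise RuntimeError(f"unknown block {addr:#x}")

def run(entry, state):
    addr = entry
    while addr is not None:
        block = static_translations.get(addr)
        if block is not None:
            addr = block(state)                  # fast path: precompiled block
        else:
            addr = interpret_block(addr, state)  # slow path: fallback
    return state

print(run(0x100, {"A": 4}))   # {'A': 10}: (4 + 1) * 2
```

The dispatch loop is the key design point: the more blocks static analysis can prove reachable, the less often execution drops into the slow path.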

answered Sep 25 '22 by dureuill