
Implementing registers in a C virtual machine

I've written a virtual machine in C as a hobby project. This virtual machine executes code that's very similar to Intel syntax x86 assembly. The problem is that the registers this virtual machine uses are only registers in name. In my VM code, registers are used just like x86 registers, but the machine stores them in system memory. There are no performance improvements to using registers over system memory in VM code. (I thought that the locality alone would increase performance somewhat, but in practice, nothing has changed.)

When interpreting a program, this virtual machine stores arguments to instructions as pointers. This allows a virtual instruction to take a memory address, constant value, virtual register, or just about anything as an argument.
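The asker's code isn't shown, so the following is only a minimal sketch of what a pointer-operand design like this might look like. All names (`VM`, `Instr`, `op_mov`, `op_add`, `run`) are hypothetical. It shows why the virtual registers end up in host memory: every operand has to have an address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: the "registers" are just an array in host memory. */
enum { NUM_REGS = 8, MEM_WORDS = 256 };

typedef struct {
    uint32_t regs[NUM_REGS];
    uint32_t mem[MEM_WORDS];
} VM;

/* A decoded instruction stores pointers to its operands, so the same
 * handler works for a register, a memory cell, or a constant. */
typedef struct {
    void (*op)(uint32_t *dst, const uint32_t *src);
    uint32_t *dst;
    const uint32_t *src;
} Instr;

static void op_mov(uint32_t *dst, const uint32_t *src) { *dst = *src; }
static void op_add(uint32_t *dst, const uint32_t *src) { *dst += *src; }

/* Execute n pre-decoded instructions in order. */
static void run(const Instr *code, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        assert(code[i].op != NULL);
        code[i].op(code[i].dst, code[i].src);
    }
}
```

Because `&vm.regs[0]` has to be a valid pointer, the compiler must keep `regs` in addressable storage. C does not allow taking the address of an object declared `register`, which is the obstacle described in the question.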

Since hardware registers don't have addresses, I can't think of a way to actually store my VM registers in hardware registers. Using the register keyword on my virtual register type doesn't work, because I have to get a pointer to the virtual register to use it as an argument. Is there any way to make these virtual registers perform more like their native counterparts?

I'm perfectly comfortable delving into assembly if necessary. I'm aware that JIT compiling this VM code could allow me to utilize hardware registers, but I'd like to be able to use them with my interpreted code as well.

jakogut asked Jan 25 '11 06:01


2 Answers

  1. Machine registers don't have indexing support: you can't access a register by a runtime-specified "index", whatever that would mean, without code generation. Since you're likely decoding the register index from your instructions, the only way is a huge switch (e.g. switch (opcode) { case ADD_R0_R1: r[0] += r[1]; break; ... }). This is likely a bad idea, since it makes the interpreter loop much larger and so causes instruction cache thrashing.

  2. If we're talking about x86, the additional problem is that the number of general-purpose registers is pretty low; some of them will be used for bookkeeping (storing the PC, storing your VM stack state, decoding instructions, etc.), so it's unlikely that you'll have more than one register free for the VM.

  3. Even if register indexing support were available, it's unlikely it would give you a lot of performance. The largest bottleneck in interpreters is commonly instruction decoding; x86 supports fast and compact memory addressing based on register values (e.g. mov eax, dword ptr [ebx * 4 + ecx]), so you would not win much. It is worthwhile, though, to check the generated assembly, e.g. to make sure the address of the 'register pool' is kept in a register.

  4. The best way to accelerate interpreters is JITting; even a simple JIT (i.e. without smart register allocation - basically just emitting the same code you would execute with the instruction loop and a switch statement, except the instruction decoding) can boost your performance 3x or more (these are actual results from a simple JITter on top of a Lua-like register-based VM). An interpreter is best kept as reference code (or for cold code to decrease JIT memory cost - the JIT generation cost is a non-issue for simple JITs).
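As a rough illustration of points 1 and 3, here is a minimal sketch of an interpreter loop that keeps the virtual registers in a small array and indexes it with register numbers decoded from each instruction. The opcodes, the `Insn` encoding, and `run` are assumptions made up for this example; they are not taken from the question.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { OP_HALT, OP_LOADI, OP_ADD, OP_SUB };  /* hypothetical opcodes */
enum { NUM_REGS = 16 };

/* Hypothetical fixed-width encoding: opcode, two register indices, immediate. */
typedef struct {
    uint8_t  op;
    uint8_t  a;
    uint8_t  b;
    uint32_t imm;
} Insn;

/* Runs until OP_HALT (or an unknown opcode), then copies the register
 * file into regs_out. */
static void run(const Insn *code, uint32_t regs_out[NUM_REGS])
{
    /* A local array: the compiler is free to keep its base address
     * (and the pc) in host registers for the whole loop. */
    uint32_t r[NUM_REGS] = {0};
    const Insn *pc = code;

    for (;;) {
        const Insn in = *pc++;
        assert(in.a < NUM_REGS && in.b < NUM_REGS);
        switch (in.op) {
        case OP_LOADI: r[in.a] = in.imm;   break;
        case OP_ADD:   r[in.a] += r[in.b]; break;
        case OP_SUB:   r[in.a] -= r[in.b]; break;
        case OP_HALT:
        default:
            memcpy(regs_out, r, sizeof r);
            return;
        }
    }
}
```

Each `r[in.a]` access can compile to a single base-plus-scaled-index memory operand, which is the addressing form mentioned in point 3. Inspecting the compiler's assembly output is the way to confirm that the base address really stays in a register.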

zeuxcg answered Oct 19 '22 03:10


Even if you could access hardware registers directly, the code you would have to wrap around the decision to use a register instead of memory would make things that much slower.

To get performance you need to design for performance up front.

A few examples.

Prepare an x86 VM by setting up all the traps to catch the code leaving its virtual memory space. Execute the code directly: don't emulate it, just branch to it and run. When the code reaches outside its memory or I/O space to talk to a device, etc., trap that, emulate the device or whatever it was reaching for, then return control to the program. If the code is processor bound it will run really fast; if it is I/O bound it will be slow, but not as slow as emulating each instruction.

Static binary translation. Disassemble and translate the code before running it. For example, the instruction bytes 0x34,0x2E would turn into ASCII text in a .c file:

al ^= 0x2E; of =0; cf=0; sf=al

Ideally you perform tons of dead code removal (if the next instruction modifies the flags as well, then don't modify them here, etc.) and let the optimizer in the compiler do the rest. You can get a performance gain over an emulator this way; how much depends on how well you can optimize the code. Because the result is a new program, it runs on the hardware, registers, memory and all, so processor-bound code is slower than in a VM that executes the original instructions directly. In some cases you don't have to deal with the processor raising exceptions to trap memory or I/O, because you have simulated the memory accesses in the code, but that still has a cost and calls a simulated device anyway, so there are no savings there.
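To make the translation step concrete, here is a minimal sketch of a translator that turns one decoded instruction into a line of C text. The function name `translate_one` and its interface are hypothetical, and it handles only two opcodes: 0x34 (XOR AL, imm8) and 0x90 (NOP). Following the example above, only OF, CF and SF are emitted; SF is written out as the top bit of AL, and the other flags a real translator would need are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Translate the instruction at code[0] into C source text in out.
 * Returns the number of input bytes consumed, or -1 if the opcode is
 * unsupported, the input is truncated, or out is too small. */
static int translate_one(const uint8_t *code, size_t len,
                         char *out, size_t outsz)
{
    static const char nop_text[] = "/* nop */\n";
    int n;

    assert(code != NULL && out != NULL);
    if (len == 0)
        return -1;

    switch (code[0]) {
    case 0x34:  /* XOR AL, imm8: clears OF and CF, SF follows bit 7 */
        if (len < 2)
            return -1;
        n = snprintf(out, outsz,
                     "al ^= 0x%02X; of = 0; cf = 0; sf = al >> 7;\n",
                     (unsigned)code[1]);
        return (n < 0 || (size_t)n >= outsz) ? -1 : 2;
    case 0x90:  /* NOP */
        if (sizeof nop_text > outsz)
            return -1;
        memcpy(out, nop_text, sizeof nop_text);
        return 1;
    default:
        return -1;
    }
}
```

A full translator would call this in a loop over the disassembled program, write the lines into a generated .c file, and leave dead-flag elimination to a later pass or to the C compiler's optimizer.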

Dynamic translation is similar to SBT, but you do it at runtime. I have heard of this being done, for example, when simulating x86 code on some other processor, say a DEC Alpha: the code is gradually changed from x86 instructions into native Alpha instructions, so the next time around it executes the Alpha instructions directly instead of emulating the x86 instructions. Each time through the code, the program executes faster.

Or maybe just redesign your emulator to be more efficient from an execution standpoint. Look at the emulated processors in MAME, for example: the readability and maintainability of the code have been sacrificed for performance. When it was written that was important; today, with multi-core gigahertz processors, you don't have to work so hard to emulate a 1.5 MHz 6502 or a 3 MHz Z80. Something as simple as looking the next opcode up in a table and deciding not to emulate some or all of the flag calculation for an instruction can give you a noticeable boost.
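A minimal sketch of those two ideas, table-based opcode dispatch and skipping eager flag calculation, might look like the following. The three-opcode accumulator CPU is invented for illustration and is not modeled on MAME or on any real processor. Instead of updating flag bits on every instruction, each handler only records its result, and the flags are derived on demand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t a;            /* accumulator */
    uint8_t last_result;  /* flags are derived lazily from this value */
    uint16_t pc;
    int halted;
    const uint8_t *mem;   /* program bytes; must end with OP_HALT */
} Cpu;

typedef void (*Handler)(Cpu *);

enum { OP_HALT = 0x00, OP_LDA = 0x01, OP_XOR = 0x02 };  /* hypothetical */

static uint8_t fetch(Cpu *c) { return c->mem[c->pc++]; }

static void op_halt(Cpu *c) { c->halted = 1; }
static void op_lda(Cpu *c)  { c->a = fetch(c);  c->last_result = c->a; }
static void op_xor(Cpu *c)  { c->a ^= fetch(c); c->last_result = c->a; }

/* Flags are computed only when something actually asks for them. */
static int flag_zero(const Cpu *c) { return c->last_result == 0; }
static int flag_sign(const Cpu *c) { return (c->last_result & 0x80) != 0; }

/* One table lookup per instruction; unlisted opcodes are NULL. */
static const Handler dispatch[256] = {
    [OP_HALT] = op_halt,
    [OP_LDA]  = op_lda,
    [OP_XOR]  = op_xor,
};

static void run(Cpu *c)
{
    assert(c != NULL && c->mem != NULL);
    while (!c->halted) {
        Handler h = dispatch[fetch(c)];
        if (h == NULL) {      /* treat unknown opcodes as a halt */
            c->halted = 1;
            break;
        }
        h(c);
    }
}
```

Whether a function-pointer table beats a dense `switch` depends on the compiler and the host CPU, so it is worth measuring both before committing to one.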

Bottom line: if you are interested in using the x86 hardware registers AX, BX, etc. to emulate the AX, BX, etc. registers when running a program, the only efficient way to do that is to actually execute the instructions, and not execute-and-trap as in single-stepping with a debugger, but execute long strings of instructions while preventing them from leaving the VM space. There are different ways to do this, performance results will vary, and none of this means it will be faster than a performance-efficient emulator. It also limits you to matching the processor to the program. Emulating the registers with efficient code and a really good compiler (good optimizer) will give you reasonable performance, plus portability, in that you don't have to match the hardware to the program being run.

old_timer answered Oct 19 '22 01:10