(This question isn't intended to be specific to VirtualBox or x86 per se, but since they're the best examples I'm aware of, I'll be referencing them and asking how VBox handles some scenarios. If you're aware of other solutions that aren't used by VBox, do consider mentioning them as well.)
I've read how VirtualBox does software virtualization, and I don't understand the following.
Before executing ring 0 code, CSAM [Code Scanning and Analysis Manager] scans it recursively to discover problematic instructions. PATM [Patch Manager] then performs in-situ patching, i.e. it replaces the instruction with a jump to hypervisor memory where an integrated code generator has placed a more suitable implementation. In reality, this is a very complex task as there are lots of odd situations to be discovered and handled correctly. So, with its current complexity, one could argue that PATM is an advanced in-situ recompiler.
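To make the "replace an instruction with a jump to generated code" idea concrete, here is a rough, purely hypothetical user-space sketch in C (my own illustration, not VirtualBox code; it assumes x86-64, GCC/Linux, and no CET enforcement, and of course a real hypervisor patches guest memory rather than its own text segment):

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Overwrite the start of original() with a 5-byte relative jmp (opcode 0xE9)
 * to replacement(), roughly the way an in-situ patcher redirects a
 * problematic instruction to generated code. */

__attribute__((noinline)) int original(void)    { return 1; }
__attribute__((noinline)) int replacement(void) { return 2; }

int main(void) {
    long pagesz = sysconf(_SC_PAGESIZE);
    void *page = (void *)((uintptr_t)original & ~((uintptr_t)pagesz - 1));

    /* Make the code page writable (assumes the 5 patched bytes don't
     * straddle a page boundary). */
    mprotect(page, (size_t)pagesz, PROT_READ | PROT_WRITE | PROT_EXEC);

    /* jmp rel32: the displacement is relative to the end of the 5-byte
     * instruction being written. */
    int32_t rel = (int32_t)((intptr_t)replacement - ((intptr_t)original + 5));
    unsigned char patch[5] = { 0xE9 };
    memcpy(patch + 1, &rel, 4);
    memcpy((void *)original, patch, 5);

    /* Call through a volatile pointer so the compiler can't assume the
     * original return value. */
    int (*volatile fn)(void) = original;
    printf("original() now returns %d\n", fn());   /* prints 2 */
    return 0;
}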
Consider the following sample instruction sequence in ring-0 code:
call foo
foo:
mov EAX, 1234                  ; the magic value to check for
mov EDX, [ESP]                 ; load the return address pushed by "call foo"
cmp EDX, EAX
jne bar
call do_something_special_if_return_address_was_1234
bar:
...
The callee here is testing to see if the return address to the caller is 1234, and if it is, it does something special. Clearly patching will change the return address, so we need to be able to handle it.
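In C terms the pattern is roughly the following (my own sketch using GCC's __builtin_return_address; the magic value and function names are of course made up):

void do_something_special_if_return_address_was_1234(void);

__attribute__((noinline)) void foo(void) {
    void *ret = __builtin_return_address(0);   /* like mov EDX, [ESP]    */
    if (ret == (void *)0x1234)                 /* cmp EDX, EAX / jne bar */
        do_something_special_if_return_address_was_1234();
}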
VirtualBox's documentation says it discovers "problematic" instructions and patches them in place, but I don't really understand how this can work, for two reasons:
1. It appears that any instruction that exposes the instruction pointer is "problematic", of which call is probably the most common (and extremely so). Does this mean VirtualBox must analyze, and potentially patch, every call instruction it sees in ring 0? Doesn't this make performance drop off a cliff? How do they handle this with high performance? (The cases they mention in their documentation are quite obscure, so I'm confused why they didn't mention such a common instruction if it really is an issue. And if it isn't a problem, I don't understand why not.)
2. If the instruction stream happens to be modified (e.g. by dynamically loading/unloading kernel modules), VirtualBox must detect this dynamically and garbage-collect unreachable recompiled instructions; otherwise, it will leak memory. But this means every mov instruction (and push instruction, and everything else that writes to memory) now has to be analyzed and potentially patched, possibly repeatedly, because it may be modifying code that was patched. This would seem to degenerate essentially all guest ring-0 code into near-full software emulation (since the target of a mov is not known during recompilation), which would make the cost of virtualization skyrocket, and yet this is not the impression I get from reading the documentation. Is this not a problem? How is it handled efficiently?
Please note that I am not asking about hardware-assisted virtualization like Intel VT or AMD-V and I am not interested in reading about those. I'm well aware that they avoid these problems altogether, but my question is about pure software virtualization.
At least for QEMU, it seems like the answer is that even in the translated code there is a separate emulated "stack" that is set up with the same values as the code would have when running natively, and this "stack" is the one read by the emulated code, which therefore sees the same values as if it were running natively.
This means that emulated code can't be translated to use call, ret, or any other stack-using instructions directly, since those would not use the emulated stack. Therefore, those instructions are replaced by jumps to various bits of thunk code that do the right thing in terms of calling the equivalent translated code.
The OP's (reasonable) assumption seems to be that call and ret instructions would appear in the translated binary, and that the stack would therefore reflect the addresses of the dynamically translated code. What actually happens (in QEMU) is that the call and ret instructions are removed and replaced with control flow that doesn't use the host stack, and the values on the emulated stack are set to the same values they would have in native code.
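To illustrate the idea (this is a hypothetical sketch, not QEMU's actual data structures or code; the GuestCPU type and its field names are invented), "replacing call and ret" conceptually means manipulating a guest stack held in an emulated machine state, rather than executing a host call or ret:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical emulated CPU state; names are illustrative only. */
typedef struct {
    uint64_t rsp;     /* guest stack pointer                   */
    uint64_t pc;      /* guest instruction pointer             */
    uint8_t *mem;     /* base of the (flat) guest stack memory */
} GuestCPU;

/* What a translated "call" conceptually does: push the *guest* return
 * address onto the *guest* stack and hand the new guest PC back to the
 * translator's dispatcher.  No host call instruction is involved. */
static void emulate_call(GuestCPU *cpu, uint64_t target, uint64_t return_addr) {
    cpu->rsp -= 8;
    *(uint64_t *)(cpu->mem + cpu->rsp) = return_addr;
    cpu->pc = target;
}

/* The matching "ret": pop the guest return address and continue there. */
static void emulate_ret(GuestCPU *cpu) {
    cpu->pc = *(uint64_t *)(cpu->mem + cpu->rsp);
    cpu->rsp += 8;
}

int main(void) {
    static uint8_t guest_stack[1024];
    GuestCPU cpu = { .rsp = sizeof guest_stack, .pc = 0, .mem = guest_stack };

    /* Mimic the example further down: "call return_address" at 0x400554
     * pushes the native return address 0x400559, then continues at 0x400546. */
    emulate_call(&cpu, 0x400546, 0x400559);
    printf("top of guest stack: 0x%llx\n",
           (unsigned long long)*(uint64_t *)(cpu.mem + cpu.rsp));

    emulate_ret(&cpu);
    printf("guest pc after ret: 0x%llx\n", (unsigned long long)cpu.pc);
    return 0;
}

The guest code therefore sees exactly the stack contents it expects, while the host stack is never touched by translated guest instructions.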
That is, the OP's mental model is that the result of the code translation is somewhat like the native code, with some patches and modifications. At least in the case of QEMU, that's not really the case: every basic block is heavily translated via the Tiny Code Generator (TCG), first to an intermediate representation and then to the target architecture (even if the source and destination architectures are the same, as in my case). This deck has a good overview of many of the technical details, including an overview of TCG.
The resulting code is usually nothing like the input code, and typically grows by about 3x in size. Registers are often used fairly sparingly, and you often see back-to-back redundant sequences. Particularly relevant to this question is that essentially all control-flow instructions are handled quite differently, so ret and call instructions in the native code pretty much never translate into a plain call or ret in the translated code.
An example: first, some C code with a return_address() function that simply returns its own return address, and a main() that prints that value:
#include <stdlib.h>
#include <stdio.h>

__attribute__ ((noinline)) void* return_address() {
    return __builtin_return_address(0);
}

int main(int argc, char **argv) {
    void *a = return_address();
    printf("%p\n", a);
}
The noinline here is important since otherwise gcc may just inline the function and hardcode the address directly into the assembly, without making a call or accessing the stack at all!
With gcc -g -O1 -march=native this compiles to:
0000000000400546 <return_address>:
400546: 48 8b 04 24 mov rax,QWORD PTR [rsp]
40054a: c3 ret
000000000040054b <main>:
40054b: 48 83 ec 08 sub rsp,0x8
40054f: b8 00 00 00 00 mov eax,0x0
400554: e8 ed ff ff ff call 400546 <return_address>
400559: 48 89 c2 mov rdx,rax
40055c: be 04 06 40 00 mov esi,0x400604
400561: bf 01 00 00 00 mov edi,0x1
400566: b8 00 00 00 00 mov eax,0x0
40056b: e8 c0 fe ff ff call 400430 <__printf_chk@plt>
400570: b8 00 00 00 00 mov eax,0x0
400575: 48 83 c4 08 add rsp,0x8
400579: c3 ret
Note that return_address() returns [rsp], just like the OP's example. The main() function sticks it in rdx, where printf will read it from.
We expect the caller's return address, as seen by return_address, to be the address of the instruction following the call, 0x400559:
400554: e8 ed ff ff ff call 400546 <return_address>
400559: 48 89 c2 mov rdx,rax
... and indeed that's what we see when we run it natively:
person@host:~/dev/test-c$ ./qemu-test
0x400559
Let's try it in QEMU:
person@host:~/dev/test-c$ qemu-x86_64 ./qemu-test
0x400559
It works! Note that QEMU is by default translating all code and putting it far away from the usual native location (as we will see shortly), so we don't need any special instructions to trigger translation.
How does that work under the covers? We can use the -d in_asm,out_asm option with QEMU to see what it makes of this code.
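For example, something along these lines should produce the dumps shown below (the -D option directs the log to a file; depending on the QEMU version, omitting it sends the log to stderr or to a default log file, and adding op to the -d list also dumps the intermediate TCG ops):

person@host:~/dev/test-c$ qemu-x86_64 -d in_asm,out_asm -D qemu.log ./qemu-test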
First, the code leading up to the call (the IN portion is the native code, and the OUT portion is what QEMU translates it to; sorry for the AT&T syntax, I can't figure out how to change that in QEMU):
IN: main
0x000000000040054b: sub $0x8,%rsp
0x000000000040054f: mov $0x0,%eax
0x0000000000400554: callq 0x400546
OUT: [size=123]
0x557c9cf33a40: mov -0x8(%r14),%ebp
0x557c9cf33a44: test %ebp,%ebp
0x557c9cf33a46: jne 0x557c9cf33aac
0x557c9cf33a4c: mov 0x20(%r14),%rbp
0x557c9cf33a50: sub $0x8,%rbp
0x557c9cf33a54: mov %rbp,0x20(%r14)
0x557c9cf33a58: mov $0x8,%ebx
0x557c9cf33a5d: mov %rbx,0x98(%r14)
0x557c9cf33a64: mov %rbp,0x90(%r14)
0x557c9cf33a6b: xor %ebx,%ebx
0x557c9cf33a6d: mov %rbx,(%r14)
0x557c9cf33a70: sub $0x8,%rbp
0x557c9cf33a74: mov $0x400559,%ebx
0x557c9cf33a79: mov %rbx,0x0(%rbp)
0x557c9cf33a7d: mov %rbp,0x20(%r14)
0x557c9cf33a81: mov $0x11,%ebp
0x557c9cf33a86: mov %ebp,0xa8(%r14)
0x557c9cf33a8d: jmpq 0x557c9cf33a92
0x557c9cf33a92: movq $0x400546,0x80(%r14)
0x557c9cf33a9d: mov $0x7f177ad8a690,%rax
0x557c9cf33aa7: jmpq 0x557c9cef8196
0x557c9cf33aac: mov $0x7f177ad8a693,%rax
0x557c9cf33ab6: jmpq 0x557c9cef8196
A key part is here:
0x557c9cf33a74: mov $0x400559,%ebx
0x557c9cf33a79: mov %rbx,0x0(%rbp)
You can see that it is actually manually putting the return address from the native code onto the "stack" (which seems generally to be accessed via rbp). Following that, note that there is no call instruction to return_address. Rather, we have:
0x557c9cf33a92: movq $0x400546,0x80(%r14)
0x557c9cf33a9d: mov $0x7f177ad8a690,%rax
0x557c9cf33aa7: jmpq 0x557c9cef8196
In most of the code, r14 seems to be a pointer to some internal QEMU data structure (i.e., it is not used to hold values from the emulated program). The code above shoves 0x400546 (which is the address of the return_address function in native code) into a field of the structure pointed to by r14, sticks 0x7f177ad8a690 into rax, and jumps to 0x557c9cef8196. That last address appears all over the place in the generated code (but its definition does not) and seems to be some kind of internal dispatch or thunk routine. Presumably it uses either the native address or, more likely, the mystery value shoved into rax to dispatch to the translated return_address method, which looks like this:
----------------
IN: return_address
0x0000000000400546: mov (%rsp),%rax
0x000000000040054a: retq
OUT: [size=64]
0x55c131ef9ad0: mov -0x8(%r14),%ebp
0x55c131ef9ad4: test %ebp,%ebp
0x55c131ef9ad6: jne 0x55c131ef9b01
0x55c131ef9adc: mov 0x20(%r14),%rbp
0x55c131ef9ae0: mov 0x0(%rbp),%rbx
0x55c131ef9ae4: mov %rbx,(%r14)
0x55c131ef9ae7: mov 0x0(%rbp),%rbx
0x55c131ef9aeb: add $0x8,%rbp
0x55c131ef9aef: mov %rbp,0x20(%r14)
0x55c131ef9af3: mov %rbx,0x80(%r14)
0x55c131ef9afa: xor %eax,%eax
0x55c131ef9afc: jmpq 0x55c131ebe196
0x55c131ef9b01: mov $0x7f9ba51f7713,%rax
0x55c131ef9b0b: jmpq 0x55c131ebe196
The first bit of code seems to set up the guest "stack" pointer in rbp (getting it from r14 + 0x20, which is probably part of the emulated machine state structure), and it ultimately reads the return address from that "stack" (the line mov 0x0(%rbp),%rbx) and stores it away into the area pointed to by r14 (mov %rbx,0x80(%r14)).
Finally, it gets to jmpq 0x55c131ebe196, which transfers control to the QEMU epilogue routine:
0x55c131ebe196: add $0x488,%rsp
0x55c131ebe19d: pop %r15
0x55c131ebe19f: pop %r14
0x55c131ebe1a1: pop %r13
0x55c131ebe1a3: pop %r12
0x55c131ebe1a5: pop %rbx
0x55c131ebe1a6: pop %rbp
0x55c131ebe1a7: retq
Note that I use the word "stack" in quotes above. That is because this "stack" is the emulation of the stack as seen by the emulated program, not the true host stack pointed to by rsp. The true stack pointed to by rsp is controlled by QEMU to implement the emulated control flow, and the emulated code never accesses it directly.
We saw above that the "stack" contents as seen by the emulated process are the same under QEMU, but some details of the stack do change. For example, the address of the stack itself looks different under emulation than it does natively (i.e., the value of rsp, rather than the values pointed to by [rsp]).
This function:
__attribute__ ((noinline)) void* return_address() {
    return __builtin_frame_address(0);
}
normally returns addresses like 0x7fffad33c100, but returns addresses like 0x40007ffd00 under QEMU. This should not be a problem, though, because no valid program should depend on the exact absolute value of a stack address. Not only is it generally undefined and unpredictable, but on recent operating systems it is deliberately made unpredictable by stack ASLR (both Linux and Windows implement this). The program above returns a different address each time I run it natively (but the same address every time under QEMU).
You also mentioned a concern about the case where the instruction stream is modified, and gave the example of loading a kernel module. First, at least for QEMU, code is only translated "on demand". Functions that could be called, but aren't in some particular run, are never translated (you can try this with a function that is called conditionally depending on argc). So in general, loading new code into the kernel, or into a process in user-mode emulation, is handled by the same mechanism: the code will simply be translated the first time it is called.
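As a rough illustration of "translate on first use" (again a hypothetical sketch: none of these names are QEMU's, and translate_block below is just a stub standing in for the real code generator), the lookup can be as simple as a hash table keyed by the guest PC:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical translated-block cache; names are illustrative, not QEMU's. */
typedef struct TranslatedBlock {
    uint64_t guest_pc;                 /* address of the original guest code */
    void   (*host_code)(void);         /* entry point of the generated code  */
    struct TranslatedBlock *next;      /* hash-bucket chain                  */
} TranslatedBlock;

#define TB_BUCKETS 4096
static TranslatedBlock *tb_cache[TB_BUCKETS];

/* Stub standing in for the real code generator, which would emit host
 * machine code for the guest block starting at guest_pc. */
static void dummy_host_code(void) {}

static TranslatedBlock *translate_block(uint64_t guest_pc) {
    TranslatedBlock *tb = malloc(sizeof *tb);
    tb->guest_pc  = guest_pc;
    tb->host_code = dummy_host_code;
    tb->next      = NULL;
    return tb;
}

/* Translate on demand: code that is never reached is never translated. */
static TranslatedBlock *find_or_translate(uint64_t guest_pc) {
    unsigned idx = (unsigned)(guest_pc >> 2) % TB_BUCKETS;
    for (TranslatedBlock *tb = tb_cache[idx]; tb; tb = tb->next)
        if (tb->guest_pc == guest_pc)
            return tb;                               /* hit: reuse it         */

    TranslatedBlock *tb = translate_block(guest_pc); /* miss: first execution */
    tb->next = tb_cache[idx];
    tb_cache[idx] = tb;
    return tb;
}

int main(void) {
    /* Newly loaded code (a kernel module, a dlopen'd library, ...) needs no
     * special handling: its first execution simply misses in the cache. */
    TranslatedBlock *a = find_or_translate(0x400546);
    TranslatedBlock *b = find_or_translate(0x400546);
    printf("same block reused: %s\n", a == b ? "yes" : "no");
    return 0;
}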
If the code is actually self-modifying (i.e., the process writes to its own code), then something more has to be done, since without help QEMU would keep using the stale translation. So, to detect self-modifying code without penalizing every write to memory, the native code lives in pages with R+X permissions only. The consequence is that a write to such a page raises a fault, which QEMU handles by noting that the code has modified itself, invalidating the affected translations, and so on. Plenty of details can be found in this thread and elsewhere.
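Here is a minimal sketch of the write-protection trick for the user-mode case (my own illustration using plain POSIX mprotect and a SIGSEGV handler, not QEMU's code; a full-system emulator can achieve the same effect through its software MMU):

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* A page holding already-translated "guest code" is kept read+execute only,
 * so a guest write to it faults.  The handler stands in for "invalidate the
 * translations covering this page", then restores write permission so the
 * faulting write is retried and succeeds. */

static long  page_size;
static void *code_page;    /* pretend this page holds translated guest code */

static void on_fault(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    char *addr = (char *)si->si_addr;
    if (addr >= (char *)code_page && addr < (char *)code_page + page_size) {
        /* A real translator would flush the translations for this page here. */
        mprotect(code_page, (size_t)page_size, PROT_READ | PROT_WRITE);
    } else {
        _exit(1);          /* a genuine crash, not our protection trick */
    }
}

int main(void) {
    page_size = sysconf(_SC_PAGESIZE);
    code_page = mmap(NULL, (size_t)page_size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct sigaction sa = {0};
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    /* "Translate" the page, then drop write permission (R+X only). */
    mprotect(code_page, (size_t)page_size, PROT_READ | PROT_EXEC);

    /* Guest self-modification: this write faults, the handler invalidates
     * and re-enables writing, and the write is transparently retried. */
    memset(code_page, 0x90, 16);
    puts("write to a previously write-protected code page succeeded");
    return 0;
}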
It's a reasonable mechanism, and I expect other code-translation VMs do something similar.
Note that in the case of self-modifying code, the "garbage collection" problem is simple: the emulator is informed about the SMC event as described above, and since it has to re-translate at this point, it throws away the old translation.