We are writing a byte-code interpreter for a high-level compiled language, and after a bit of profiling and optimization, it became clear that the largest remaining performance overhead is the switch statement we use to jump to the byte-code cases.
We investigated taking the address of each case label and storing it in the byte-code stream itself, in place of the instruction ID that we usually switch on. If we do that, we can skip the jump table and jump directly to the code for the currently executing instruction. This works fantastically in GCC, but MSVC doesn't seem to support a feature like this.
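For readers unfamiliar with the GCC feature mentioned above, here is a minimal sketch of the idea using the "labels as values" extension (&&label and goto *ptr). The opcode names, the run function, and the fixed 64-entry limits are made up for illustration; they are not from the project being discussed, and the sketch assumes the program ends with OP_HALT.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes for a toy stack machine (illustration only). */
enum { OP_PUSH1, OP_ADD, OP_HALT };

/* Runs `code` and returns the top of the stack.
   Needs the GCC/Clang "labels as values" extension; MSVC rejects it.
   Assumes the program is well formed and ends with OP_HALT. */
int run(const unsigned char *code, size_t len)
{
    /* Handler addresses, indexed by opcode. Labels are local to this function. */
    static void *const targets[] = { &&do_push1, &&do_add, &&do_halt };
    void *threaded[64];
    int stack[64];
    int sp = 0;
    size_t i;

    assert(len <= 64);

    /* One-time translation: replace each opcode with its handler address. */
    for (i = 0; i < len; i++)
        threaded[i] = targets[code[i]];

    void **ip = threaded;
    goto **ip++;            /* jump straight to the first handler */

do_push1:
    stack[sp++] = 1;
    goto **ip++;            /* no switch, no bounds check, no jump table */
do_add:
    sp--;
    stack[sp - 1] += stack[sp];
    goto **ip++;
do_halt:
    return sp > 0 ? stack[sp - 1] : 0;
}
```

Each handler ends with its own indirect jump, so the dispatch cost is a single load and jump per instruction.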
We attempted to use inline assembly to grab the addresses of the labels (and to jump to them), and it works. However, using inline assembly causes the MSVC optimizer to skip the entire function.
Is there a way to allow the optimizer to still run over the code? Unfortunately, we can't move the inline assembly into a function other than the one the labels are defined in, since there's no way to reference a label in another function, even from inline assembly. Any thoughts or ideas? Your input is much appreciated, thanks!
The only way of doing this in MSVC is by using inline assembly (which basically buggers you for x64, since MSVC doesn't support inline assembly on x64 targets):
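#include <stdio.h>
#include <tchar.h>

int _tmain(int argc, _TCHAR* argv[])
{
    void* p;
case_1:
    // 32-bit x86 MSVC only: store the address of the label in p
    __asm { mov [p], offset case_1 }
    printf("0x%p\n", p);
    return 0;
}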
If you plan on doing something like this, then the best way would be to write the whole interpreter in assembly and link it into the main binary (this is what LuaJIT did, and it is the main reason the VM is so blindingly fast, when it's not running JIT'ed code that is).
LuaJIT is open-source, so you might pick up some tips from it if you go that route. Alternatively, you might want to look into the source of Forth (whose creator developed the principle you're trying to use). If there is an MSVC build, you can see how they accomplished it; otherwise you're stuck with GCC (which isn't a bad thing, it works on all major platforms).
Take a look at what Erlang does for building on Windows. They use MSVC for most of the build, and then GCC for one file to make use of the labels-as-values extension. The resulting object code is then hacked to be made compatible with the MSVC linker.
http://www.erlang.org/doc/installation_guide/INSTALL-WIN32.html
It seems you could just move the actual code to functions, instead of case labels. The byte code can then be trivially transformed into direct calls, i.e. byte code 1 would translate to CALL BC1. Since you're generating direct calls, you don't have the overhead of function pointers. The pipelines of most CPUs can follow such unconditional direct branches.
As a result, the actual implementations of each byte code are optimized, and the conversion from byte code to machine code is a trivial 1:1 conversion. You get a bit of code expansion since each CALL is 5 bytes (assuming x86-32), but that's unlikely to be a major problem.
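A rough sketch of that 1:1 conversion, assuming x86-32 and the near CALL rel32 encoding (opcode 0xE8 followed by a little-endian 32-bit displacement measured from the end of the instruction). The function name emit_calls, the handlers table, and the base parameter are hypothetical; the sketch only produces the bytes and does not execute them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Emits one x86-32 "CALL rel32" per byte code into `out`.
   handlers[op] is the (assumed) 32-bit address of the function for `op`;
   base is the address at which `out` will later be executed.
   `out` must have room for 5 * len bytes. Returns the bytes written. */
size_t emit_calls(const uint8_t *code, size_t len,
                  const uint32_t *handlers, uint32_t base, uint8_t *out)
{
    size_t n = 0;
    size_t i;

    assert(code != NULL && handlers != NULL && out != NULL);

    for (i = 0; i < len; i++) {
        /* The displacement is relative to the end of this 5-byte CALL. */
        uint32_t next = base + (uint32_t)n + 5;
        uint32_t rel = handlers[code[i]] - next;

        out[n++] = 0xE8;                 /* CALL rel32 opcode */
        out[n++] = (uint8_t)(rel);       /* displacement, little-endian */
        out[n++] = (uint8_t)(rel >> 8);
        out[n++] = (uint8_t)(rel >> 16);
        out[n++] = (uint8_t)(rel >> 24);
    }
    return n;
}
```

To actually run the result, the buffer would have to live in executable memory (on Windows, for example, memory obtained from VirtualAlloc with PAGE_EXECUTE_READWRITE) and end with a RET. Operands would also need a calling convention of their own, which this sketch leaves out.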