
Linux: Managing virtual memory mapping within my process for fast emulation

Recently it occurred to me that a lot of emulators are slow because they have to simulate not just the CPU but also the memory of the emulated device. When the device has memory-mapped I/O, virtual memory, or just unused address space, then every memory access has to be simulated in software.

I feel like it might be a lot faster if the OS did this for us, by means of virtual memory. I'll use Game Boy emulation as an example for simplicity's sake but obviously this method would be better for newer, more powerful machines.

The Game Boy memory map is roughly:

  • 0x0000 - 0x7FFF: Mapped to cartridge ROM
    • Most cartridges have 0x0000 - 0x3FFF fixed and 0x4000 - 0x7FFF bank-switchable by writing to 0x2000
  • 0x8000 - 0x9FFF: Video RAM (only accessible when not currently rendering)
  • 0xA000 - 0xBFFF: Mapped to cartridge (usually battery-backed RAM)
  • 0xC000 - 0xDFFF: Internal RAM (0xD000 - 0xDFFF is bankswitched on GB Color)
  • 0xE000 - 0xFDFF: Mirror of internal RAM
  • 0xFE00 - 0xFE9F: Object Attribute Memory (sprite RAM)
  • 0xFEA0 - 0xFEFF: Unmapped (open bus or something, unsure)
  • 0xFF00 - 0xFF7F: Memory-mapped I/O (sound system, video control, etc)
  • 0xFF80 - 0xFFFF: Internal RAM

So a traditional emulator has to translate every memory access something like:

if(addr < 0x4000) return rom[addr]; //fixed ROM bank
else if(addr < 0x8000) return rom[(addr - 0x4000) + (0x4000 * cur_rom_bank)]; //switchable ROM bank
else if(addr < 0xA000) {
    if(vram_accessible) return vram[addr - 0x8000];
    else return 0xFF;
}
else if(addr < 0xC000) return saveram[addr - 0xA000];
else if(addr < 0xE000) return ram[addr - 0xC000];
else if(addr < 0xFE00) return ram[addr - 0xE000]; //mirror of internal RAM
else if(addr < 0xFEA0) return oam[addr - 0xFE00]; //OAM is 0xFE00 - 0xFE9F inclusive
else if(addr < 0xFF00) return 0xFF; //or whatever should be here
else if(addr < 0xFF80) return handle_io_read(addr);
else return hram[addr - 0xFF80];

Obviously that can be optimized by using a switch or table, but still it's a lot of code to run for every memory access. We could potentially improve the emulation speed quite a bit by mapping some pages to those addresses in our process's memory map:
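For comparison, the "table" version of the software approach might look something like the sketch below. It is only an illustration: the 256-byte block granularity, the array names and the `slow_read` placeholder are all assumptions of mine, not taken from any particular emulator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical backing stores, sized for a plain Game Boy. */
static uint8_t rom[0x8000], vram[0x2000], saveram[0x2000], ram[0x2000];

/* One entry per 256-byte block of guest address space.
 * NULL means "not plain memory": take the slow path. */
static uint8_t *read_table[256];

/* Hypothetical slow path for OAM, unmapped space, I/O and high RAM. */
static uint8_t slow_read(uint16_t addr) {
    (void)addr;
    return 0xFF; /* placeholder value */
}

/* Point every 256-byte block in [start, end_inclusive] at `mem`. */
static void map_range(uint16_t start, uint16_t end_inclusive, uint8_t *mem) {
    for (unsigned hi = start >> 8; hi <= (unsigned)(end_inclusive >> 8); hi++)
        read_table[hi] = mem + ((hi << 8) - start);
}

static void init_read_table(void) {
    map_range(0x0000, 0x7FFF, rom);     /* bank 0 plus the initial bank 1 */
    map_range(0x8000, 0x9FFF, vram);
    map_range(0xA000, 0xBFFF, saveram);
    map_range(0xC000, 0xDFFF, ram);
    map_range(0xE000, 0xFDFF, ram);     /* mirror of internal RAM */
    /* 0xFE00 - 0xFFFF stay NULL and go through slow_read() */
}

/* One table lookup and one predictable branch per access. */
static uint8_t read8(uint16_t addr) {
    uint8_t *block = read_table[addr >> 8];
    return block ? block[addr & 0xFF] : slow_read(addr);
}
```

In this sketch a bank switch would rewrite entries 0x40 to 0x7F, and making VRAM inaccessible would set entries 0x80 to 0x9F to NULL.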

  • 0x0000 - 0x3FFF: R-- (no Exec flag because native CPU doesn't execute it)
  • 0x4000 - 0x7FFF: R--
  • 0x8000 - 0x9FFF: ---
  • 0xA000 - 0xBFFF: ---
  • 0xC000 - 0xDFFF: RW-
  • 0xE000 - 0xFDFF: RW- (and mapped to same physical page as 0xC000 - 0xDFFF)
  • 0xFE00 - 0xFE9F: ---
  • 0xFEA0 - 0xFEFF: ---
  • 0xFF00 - 0xFF7F: ---
  • 0xFF80 - 0xFFFF: RW-

Then handle the SIGSEGV (or whatever signal would be generated) we get when accessing those pages. So a read from ROM or a write to RAM can just be performed directly, and a write to ROM will raise an exception which we can handle. We can change the permissions of VRAM (0x8000 - 0x9FFF) to be RW- when it should be accessible and --- when it shouldn't. In theory it could be much faster since it doesn't require the emulator to manually map every memory access in software.
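To make the idea concrete, here is a rough sketch of what I have in mind. Assumptions: Linux, a page size that divides 0x2000 (4 KiB works), and a guest region placed wherever the kernel chooses rather than at address 0. All the names (`guest_base`, `guest_setup`, `vram_set_accessible`) are made up for the example. Calling `mprotect` from a signal handler is not on the POSIX async-signal-safe list, so treat that part as something that appears to work on Linux rather than something guaranteed. Note also that with 4 KiB pages the small regions at 0xFE00 - 0xFFFF all share one page, so they could not get separate permissions this way.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define GUEST_SIZE 0x10000u

static uint8_t *guest_base;               /* host address of guest 0x0000 */
static long page_size;                    /* cached so the handler need not query it */
static volatile sig_atomic_t fault_count;
static volatile uintptr_t last_fault_offset;

static void on_segv(int sig, siginfo_t *info, void *ctx) {
    (void)sig; (void)ctx;
    uintptr_t a = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)guest_base;
    if (a < base || a >= base + GUEST_SIZE)
        abort();                          /* a genuine crash, not a guest access */
    fault_count++;
    last_fault_offset = a - base;
    /* A real emulator would emulate the access here (I/O write, bank
     * switch, ...). This sketch just opens the page so that the faulting
     * instruction succeeds when it is retried. */
    uintptr_t page_start = a & ~((uintptr_t)page_size - 1);
    if (mprotect((void *)page_start, (size_t)page_size,
                 PROT_READ | PROT_WRITE) != 0)
        abort();
}

/* Reserve 64 KiB with no access, open work RAM, install the handler. */
static int guest_setup(void) {
    page_size = sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, GUEST_SIZE, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;
    guest_base = p;
    if (mprotect(guest_base + 0xC000, 0x2000, PROT_READ | PROT_WRITE) != 0)
        return -1;
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}

/* Toggle VRAM (0x8000 - 0x9FFF) between RW- and ---. */
static int vram_set_accessible(int accessible) {
    return mprotect(guest_base + 0x8000, 0x2000,
                    accessible ? PROT_READ | PROT_WRITE : PROT_NONE);
}
```

After `guest_setup()`, a store to `guest_base[0xC000]` runs at native speed, while a store to `guest_base[0x8000]` takes the signal path once and then succeeds.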

I know that I can use mmap() to map pages at fixed addresses with various permissions. What I don't know is:

  • Can the mappings overlap, with different permissions?
  • Can I map pages to arbitrary addresses like this, regardless of the system's page size? Can I map to address 0?
  • How to change which memory a mapping points to? (eg when ROM bank is changed, we can just switch what memory is mapped at 0x4000 - 0x7FFF, but how do I do that?)
  • In a real-world case where the emulated system has a 32- or 64-bit CPU, can I map the entire first 4GB, or potentially the entire memory space? How would I avoid conflicting with whatever is already mapped (eg libraries, my stack, the kernel)?
  • Would this really be any faster? Or does throwing and catching a SIGSEGV generate more overhead than doing it the traditional way?
  • If it's not possible to do this in userspace, does Linux maybe provide a way to "take over" the kernel and do it there? So I could at least create an "emulator OS" which runs bare-metal while still having some Linux kernel facilities (such as video and filesystem drivers) available?
Rena asked Dec 30 '15 01:12



1 Answer

I'd expect generating a SIGSEGV, catching it, handling it, and resuming to have a lot of performance overhead (more than the access would cost on the original hardware), so arrange for it to happen only when there's an actual error, which is allowed to be slow.

This is a nice technique for memory protection / array bounds checking when violations are rare, and it's ok if they're slow. Speeding up the common case a bit, even if it makes the exceptional case much slower, is a win when the exceptional case doesn't happen in normal emulated code.

I've heard of JavaScript emulators doing this to get cheaper array bounds checking: allocate an array so it ends at the top of a page, where the next page is unmapped.
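A minimal sketch of that guard-page trick, assuming Linux (where touching a `PROT_NONE` page raises SIGSEGV). The helper names `alloc_with_guard` and `read_faults` are hypothetical, and `read_faults` exists only to demonstrate the fault without killing the process:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf guard_jmp;

static void guard_handler(int sig) {
    (void)sig;
    siglongjmp(guard_jmp, 1);   /* unwind out of the faulting access */
}

/* Allocate `len` bytes so that the byte just past the end lies on an
 * inaccessible guard page. Returns NULL on failure. */
static uint8_t *alloc_with_guard(size_t len) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t data_pages = (len + page - 1) / page;
    size_t total = (data_pages + 1) * page;
    uint8_t *p = mmap(NULL, total, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return NULL;
    uint8_t *guard = p + data_pages * page;
    if (mprotect(guard, page, PROT_NONE) != 0) {
        munmap(p, total);
        return NULL;
    }
    return guard - len;         /* the array ends exactly where the guard begins */
}

/* Demo helper: returns 1 if reading p[index] faults, 0 otherwise. */
static int read_faults(const volatile uint8_t *p, size_t index) {
    struct sigaction sa, old;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = guard_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old);
    int faulted;
    if (sigsetjmp(guard_jmp, 1) == 0) {
        (void)p[index];         /* volatile read: really touches memory */
        faulted = 0;
    } else {
        faulted = 1;
    }
    sigaction(SIGSEGV, &old, NULL);
    return faulted;
}
```

In-bounds accesses need no explicit check at all. The first out-of-bounds byte traps in hardware, which is the "slow but rare" case.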


Take this with a grain of salt: I haven't used any of this in code I've written. I've just heard about it and think I understand how it works and some of the implications.

Hopefully this will get you started looking at docs that will tell you what actually can be done.

Updating page tables is fairly slow. Try to find a balance where you can take advantage of user-space memory protection for some of the checks, but you aren't constantly mapping/unmapping pages from your memory space during the "common case" of what your emulated code does. Predicted branches run really fast, esp. if they're predicted not taken.

I've seen Linux kernel discussion / notes indicating that playing tricks with mmap isn't worth it over just memcpy of a single page. For larger blocks of memory, or less checking on repeated accesses, the benefit will outweigh the setup overhead.


You'll want to use mprotect(2) to change the permissions on (ranges of) pages. No, mappings can't overlap. See the MAP_FIXED option in mmap(2):

If the memory region specified by addr and len overlaps pages of any existing mapping(s), then the overlapped part of the existing mapping(s) will be discarded.
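That replace-on-overlap behaviour is what you'd lean on for bank switching: map a different file offset over the same addresses. A sketch under several assumptions: Linux with `memfd_create` available (glibc 2.27 or later), a page size that divides the 16 KiB bank size, and the hypothetical names `rom_setup` / `rom_select_bank`. I haven't benchmarked this. Each switch is a system call plus a page-table update, so it only pays off if bank switches are rare compared to ROM reads.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define BANK_SIZE 0x4000u

static uint8_t *guest_base;   /* host address of guest 0x0000 */
static int rom_fd = -1;       /* in-memory file holding the whole ROM image */

/* Copy a ROM image into a memfd and reserve 64 KiB of guest address space. */
static int rom_setup(const uint8_t *image, size_t size) {
    /* File offsets passed to mmap must be page-aligned. */
    if (BANK_SIZE % (size_t)sysconf(_SC_PAGESIZE) != 0) return -1;
    rom_fd = memfd_create("rom", 0);
    if (rom_fd < 0) return -1;
    if (write(rom_fd, image, size) != (ssize_t)size) return -1;
    void *p = mmap(NULL, 0x10000, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;
    guest_base = p;
    return 0;
}

/* Make guest 0x4000 - 0x7FFF show ROM bank `bank`. MAP_FIXED discards
 * whatever was mapped there before, i.e. the previously selected bank. */
static int rom_select_bank(unsigned bank) {
    void *p = mmap(guest_base + 0x4000, BANK_SIZE, PROT_READ,
                   MAP_SHARED | MAP_FIXED, rom_fd, (off_t)bank * BANK_SIZE);
    return p == MAP_FAILED ? -1 : 0;
}
```

Mapping the same file offset at two different addresses would give the 0xE000 mirror of internal RAM in the same way.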

IDK if you can do anything useful with x86 segment registers when accessing emulated memory, to map guest address 0 to some other address in your process's virtual address space. You can map virtual address 0, but by default Linux disables it so that NULL-pointer dereferences don't silently work!

Users of your software will have to futz with sysctl (same as for WINE) to enable it:

# Ubuntu's /etc/sysctl.d/10-zeropage.conf
# Protect the zero page of memory from userspace mmap to prevent kernel
# NULL-dereference attacks against potential future kernel security
# vulnerabilities.  (Added in kernel 2.6.23.)
#
# While this default is built into the Ubuntu kernel, there is no way to
# restore the kernel default if the value is changed during runtime; for
# example via package removal (e.g. wine, dosemu).  Therefore, this value
# is reset to the secure default each time the sysctl values are loaded.
vm.mmap_min_addr = 65536
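If you do go the identity-mapping route, the emulator could at least check the setting at startup and print a useful message. A small sketch, assuming `/proc` is mounted. The function names are hypothetical, and it ignores the fact that sufficiently privileged processes can bypass the limit:

```c
#include <assert.h>
#include <stdio.h>

/* Returns the current vm.mmap_min_addr, or -1 if it cannot be read. */
static long read_mmap_min_addr(void) {
    FILE *f = fopen("/proc/sys/vm/mmap_min_addr", "r");
    if (!f) return -1;
    long v = -1;
    if (fscanf(f, "%ld", &v) != 1) v = -1;
    fclose(f);
    return v;
}

/* Could an unprivileged process map host address `guest_addr` directly,
 * given the limit `min_addr`? Unknown limits count as "no". */
static int can_identity_map(unsigned long guest_addr, long min_addr) {
    return min_addr >= 0 && guest_addr >= (unsigned long)min_addr;
}
```

With the Ubuntu default of 65536, `can_identity_map(0, read_mmap_min_addr())` is false, which rules out identity-mapping the whole Game Boy address space.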

Like I said, you can maybe use a segment register override on all loads/stores into guest (emulated-machine) memory, to remap it to a more reasonable page. Or maybe just use a constant offset of 64 KiB (or more, to put it above the text/data/bss (heap) of the emulation software). Or a non-constant offset using a pointer to the base of your mmapped guest-memory region, so everything is relative to a global variable. With gcc, this might be a good candidate for requesting that gcc keep that global in a register across all your functions. IDK, you'd have to see if that helped perf or not. A constant offset would end up making every instruction accessing guest memory need a 32b displacement field in the addressing mode, rather than 0 or 8b.

A segment register, if it works the way I think it does (as a constant offset you can apply with a segment-override prefix, instead of a 32b displacement modifier) would be much harder to get the compiler to generate, AFAIK. If it was just loads/stores, that would be one thing: you could use an inline asm wrapper for a load and store insn. But for efficient x86 code, all kinds of ALU instructions should use memory operands to reduce frontend bottlenecks via micro-fusion.

You could maybe just define a global char *const guest_mem = (void*)0x2000000; or something, and then use mmap with MAP_FIXED to force mapping memory there? Then guest memory accesses can compile to more efficient one-register addressing modes.
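Here's a sketch of the base-pointer variant. It deliberately differs from the `char *const` idea: it passes 0x2000000 only as a hint, without MAP_FIXED, because MAP_FIXED silently replaces anything already mapped there (possibly the heap of a non-PIE binary). The cost is that `guest_mem` becomes a run-time variable instead of a compile-time constant. Helper names are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/mman.h>

#define GUEST_SIZE 0x10000u
#define PREFERRED_BASE ((void *)0x2000000)

/* Host address of guest 0x0000; every guest access is relative to this. */
static uint8_t *guest_mem;

static int guest_mem_init(void) {
    /* Hint only: without MAP_FIXED the kernel never replaces an existing
     * mapping, but it may return a different address than requested. */
    void *p = mmap(PREFERRED_BASE, GUEST_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;
    guest_mem = p;
    return 0;
}

/* With the base held in a register, these should compile to a single
 * base+index load or store. */
static inline uint8_t guest_read8(uint16_t addr) {
    return guest_mem[addr];
}

static inline void guest_write8(uint16_t addr, uint8_t value) {
    guest_mem[addr] = value;
}
```

Because `addr` is a `uint16_t`, an access can never leave the 64 KiB region, so no bounds check is needed in this sketch.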

Peter Cordes answered Sep 30 '22 19:09
