
Performance penalty when executing x86 instructions stored in the data segment?

I have a simple program that first writes some native x86 instructions into a declared buffer and then sets a function pointer to that buffer and calls through it. I'm noticing a severe performance penalty, however, when the buffer is allocated on the stack (as opposed to on the heap, or even in the global data area). I verified that the start of the instruction sequence in the buffer is on a 16-byte boundary, which I'm assuming is what the CPU requires (or at least prefers). I don't know why it should matter which part of the process's memory I execute the instructions from, but in the program below, "GOOD" executes in 4 seconds on my dual-core workstation and "BAD" takes about 6 minutes. Is there some kind of alignment/i-cache/prediction issue going on here? My evaluation license for VTune just ended, so I can't even run an analysis on this :(.


#include <stdio.h>
#include <string.h>
#include <stdlib.h>

typedef int (*funcPtrType)(int, int);

int foo(int a, int b) { return a + b; }

int main()
{
  // Instructions in buf are identical to what the compiler generated for "foo".
  char buf[201] = {0x55,              // push ebp
                   0x8b, 0xec,        // mov  ebp, esp
                   0x8b, 0x45, 0x08,  // mov  eax, [ebp+8]
                   0x03, 0x45, 0x0c,  // add  eax, [ebp+12]
                   0x5D,              // pop  ebp
                   0xc3               // ret
                  };

  int i;

  funcPtrType ptr;

#ifdef GOOD
  char* heapBuf = (char*)malloc(200);
  printf("Addr of heap buf: %x\n", &heapBuf[0]);
  memcpy(heapBuf, buf, 200);
  ptr = (funcPtrType)(&heapBuf[0]);
#else // BAD
  printf("Addr of local buf: %x\n", &buf[0]);
  ptr = (funcPtrType)(&buf[0]);
#endif

  for (i=0; i < 1000000000; i++)
    ptr(1,2);
}

The results from running this are:

$ cl -DGOOD ne3.cpp
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 11.00.7022 for 80x86
Copyright (C) Microsoft Corp 1984-1997. All rights reserved.

ne3.cpp
Microsoft (R) 32-Bit Incremental Linker Version 5.10.7303
Copyright (C) Microsoft Corp 1992-1997. All rights reserved.

/out:ne3.exe
ne3.obj
$ time ./ne3
Addr of heap buf: 410eb0

real 0m 4.33s
user 0m 4.31s
sys 0m 0.01s
$
$
$ cl ne3.cpp
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 11.00.7022 for 80x86
Copyright (C) Microsoft Corp 1984-1997. All rights reserved.

ne3.cpp
Microsoft (R) 32-Bit Incremental Linker Version 5.10.7303
Copyright (C) Microsoft Corp 1992-1997. All rights reserved.

/out:ne3.exe
ne3.obj
$ time ./ne3
Addr of local buf: 12feb0

real 6m41.19s
user 6m40.46s
sys 0m 0.03s
$

Thanks.

Shasank

2 Answers

Stack protection for security?

As a wild guess, you could be running into an MMU-based stack protection scheme. A number of security holes have relied on deliberate buffer overruns to inject executable code onto the stack, and one way to fight these is with a non-executable stack. Executing from such a page would then trap into the OS, where I suppose it's possible that the OS or some antivirus software does something expensive before letting the call proceed.
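If that is what's happening, one way to take the stack out of the picture is to put the copied instructions in memory the OS has explicitly been told will hold code. A minimal sketch for the Win32 environment the question is using; the helper name is just for illustration, and whether this actually changes the timing here is only a guess:

#include <windows.h>
#include <string.h>

typedef int (*funcPtrType)(int, int);

// Copy 'len' bytes of machine code into a freshly allocated page that
// is execute-permitted, and return it as a callable function pointer.
// Returns NULL if the allocation fails.
funcPtrType makeExecutableCopy(const char* code, size_t len)
{
    void* page = VirtualAlloc(NULL, len,
                              MEM_COMMIT | MEM_RESERVE,
                              PAGE_EXECUTE_READWRITE);
    if (page == NULL)
        return NULL;
    memcpy(page, code, len);
    // Discard any stale instructions the CPU may have cached for this
    // address range before calling into it.
    FlushInstructionCache(GetCurrentProcess(), page, len);
    return (funcPtrType)page;
}

Calling ptr = makeExecutableCopy(buf, sizeof(buf)); in place of the malloc/memcpy in the question's GOOD branch would exercise the same timing loop with memory the OS knows is meant to be executed.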

Negative i-cache coherency interaction?

Another possibility is that mixing code fetches and data writes to nearby addresses is defeating the CPU's cache strategy. I believe x86 implements an essentially automatic code/data coherency model, which is likely to invalidate large numbers of nearby cached instructions on any memory write. You can't really fix this by changing your program not to use the stack (obviously you can move the dynamic code), because the stack is written by the machine code all the time, for example whenever a parameter or return address is pushed for a procedure call.

CPUs these days are really fast relative to DRAM or even the outer cache levels, so anything that defeats the inner cache levels is quite serious. The mechanism probably involves some sort of micro-trap inside the CPU implementation, followed by a hardware loop to invalidate things. It isn't something Intel or AMD would have worried about making fast, since for most programs it never happens, and when it does it normally happens only once, just after a program is loaded.
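One crude way to check how exposed the copied code is (just a diagnostic sketch; the 64-byte line size is an assumption that matches most modern x86 parts, and comparing addresses of unrelated locals is only a rough indication) is to print where the buffer sits relative to a local that the loop writes, dropped into the question's main() just before the loop:

// Rough diagnostic only: report how far the code buffer is from the
// loop counter and whether they appear to share a 64-byte cache line.
printf("buf starts at %p, i lives at %p, %ld bytes apart\n",
       (void*)&buf[0], (void*)&i, (long)((char*)&i - &buf[0]));
printf("same 64-byte line? %s\n",
       ((unsigned long)&buf[0] / 64 == (unsigned long)&i / 64) ? "yes" : "no");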

DigitalRoss


My guess is that, since you also have the variable i on the stack, when you update i in your for loop you trash the same cache line that the code is sitting in. Put the code somewhere in the middle of your buffer (and perhaps enlarge the buffer) so as to keep it separated from the other stack variables.
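A rough sketch of that suggestion, reusing the instruction bytes from the question; the 512-byte buffer and the 256-byte offset are arbitrary choices meant only to put several (assumed 64-byte) cache lines between the code and the neighbouring locals:

// Replace the buffer setup in the question's main() with something
// like this.
char codeBuf[512] = {0};
const unsigned char code[] = {0x55, 0x8b, 0xec,   // push ebp / mov ebp, esp
                              0x8b, 0x45, 0x08,   // mov eax, [ebp+8]
                              0x03, 0x45, 0x0c,   // add eax, [ebp+12]
                              0x5D, 0xc3};        // pop ebp / ret

// Park the code 256 bytes into the buffer so that writes to i (and
// the pushes performed by the call itself) land in different cache
// lines from the instructions being executed.
memcpy(&codeBuf[256], code, sizeof(code));
ptr = (funcPtrType)&codeBuf[256];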

Also note that executing instructions on the stack is usually the hallmark of a security hole (such as a buffer overrun) being exploited.

Therefore the OS is often configured to disallow this behaviour. Virus scanners may take action against it as well. Perhaps your program is running through a security check each time it tries to access that stack page (though I'd expect the sys time field to be larger in that case).

If you want to "officially" make a memory page executable, you should probably look into VirtualProtect().
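A minimal sketch of using it, assuming the same Win32 toolchain as in the question; the helper name is mine, and PAGE_EXECUTE_READWRITE is used because a stack page still has to stay writable for ordinary stack traffic:

#include <windows.h>
#include <stddef.h>

// Ask the OS to mark the page(s) covering [addr, addr + len) as
// executable (and still writable), then flush the instruction cache
// for that range.  Returns 1 on success, 0 on failure.
int makeRangeExecutable(void* addr, size_t len)
{
    DWORD oldProtect;
    if (!VirtualProtect(addr, len, PAGE_EXECUTE_READWRITE, &oldProtect))
        return 0;
    FlushInstructionCache(GetCurrentProcess(), addr, len);
    return 1;
}

For the question's stack buffer, something like makeRangeExecutable(buf, sizeof(buf)) before taking the function pointer makes the execute permission explicit instead of depending on how the OS happens to map the stack.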

Artelius