Assembly Performance Tuning

Question

I am writing a compiler (more for fun than anything else), but I want to try to make it as efficient as possible. For example I was told that on Intel architecture the use of any register other than EAX for performing math incurs a cost (presumably because it swaps into EAX to do the actual piece of math). Here is at least one source that states the possibility (http://www.swansontec.com/sregisters.html).

I would like to verify and measure these differences in performance characteristics. Thus, I have written this program in C++:

#include "stdafx.h"
#include <intrin.h>
#include <iostream>

using namespace std;

int _tmain(int argc, _TCHAR* argv[])
{
    __int64 startval;
    __int64 stopval;
    unsigned int value; // Keep the value to keep from it being optomized out

    startval = __rdtsc(); // Get the CPU Tick Counter using assembly RDTSC opcode

    // Simple Math: a = (a << 3) + 0x0054E9
    _asm {
        mov ebx, 0x1E532 // Seed
        shl ebx, 3
        add ebx, 0x0054E9
        mov value, ebx
    }

    stopval = __rdtsc();
    __int64 val = (stopval - startval);
    cout << "Result: " << value << " -> " << val << endl;

    int i;
    cin >> i;

    return 0;
}

I tried this code swapping eax and ebx but I'm not getting a "stable" number. I would hope that the test would be deterministic (the same number every time) because it's so short that it's unlikely a context switch is occurring during the test. As it stands there is no statistical difference but the number fluctuates so wildly that it would be impossible to make that determination. Even if I take a large number of samples the number is still impossibly varied.

I'd also like to test xor eax, eax vs mov eax, 0, but have the same problem.

Is there any way to do these kinds of performance tests on Windows (or anywhere else)? When I used to program Z80 for my TI-Calc I had a tool where I could select some assembly and it would tell me how many clock cycles to execute the code -- can that not be done with our new-fangeled modern processors?

EDIT: There are a lot of answers indicating to run the loop a million times. To clarify, this actually makes things worse. The CPU is much more likely to context switch and the test becomes about everything but what I am testing.

Jerry Coffin · Accepted Answer

To even have a hope of repeatable, determinstic timing at the level that RDTSC gives, you need to take some extra steps. First, RDTSC is not a serializing instruction, so it can be executed out of order, which will usually render it meaningless in a snippet like the one above.

You normally want to use a serializing instruction, then your RDTSC, then the code in question, another serializing instruction, and the second RDTSC.

Nearly the only serializing instruction available in user mode is CPUID. That, however, adds one more minor wrinkle: CPUID is documented by Intel as requiring varying amounts of time to execute -- the first couple of executions can be slower than others.

As such, the normal timing sequence for your code would be something like this:

XOR EAX, EAX
CPUID
XOR EAX, EAX
CPUID
XOR EAX, EAX
CPUID            ; Intel says by the third execution, the timing will be stable.
RDTSC            ; read the clock
push eax         ; save the start time
push edx

    mov ebx, 0x1E532 // Seed // execute test sequence
    shl ebx, 3
    add ebx, 0x0054E9
    mov value, ebx

XOR EAX, EAX      ; serialize
CPUID   
rdtsc             ; get end time
pop ecx           ; get start time back
pop ebp
sub eax, ebp      ; find end-start
sbb edx, ecx

We're starting to get close, but there's on last point that's difficult to deal with using inline code on most compilers: there can also be some effects from crossing cache lines, so you normally want to force your code to be aligned to a 16-byte (paragraph) boundary. Any decent assembler will support that, but inline assembly in a compiler usually won't.

Having said all that, I think you're wasting your time. As you can guess, I've done a fair amount of timing at this level, and I'm quite certain what you've heard is an outright myth. In reality, all recent x86 CPUs use a set of what are called "rename registers". To make a long story short, this means the name you use for a register doesn't really matter much -- the CPU has a much larger set of registers (e.g., around 40 for Intel) that it uses for the actual operations, so your putting a value in EBX vs. EAX has little effect on the register that the CPU is really going to use internally. Either could be mapped to any rename register, depending primarily on which rename registers happen to be free when that instruction sequence starts.

Assembly Performance Tuning

Tags:

c++

assembly

compiler-construction

compiler-theory

Nate Zaugg

Video Answer

1 Answers

Jerry Coffin

Recent Activity

Donate For Us

Assembly Performance Tuning

Tags:

c++

assembly

compiler-construction

compiler-theory

Nate Zaugg

Video Answer

1 Answers

Jerry Coffin

Related questions

Recent Activity

Donate For Us