The following critical loop from a piece of numerical software, written in C++, compares two objects by one of their members:
for(int j=n;--j>0;) asd[j%16]=a.e<b.e;
a and b are of class ASD:
struct ASD { float e; ... };
I was investigating the effect of putting this comparison in a lightweight member function:
bool test(const ASD& y)const { return e<y.e; }
and using it like this:
for(int j=n;--j>0;) asd[j%16]=a.test(b);
The compiler is inlining this function, but the problem is that the resulting assembly code is different and causes more than 10% runtime overhead. I have two questions:
Why is the compiler producing different assembly code?
Why is the produced assembly slower?
EDIT: The second question has been answered by implementing @KamyarSouri's suggestion (j%16). The assembly code now looks almost identical (see http://pastebin.com/diff.php?i=yqXedtPm). The only differences are at lines 18, 33, and 48 of the diff:
000646F9 movzx edx,dl
This chart shows the FLOP/s (up to a scaling factor) for 50 test runs of my code.
The gnuplot script to generate the plot: http://pastebin.com/8amNqya7
Compiler Options:
/Zi /W3 /WX- /MP /Ox /Ob2 /Oi /Ot /Oy /GL /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Gm- /EHsc /MT /GS- /Gy /arch:SSE2 /fp:precise /Zc:wchar_t /Zc:forScope /Gd /analyze-
Linker Options: /INCREMENTAL:NO "kernel32.lib" "user32.lib" "gdi32.lib" "winspool.lib" "comdlg32.lib" "advapi32.lib" "shell32.lib" "ole32.lib" "oleaut32.lib" "uuid.lib" "odbc32.lib" "odbccp32.lib" /ALLOWISOLATION /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /SUBSYSTEM:CONSOLE /OPT:REF /OPT:ICF /LTCG /TLBID:1 /DYNAMICBASE /NXCOMPAT /MACHINE:X86 /ERRORREPORT:QUEUE
There are multiple reasons for inlining to be faster, only one of which is obvious:
- No jump instructions.
- Better localization, resulting in better cache utilization.
- More chances for the compiler's optimizer to make optimizations, for example leaving values in registers.
Inlined functions are faster because there is no need to push parameters and the return address onto the stack and pop them off again; however, inlining does make your binary slightly larger.
In computing, inline expansion, or inlining, is a manual or compiler optimization that replaces a function call site with the body of the called function.
Inline function expansion can speed up execution by eliminating function call overhead. This is particularly beneficial for very small functions that are called frequently. Function inlining involves a tradeoff between execution speed and code size, because the code is duplicated at each function call site.
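To make that tradeoff concrete, here is a minimal sketch of what inline expansion does at the source level; the function names are purely illustrative:

// Before inlining: every call to square() pays call/return overhead.
static int square(int x) { return x * x; }

int sum_of_squares(const int* v, int n) {
    int s = 0;
    for (int i = 0; i < n; ++i) s += square(v[i]);
    return s;
}

// After inline expansion: the body replaces the call site, so there is
// no call/return overhead and the optimizer sees the whole computation.
int sum_of_squares_inlined(const int* v, int n) {
    int s = 0;
    for (int i = 0; i < n; ++i) s += v[i] * v[i];
    return s;
}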
Your asd array is declared like this:
int *asd=new int[16];
Therefore, use int as the return type rather than bool. Alternatively, change the array type to bool. In any case, make the return type of the test function match the type of the array.
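For example, a minimal sketch of the int-return variant, keeping the struct from the question:

struct ASD {
    float e;
    // Returning int instead of bool matches the element type of asd,
    // so the assignment asd[j%16] = a.test(b) needs no 8-bit -> 32-bit
    // conversion.
    int test(const ASD& y) const { return e < y.e; }
};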
Skip to bottom for more details.
In the manually inlined version, the "core" of one iteration looks like this:
xor eax,eax
mov edx,ecx
and edx,0Fh
mov dword ptr [ebp+edx*4],eax
mov eax,dword ptr [esp+1Ch]
movss xmm0,dword ptr [eax]
movss xmm1,dword ptr [edi]
cvtps2pd xmm0,xmm0
cvtps2pd xmm1,xmm1
comisd xmm1,xmm0
The compiler-inlined version is completely identical except for the first instruction.
Where instead of:
xor eax,eax
it has:
xor eax,eax
movzx edx,al
Okay, so it's one extra instruction. Both of them do the same thing: zeroing a register. This is the only difference that I see...
The movzx instruction has a single-cycle latency and a 0.33-cycle reciprocal throughput on all the newer architectures. So I can't imagine how this could make a 10% difference.
In both cases, the result of the zeroing is used only 3 instructions later. So it's very possible that this could be on the critical path of execution.
While I'm not an Intel engineer, here's my guess:
Most modern processors deal with zeroing operations (such as xor eax,eax) via register renaming to a bank of zero registers; this completely bypasses the execution units. However, it's possible that this special handling could cause a pipeline bubble when the (partial) register is accessed via movzx edx,al.
Furthermore, there's also a false dependency on eax in the compiler-inlined version:
movzx edx,al
mov eax,ecx // False dependency on "eax".
Whether or not the out-of-order execution is able to resolve this is beyond me.
Here I'll try to explain why that extra movzx is generated, as well as why it stays.
The key here is the bool return value. Apparently, bool datatypes are probably stored as 8-bit values inside MSVC's internal representation. Therefore, when you implicitly convert from bool to int here:
asd[j%16] = a.test(b);
^^^^^^^^^   ^^^^^^^^^
 type int   type bool
there is an 8-bit -> 32-bit integer promotion. This is the reason why MSVC generates the movzx instruction.
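In source terms, the compiler-inlined version effectively goes through an 8-bit intermediate, roughly like this (the variable names are purely illustrative):

bool t = a.test(b);    // the call yields an 8-bit bool
int widened = t;       // 8-bit -> 32-bit promotion: this is the movzx
asd[j % 16] = widened; // store the 32-bit value into the int array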
When the inlining is done manually, the compiler has enough information to optimize out this conversion and keeps everything as a 32-bit datatype in its internal representation.
However, when the code is put into its own function with a bool return value, the compiler is not able to optimize out the 8-bit intermediate datatype. Therefore, the movzx stays.
When you make both datatypes the same (either int or bool), no conversion is needed. Hence the problem is avoided altogether.
lea esp,[esp] occupies 7 bytes of i-cache and it's inside the loop. A few other clues make it look like the compiler isn't sure whether this is a release build or a debug build.
Edit:
The lea esp,[esp] isn't in the loop; its position among the surrounding instructions misled me. It now looks like the compiler intentionally wasted 7 bytes, followed by another 2 wasted bytes, in order to start the actual loop at a 16-byte boundary. Which means that this actually speeds things up, as observed by Johannes Gerer.
The compiler still seems to be uncertain whether this is a debug or release build though.
Another edit:
The pastebin diff is different from the one I saw earlier. This answer could be deleted now, but since it already has comments, I'll leave it.