I have a simple function with an inner loop - it scales the input value, looks up an output value in a lookup table, and copies it to the destination. (ftol_ambient is a trick I copied from the web for fast conversion of float to int).
for (i = 0; i < iCount; ++i) { iScaled = ftol_ambient(*pSource * PRECISION3); if (iScaled <= 0) *pDestination = 0; else if (iScaled >= PRECISION3) *pDestination = 255; else { iSRGB = FloatToSRGBTable3[iScaled]; *pDestination = iSRGB; } pSource++; pDestination++; }
Now my lookup table is finite, and floats are infinite, so there's a possibility of off-by-one errors. I created a copy of the function with some code to handle that case. Notice that the only difference is the added 2 lines of code - please ignore the ugly pointer casting.
for (i = 0; i < iCount; ++i) { iScaled = ftol_ambient(*pSource * PRECISION3); if (iScaled <= 0) *pDestination = 0; else if (iScaled >= PRECISION3) *pDestination = 255; else { iSRGB = FloatToSRGBTable3[iScaled]; if (((int *)SRGBCeiling)[iSRGB] <= *((int *)pSource)) ++iSRGB; *pDestination = (unsigned char) iSRGB; } pSource++; pDestination++; }
Here's the strange part. I'm testing both versions with identical input of 100000 elements, repeated 100 times. On my Athlon 64 1.8 GHz (32 bit mode), the first function takes 0.231 seconds, and the second (longer) function takes 0.185 seconds. Both functions are adjacent in the same source file, so there's no possibility of different compiler settings. I've run the tests many times, reversing the order they're run in, and the timings are roughly the same every time.
I know there's a lot of mystery in modern processors, but how is this possible?
Here for comparison are the relevant assembler outputs from the Microsoft VC++6 compiler.
; 173 : for (i = 0; i < iCount; ++i) $L4455: ; 174 : { ; 175 : iScaled = ftol_ambient(*pSource * PRECISION3); fld DWORD PTR [esi] fmul DWORD PTR __real@4@400b8000000000000000 fstp QWORD PTR $T5011[ebp] ; 170 : int i; ; 171 : int iScaled; ; 172 : unsigned int iSRGB; fld QWORD PTR $T5011[ebp] ; 173 : for (i = 0; i < iCount; ++i) fistp DWORD PTR _i$5009[ebp] ; 176 : if (iScaled <= 0) mov edx, DWORD PTR _i$5009[ebp] test edx, edx jg SHORT $L4458 ; 177 : *pDestination = 0; mov BYTE PTR [ecx], 0 ; 178 : else if (iScaled >= PRECISION3) jmp SHORT $L4461 $L4458: cmp edx, 4096 ; 00001000H jl SHORT $L4460 ; 179 : *pDestination = 255; mov BYTE PTR [ecx], 255 ; 000000ffH ; 180 : else jmp SHORT $L4461 $L4460: ; 181 : { ; 182 : iSRGB = FloatToSRGBTable3[iScaled]; ; 183 : *pDestination = (unsigned char) iSRGB; mov dl, BYTE PTR _FloatToSRGBTable3[edx] mov BYTE PTR [ecx], dl $L4461: ; 184 : } ; 185 : pSource++; add esi, 4 ; 186 : pDestination++; inc ecx dec edi jne SHORT $L4455
$L4472: ; 199 : { ; 200 : iScaled = ftol_ambient(*pSource * PRECISION3); fld DWORD PTR [esi] fmul DWORD PTR __real@4@400b8000000000000000 fstp QWORD PTR $T4865[ebp] ; 195 : int i; ; 196 : int iScaled; ; 197 : unsigned int iSRGB; fld QWORD PTR $T4865[ebp] ; 198 : for (i = 0; i < iCount; ++i) fistp DWORD PTR _i$4863[ebp] ; 201 : if (iScaled <= 0) mov edx, DWORD PTR _i$4863[ebp] test edx, edx jg SHORT $L4475 ; 202 : *pDestination = 0; mov BYTE PTR [edi], 0 ; 203 : else if (iScaled >= PRECISION3) jmp SHORT $L4478 $L4475: cmp edx, 4096 ; 00001000H jl SHORT $L4477 ; 204 : *pDestination = 255; mov BYTE PTR [edi], 255 ; 000000ffH ; 205 : else jmp SHORT $L4478 $L4477: ; 206 : { ; 207 : iSRGB = FloatToSRGBTable3[iScaled]; xor ecx, ecx mov cl, BYTE PTR _FloatToSRGBTable3[edx] ; 208 : if (((int *)SRGBCeiling)[iSRGB] <= *((int *)pSource)) mov edx, DWORD PTR _SRGBCeiling[ecx*4] cmp edx, DWORD PTR [esi] jg SHORT $L4481 ; 209 : ++iSRGB; inc ecx $L4481: ; 210 : *pDestination = (unsigned char) iSRGB; mov BYTE PTR [edi], cl $L4478: ; 211 : } ; 212 : pSource++; add esi, 4 ; 213 : pDestination++; inc edi dec eax jne SHORT $L4472
int one = 1; int two = 2; if (one == two) ++iSRGB;
The run time of the first function is now down to 0.152 seconds. Interesting.
; 175 : for (i = 0; i < iCount; ++i) $L4457: ; 176 : { ; 177 : iScaled = ftol_ambient(*pSource * PRECISION3); fld DWORD PTR [edi] fmul DWORD PTR __real@4@400b8000000000000000 fstp QWORD PTR $T5014[ebp] ; 170 : int i; ; 171 : int iScaled; ; 172 : int one = 1; fld QWORD PTR $T5014[ebp] ; 173 : int two = 2; fistp DWORD PTR _i$5012[ebp] ; 178 : if (iScaled <= 0) mov esi, DWORD PTR _i$5012[ebp] test esi, esi jg SHORT $L4460 ; 179 : *pDestination = 0; mov BYTE PTR [edx], 0 ; 180 : else if (iScaled >= PRECISION3) jmp SHORT $L4463 $L4460: cmp esi, 4096 ; 00001000H jl SHORT $L4462 ; 181 : *pDestination = 255; mov BYTE PTR [edx], 255 ; 000000ffH ; 182 : else jmp SHORT $L4463 $L4462: ; 183 : { ; 184 : iSRGB = FloatToSRGBTable3[iScaled]; xor ecx, ecx mov cl, BYTE PTR _FloatToSRGBTable3[esi] ; 185 : if (one == two) ; 186 : ++iSRGB; ; 187 : *pDestination = (unsigned char) iSRGB; mov BYTE PTR [edx], cl $L4463: ; 188 : } ; 189 : pSource++; add edi, 4 ; 190 : pDestination++; inc edx dec eax jne SHORT $L4457
My guess is, that in the first case two different branches end up in the same branch-prediction slot on the CPU. If these two branches predict different each time the code will slow down.
In the second loop, the added code may just be enough to move one of the branches to a different branch prediction slot.
To be sure you can give the Intel VTune analyzer or the AMD CodeAnalyst tool a try. These tools will show you what's exactly going on in your code.
However, keep in mind that it's most probably not worth to optimize this code further. If you tune your code to be faster on your CPU it may at the same time become slower on a different brand.
EDIT:
If you want to read on the branch-prediction give Agner Fog's excellent web-site a try: http://www.agner.org/optimize/
This pdf explains the branch-prediction slot allocation in detail: http://www.agner.org/optimize/microarchitecture.pdf
My first guess is that the branch is being predicted better in the second case. Possibly because the nested if gives whatever algorithm the processor's using more information to guess from. Just out of curiousity, what happens when you remove the line
if (((int *)SRGBCeiling)[iSRGB] <= *((int *)pSource))
?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With