EDIT
I tested the Release build in 32-bit, and the code was compact. So the issue below is specific to 64-bit.
I'm using VS 2012 RC. Debug is 32-bit, and Release is 64-bit. Below is the Debug disassembly, then the Release disassembly, of one line of code:
crc = (crc >> 8) ^ crcTable[((val & 0x0000ff00) >> 8) ^ crc & 0xff];
0000006f mov eax,dword ptr [ebp-40h]
00000072 shr eax,8
00000075 mov edx,dword ptr [ebp-3Ch]
00000078 mov ecx,0FF00h
0000007d and edx,ecx
0000007f shr edx,8
00000082 mov ecx,dword ptr [ebp-40h]
00000085 mov ebx,0FFh
0000008a and ecx,ebx
0000008c xor edx,ecx
0000008e mov ecx,dword ptr ds:[03387F38h]
00000094 cmp edx,dword ptr [ecx+4]
00000097 jb 0000009E
00000099 call 6F54F5EC
0000009e xor eax,dword ptr [ecx+edx*4+8]
000000a2 mov dword ptr [ebp-40h],eax
-----------------------------------------------------------------------------
crc = (crc >> 8) ^ crcTable[((val & 0x0000ff00) >> 8) ^ crc & 0xff];
000000a5 mov eax,dword ptr [rsp+20h]
000000a9 shr eax,8
000000ac mov dword ptr [rsp+38h],eax
000000b0 mov rdx,124DEE68h
000000ba mov rdx,qword ptr [rdx]
000000bd mov eax,dword ptr [rsp+00000090h]
000000c4 and eax,0FF00h
000000c9 shr eax,8
000000cc mov ecx,dword ptr [rsp+20h]
000000d0 and ecx,0FFh
000000d6 xor eax,ecx
000000d8 mov ecx,eax
000000da mov qword ptr [rsp+40h],rdx
000000df mov rax,qword ptr [rsp+40h]
000000e4 mov rax,qword ptr [rax+8]
000000e8 mov qword ptr [rsp+48h],rcx
000000ed cmp qword ptr [rsp+48h],rax
000000f2 jae 0000000000000100
000000f4 mov rax,qword ptr [rsp+48h]
000000f9 mov qword ptr [rsp+48h],rax
000000fe jmp 0000000000000105
00000100 call 000000005FA5D364
00000105 mov rax,qword ptr [rsp+40h]
0000010a mov rcx,qword ptr [rsp+48h]
0000010f mov ecx,dword ptr [rax+rcx*4+10h]
00000113 mov eax,dword ptr [rsp+38h]
00000117 xor eax,ecx
00000119 mov dword ptr [rsp+20h],eax
What is all the extra code in the 64-bit version doing? What is it testing for? I haven't benchmarked this, but the 32-bit code should execute much faster.
EDIT
The whole function:
public static uint CRC32(uint val)
{
uint crc = 0xffffffff;
crc = (crc >> 8) ^ crcTable[(val & 0x000000ff) ^ crc & 0xff];
crc = (crc >> 8) ^ crcTable[((val & 0x0000ff00) >> 8) ^ crc & 0xff];
crc = (crc >> 8) ^ crcTable[((val & 0x00ff0000) >> 16) ^ crc & 0xff];
crc = (crc >> 8) ^ crcTable[(val >> 24) ^ crc & 0xff];
// flip bits
return (crc ^ 0xffffffff);
}
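The question does not show how crcTable is built, so the function above cannot be compiled on its own. Below is a minimal self-contained sketch. It assumes the table is the standard reflected CRC-32 table (polynomial 0xEDB88320); the class name Crc32Sketch and the BuildTable helper are made up for illustration and are not from the original post.

```csharp
using System;

public static class Crc32Sketch
{
    // Assumption: standard reflected CRC-32 polynomial. The original post
    // does not say which table it uses.
    private const uint Polynomial = 0xEDB88320u;

    private static readonly uint[] crcTable = BuildTable();

    // Hypothetical helper: builds the 256-entry lookup table bit by bit.
    private static uint[] BuildTable()
    {
        var table = new uint[256];
        for (uint i = 0; i < 256; i++)
        {
            uint entry = i;
            for (int bit = 0; bit < 8; bit++)
            {
                entry = (entry & 1) != 0 ? (entry >> 1) ^ Polynomial : entry >> 1;
            }
            table[i] = entry;
        }
        return table;
    }

    // Same body as the function in the question: one table lookup per byte
    // of val, lowest byte first. Note that & binds tighter than ^ in C#,
    // so each index is (byte of val) ^ (crc & 0xff).
    public static uint CRC32(uint val)
    {
        uint crc = 0xffffffff;
        crc = (crc >> 8) ^ crcTable[(val & 0x000000ff) ^ crc & 0xff];
        crc = (crc >> 8) ^ crcTable[((val & 0x0000ff00) >> 8) ^ crc & 0xff];
        crc = (crc >> 8) ^ crcTable[((val & 0x00ff0000) >> 16) ^ crc & 0xff];
        crc = (crc >> 8) ^ crcTable[(val >> 24) ^ crc & 0xff];
        // flip bits
        return crc ^ 0xffffffff;
    }
}
```

Every index expression is at most 0xff, so the range check can never fail for a 256-entry table, but the JIT still emits it because it cannot prove the table length from the index alone.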
I suspect you are using "Go to disassembly" while debugging the release build to get the assembly code.
After going to Tools -> Options, Debugging, General, and disabling "Suppress JIT optimization on module load" I got an x64 assembly listing without error checking.
It seems that by default, even in a Release build, the code is not optimized if the debugger is attached. Keep that in mind when benchmarking your code.
PS: Benchmarking shows x64 slightly faster than x86, 4.3 vs 4.8 seconds for 1 billion function calls.
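For anyone who wants to repeat a timing like this, here is a rough sketch of a Stopwatch-based harness. It is not the code used for the numbers above; the names BenchmarkSketch, Measure and WarnIfDebugging are hypothetical. It takes the function as a delegate, which adds call overhead of its own, so absolute numbers will differ from a direct call.

```csharp
using System;
using System.Diagnostics;

public static class BenchmarkSketch
{
    // Times `iterations` calls of `function` and returns the elapsed time.
    // The results are XOR-ed into `sink` so the calls cannot be optimized away.
    public static TimeSpan Measure(Func<uint, uint> function, uint iterations, out uint sink)
    {
        function(0); // warm-up call so JIT compilation is not part of the timing

        uint accumulator = 0;
        Stopwatch watch = Stopwatch.StartNew();
        for (uint i = 0; i < iterations; i++)
        {
            accumulator ^= function(i);
        }
        watch.Stop();

        sink = accumulator;
        return watch.Elapsed;
    }

    // Prints a warning when a debugger is attached, because the JIT may then
    // suppress optimizations even for a Release build.
    public static bool WarnIfDebugging()
    {
        if (Debugger.IsAttached)
        {
            Console.WriteLine("Warning: debugger attached; timings may reflect unoptimized code.");
            return true;
        }
        return false;
    }
}
```

A call might look like `BenchmarkSketch.Measure(CRC32, 1000000000u, out sink)`, started with Ctrl+F5 (run without debugging) so that the optimized code is what gets measured.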
Edit: Breakpoints still worked for me; otherwise I wouldn't have been able to see the disassembly after unchecking the option. Your example line from above looks like this (VS 2012 RC):
crc = (crc >> 8) ^ crcTable[((val & 0x0000ff00) >> 8) ^ crc & 0xff];
00000030 mov r11d,eax
00000033 shr r11d,8
00000037 mov ecx,edx
00000039 and ecx,0FF00h
0000003f shr ecx,8
00000042 movzx eax,al
00000045 xor ecx,eax
00000047 mov eax,ecx
00000049 cmp rax,r9
0000004c jae 00000000000000A4
0000004e mov eax,dword ptr [r8+rax*4+10h]
00000053 xor r11d,eax
Looking at the code, this is related to the error checking for accessing crcTable. It's doing the bounds check before it starts digging into the array.
In the 32-bit code you see this:
0000008e mov ecx,dword ptr ds:[03387F38h]
....
0000009e xor eax,dword ptr [ecx+edx*4+8]
In this case it's loading the base address of the array from 03387F38h and then using standard pointer arithmetic to access the correct entry.
In the 64-bit code this seems to be more complicated.
000000b0 mov rdx,124DEE68h
000000ba mov rdx,qword ptr [rdx]
This loads the address of the array into the rdx register.
000000da mov qword ptr [rsp+40h],rdx
...
00000105 mov rax,qword ptr [rsp+40h]
0000010a mov rcx,qword ptr [rsp+48h]
0000010f mov ecx,dword ptr [rax+rcx*4+10h]
This moves the address onto the stack, then later on it moves it into the rax register and does the same pointer work to access the array.
Pretty much everything between 000000da and 00000100/00000105 seems to be validation code. The rest of the code maps pretty well between the 64-bit and the 32-bit code, with some less aggressive register utilization in the 64-bit code.
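To make the validation concrete, here is a rough C# rendering of what the cmp/jae (or cmp/jb) pair and the call in the listings amount to. This is only a sketch of the check's logic, not the runtime's actual implementation; BoundsCheckSketch and CheckedLoad are hypothetical names, and the assumption is that the call target is the runtime helper that throws on an out-of-range index.

```csharp
using System;

public static class BoundsCheckSketch
{
    // Spells out the range check that precedes an array load.
    public static uint CheckedLoad(uint[] table, uint index)
    {
        // cmp index, length + conditional jump: one unsigned compare
        // covers both "negative" and "too large" indices.
        if (index >= (uint)table.Length)
        {
            // Corresponds to the `call` in the listings (assumed to be the
            // helper that raises the exception).
            throw new IndexOutOfRangeException();
        }

        // Corresponds to the load from [base + index*4 + header offset].
        // C# performs its own check on this line too; the explicit test
        // above just makes it visible.
        return table[index];
    }
}
```

The work is the same in the 32-bit and 64-bit listings; the unoptimized 64-bit version simply spills the array reference and the index to the stack around the compare instead of keeping them in registers.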