I'm testing what sort of speedup I can get from using SIMD instructions with RyuJIT and I'm seeing some disassembly instructions that I don't expect. I'm basing the code on this blog post from the RyuJIT team's Kevin Frei, and a related post here. Here's the function:
static void AddPointwiseSimd(float[] a, float[] b) {
    int simdLength = Vector<float>.Count;
    int i = 0;
    for (i = 0; i < a.Length - simdLength; i += simdLength) {
        Vector<float> va = new Vector<float>(a, i);
        Vector<float> vb = new Vector<float>(b, i);
        va += vb;
        va.CopyTo(a, i);
    }
}
The section of disassembly I'm querying is the part that copies the array values into the Vector<float>. Most of the disassembly is similar to that in Kevin's and Sasha's posts, but I've highlighted some extra instructions (along with my confused annotations) that don't appear in their disassemblies:
;// Vector<float> va = new Vector<float>(a, i);
cmp eax,r8d ; <-- Unexpected - Compare a.Length to i?
jae 00007FFB17DB6D5F ; <-- Unexpected - Jump to range check failure
lea r10d,[rax+3]
cmp r10d,r8d
jae 00007FFB17DB6D5F
mov r11,rcx ; <-- Unexpected - Extra register copy?
movups xmm0,xmmword ptr [r11+rax*4+10h ]
;// Vector<float> vb = new Vector<float>(b, i);
cmp eax,r9d ; <-- Unexpected - Compare b.Length to i?
jae 00007FFB17DB6D5F ; <-- Unexpected - Jump to range check failure
cmp r10d,r9d
jae 00007FFB17DB6D5F
movups xmm1,xmmword ptr [rdx+rax*4+10h]
Note the loop range check is as expected:
;// for (i = 0; i < a.Length - simdLength; i += simdLength) {
add eax,4
cmp r9d,eax
jg loop
so I don't know why there are extra comparisons to eax. Can anyone explain why I'm seeing these extra instructions, and whether it's possible to get rid of them?
In case it's related to the project settings, I've got a very similar project that shows the same issue here on GitHub (see FloatSimdProcessor.HwAcceleratedSumInPlace() or UShortSimdProcessor.HwAcceleratedSumInPlaceUnchecked()).
I'll annotate the code generation that I see. For a processor that supports AVX2, like Haswell, it can move 8 floats at a time:
00007FFA1ECD4E20 push rsi
00007FFA1ECD4E21 sub rsp,20h
00007FFA1ECD4E25 xor eax,eax ; i = 0
00007FFA1ECD4E27 mov r8d,dword ptr [rcx+8] ; a.Length
00007FFA1ECD4E2B lea r9d,[r8-8] ; a.Length - simdLength
00007FFA1ECD4E2F test r9d,r9d ; if (i >= a.Length - simdLength)
00007FFA1ECD4E32 jle 00007FFA1ECD4E75 ; then skip loop
00007FFA1ECD4E34 mov r10d,dword ptr [rdx+8] ; b.Length
00007FFA1ECD4E38 cmp eax,r8d ; if (i >= a.Length)
00007FFA1ECD4E3B jae 00007FFA1ECD4E7B ; then OutOfRangeException
00007FFA1ECD4E3D lea r11d,[rax+7] ; i+7
00007FFA1ECD4E41 cmp r11d,r8d ; if (i+7 >= a.Length)
00007FFA1ECD4E44 jae 00007FFA1ECD4E7B ; then OutOfRangeException
00007FFA1ECD4E46 mov rsi,rcx ; move a[i..i+7]
00007FFA1ECD4E49 vmovupd ymm0,ymmword ptr [rsi+rax*4+10h]
00007FFA1ECD4E50 cmp eax,r10d ; same as above
00007FFA1ECD4E53 jae 00007FFA1ECD4E7B ; but for b
00007FFA1ECD4E55 cmp r11d,r10d
00007FFA1ECD4E58 jae 00007FFA1ECD4E7B
00007FFA1ECD4E5A vmovupd ymm1,ymmword ptr [rdx+rax*4+10h]
00007FFA1ECD4E61 vaddps ymm0,ymm0,ymm1 ; a[i..] + b[i...]
00007FFA1ECD4E66 vmovupd ymmword ptr [rsi+rax*4+10h],ymm0
00007FFA1ECD4E6D add eax,8 ; i += 8
00007FFA1ECD4E70 cmp r9d,eax ; if (i < a.Length - simdLength)
00007FFA1ECD4E73 jg 00007FFA1ECD4E38 ; then loop
00007FFA1ECD4E75 add rsp,20h
00007FFA1ECD4E79 pop rsi
00007FFA1ECD4E7A ret
So the eax compares are those "pesky bound checks" that the blog post talks about. The blog post shows an optimized version that is not actually implemented (yet); the real code right now checks both the first and the last index of the 8 floats that are moved at the same time. The blog post's comment "Hopefully, we'll get our bounds-check elimination work strengthened enough" is still an uncompleted task :)
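In C# terms, the pair of checks in front of each vector load boils down to something like the sketch below. This is only an illustration of the shape of the checks for an AVX2 machine where Vector<float>.Count is 8; the exact exception thrown by the shared failure target at 00007FFA1ECD4E7B isn't important here:

    // Sketch of the range checks emitted for new Vector<float>(a, i):
    static void RangeCheckForVectorLoad(float[] a, int i) {
        if ((uint)i >= (uint)a.Length)          // cmp eax,r8d / jae  -> first element in range?
            throw new IndexOutOfRangeException();
        if ((uint)(i + 7) >= (uint)a.Length)    // lea r11d,[rax+7] / cmp r11d,r8d / jae -> last element in range?
            throw new IndexOutOfRangeException();
        // only then the unchecked 32-byte load: vmovupd ymm0,ymmword ptr [rsi+rax*4+10h]
    }

The same two checks are repeated for b before its load, which is exactly the pattern you highlighted.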
The mov rsi,rcx instruction is present in the blog post as well and appears to be a limitation in the register allocator, probably influenced by RCX being an important register (it normally stores the this pointer). Not important enough to do the work to get it optimized away, I'd assume; register-to-register moves take 0 cycles since they only affect register renaming.
Note how the difference between SSE2 and AVX2 is ugly: while the code moves and adds 8 floats at a time, it only actually uses 4 of them. Vector<float>.Count is 4 regardless of the processor flavor, leaving 2x perf on the table. Hard to hide the implementation detail, I guess.
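If you want to see what your particular machine and framework combination reports, a couple of lines against the System.Numerics types you're already using will tell you; whatever values your runtime prints are what the loop stride and the generated code are based on:

    using System;
    using System.Numerics;

    class VectorInfo {
        static void Main() {
            // Does RyuJIT replace Vector<T> operations with SIMD instructions,
            // and how many float lanes does Vector<float> have on this runtime?
            Console.WriteLine("IsHardwareAccelerated = " + Vector.IsHardwareAccelerated);
            Console.WriteLine("Vector<float>.Count   = " + Vector<float>.Count);
        }
    }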