I'm trying to learn about vectorization by studying simple C code compiled with gcc at -O3. More specifically, I want to see how well compilers vectorize. It is a personal journey towards being able to verify gcc -O3 performance on more complex computations. I understand the conventional wisdom is that compilers are better than people, but I never take such wisdom for granted.
In my first simple test, though, I'm finding some of the choices gcc makes quite strange and, quite honestly, grossly negligent in terms of optimization. I'm willing to assume the compiler is being purposeful and knows something about the CPU (an Intel i5-2557M in this case) that I do not, but I need some confirmation from knowledgeable people.
My simple test code (segment) is:
int i;
float a[100];
for (i=0;i<100;i++) a[i]= (float) i*i;
The resulting assembly code (segment) that corresponds to the for-loop is as follows:
.L6: ; loop starts here
movdqa xmm0, xmm1 ; copy packed integers in xmm1 to xmm0
.L3:
movdqa xmm1, xmm0 ; wait, what!? WHY!? this is redundant.
cvtdq2ps xmm0, xmm0 ; convert integers to float
add rax, 16 ; increment memory pointer for next iteration
mulps xmm0, xmm0 ; pack square all integers in xmm0
paddd xmm1, xmm2 ; pack increment all integers by 4
movaps XMMWORD PTR [rax-16], xmm0 ; store result
cmp rax, rdx ; test loop termination
jne .L6
I understand all the steps, and computationally, all of it makes sense. What I don't understand, though, is gcc choosing to incorporate in the iterative loop a step to load xmm1 with xmm0 right after xmm0 was loaded with xmm1. i.e.
.L6:
movdqa xmm0, xmm1 ; loop starts here
.L3:
movdqa xmm1, xmm0 ; grrr!
This alone makes me question the sanity of the optimizer. Obviously, the extra MOVDQA does not disturb data, but at face value it seems grossly negligent on the part of gcc.
Earlier in the assembly code (not shown), xmm0 and xmm2 are initialized to some value meaningful for vectorization, so obviously, at the onset of the loop, the code has to skip the first MOVDQA. But why doesn't gcc simply rearrange it as shown below?
.L3:
movdqa xmm1, xmm0 ; initialize xmm1 PRIOR to loop
.L6:
movdqa xmm0, xmm1 ; loop starts here
Or even better, simply initialize xmm1 instead of xmm0 and dump the MOVDQA xmm1, xmm0 step altogether!
I am prepared to believe that the CPU is smart enough to skip the redundant step or something like that, but how can I trust gcc to fully optimize complex code if it can't even get this simple code right? Or can someone provide a sound explanation that would give me faith that gcc -O3 is good stuff?
I'm not 100% sure, but it looks like your loop destroys xmm0 by converting it to float, so you have to keep the integer value in xmm1 and then copy it over to another register (in this case xmm0).
Whilst compilers are known to sometimes issue unnecessary instructions, I can't really see how this is the case in this instance.
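To make that data flow easier to follow, here is a rough sketch of the same loop written by hand with SSE2 intrinsics. It is only an illustration of the register roles, not what gcc generates: the function name fill_squares_sse2 is made up for this example, it assumes an x86 target with SSE2 and a length that is a multiple of 4, and it uses an unaligned store so it makes no assumption about the alignment of the output array (the compiler's listing uses the aligned movaps).

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: fills out[0..n) with (float)i * i, four lanes at a time.
   Assumes n is a non-negative multiple of 4. */
void fill_squares_sse2(float *out, int n)
{
    assert(n >= 0 && n % 4 == 0);

    __m128i idx = _mm_setr_epi32(0, 1, 2, 3);  /* integer indices (the role xmm1 plays) */
    const __m128i step = _mm_set1_epi32(4);    /* per-iteration increment (the role of xmm2) */

    for (int i = 0; i < n; i += 4) {
        __m128 f = _mm_cvtepi32_ps(idx);  /* cvtdq2ps: float copy of the indices */
        f = _mm_mul_ps(f, f);             /* mulps: square each lane */
        _mm_storeu_ps(out + i, f);        /* store four results (unaligned store) */
        idx = _mm_add_epi32(idx, step);   /* paddd: the integer copy must still be intact here */
    }
}
```

The point of the sketch is that the integer vector and its float conversion are two separate values, so something has to hold the integers while the float copy is being squared and stored.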
If you want xmm0 (or xmm1) to remain integer, then don't cast the first i to float. Perhaps what you wanted to do is:
for (i=0;i<100;i++)
a[i]= (float)(i*i);
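The difference between the two spellings comes down to precedence: a cast binds tighter than `*`, so `(float) i*i` means `((float)i) * i` and the multiply is done in floating point, whereas `(float)(i*i)` does an integer multiply and converts once. A small sketch to compare them (the function names are invented for this example, and the range check is my own guard against signed overflow in `i*i`):

```c
#include <assert.h>

/* Cast applies to the first i only: ((float)i) * i, a float multiply. */
float square_cast_first(int i) { return (float) i * i; }

/* Integer multiply first, then a single int-to-float conversion. */
float square_cast_last(int i) { return (float)(i * i); }

/* Count the indices in [0, n) where the two forms give different floats.
   n is limited so that i * i cannot overflow a 32-bit int. */
int count_mismatches(int n)
{
    assert(n >= 0 && n <= 46341);

    int mismatches = 0;
    for (int i = 0; i < n; i++) {
        if (square_cast_first(i) != square_cast_last(i))
            mismatches++;
    }
    return mismatches;
}
```

For the loop in the question, `count_mismatches(100)` should come out as zero: every product is at most 9801, which a float represents exactly, so the two forms differ only in the instructions the compiler is allowed to pick, not in the stored values.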
But on the other hand, gcc 4.9.2 doesn't seem to do this:
g++ -S -O3 floop.cpp
.L2:
cvtdq2ps %xmm1, %xmm0
mulps %xmm0, %xmm0
addq $16, %rax
paddd %xmm2, %xmm1
movaps %xmm0, -16(%rax)
cmpq %rbp, %rax
jne .L2
Nor does clang (3.7.0 from about 3 weeks ago)
clang++ -S -O3 floop.cpp
movdqa .LCPI0_0(%rip), %xmm0 # xmm0 = [0,1,2,3]
xorl %eax, %eax
.align 16, 0x90
.LBB0_1: # %vector.body
# =>This Inner Loop Header: Depth=1
movd %eax, %xmm1
pshufd $0, %xmm1, %xmm1 # xmm1 = xmm1[0,0,0,0]
paddd %xmm0, %xmm1
cvtdq2ps %xmm1, %xmm1
mulps %xmm1, %xmm1
movaps %xmm1, (%rsp,%rax,4)
addq $4, %rax
cmpq $100, %rax
jne .LBB0_1
Code that I have compiled:
extern int printf(const char *, ...);
int main()
{
int i;
float a[100];
for (i=0;i<100;i++)
a[i]= (float) i*i;
for (i=0; i < 100; i++)
printf("%f\n", a[i]);
}
(I added the printf to avoid the compiler getting rid of ALL the code)