I'm using the GCC SIMD vector extension for a project. Everything works quite well except for casts: they simply reset all the components of a vector.
The manual states:
It is possible to cast from one vector type to another, provided they are of the same size (in fact, you can also cast vectors to and from other datatypes of the same size).
Here's a simple example:
#include <stdio.h>

typedef int int4 __attribute__ (( vector_size( sizeof( int ) * 4 ) ));
typedef float float4 __attribute__ (( vector_size( sizeof( float ) * 4 ) ));

int main()
{
    int4 i = { 1 , 2 , 3 , 4 };
    float4 f = { 0.1 , 0.2 , 0.3 , 0.4 };
    printf( "%i %i %i %i\n" , i[0] , i[1] , i[2] , i[3] );
    printf( "%f %f %f %f\n" , f[0] , f[1] , f[2] , f[3] );
    f = ( float4 )i;
    printf( "%f %f %f %f\n" , f[0] , f[1] , f[2] , f[3] );
}
Compiling with gcc cast.c -O3 -o cast
and running on my machine I get:
1 2 3 4
0.100000 0.200000 0.300000 0.400000
0.000000 0.000000 0.000000 0.000000 <-- no no no
I'm no assembler guru, but I just see some byte movements here:
[...]
400454: f2 0f 10 1d 1c 02 00    movsd 0x21c(%rip),%xmm3
40045b: 00
40045c: bf 49 06 40 00          mov $0x400649,%edi
400461: f2 0f 10 15 17 02 00    movsd 0x217(%rip),%xmm2
400468: 00
400469: b8 04 00 00 00          mov $0x4,%eax
40046e: f2 0f 10 0d 12 02 00    movsd 0x212(%rip),%xmm1
400475: 00
400476: f2 0f 10 05 12 02 00    movsd 0x212(%rip),%xmm0
40047d: 00
40047e: 48 83 c4 08             add $0x8,%rsp
400482: e9 59 ff ff ff          jmpq 4003e0
I suspect the vector equivalent of the scalar:
*( int * )&float_value = int_value;
How can you explain this behavior?
That's what vector casts are defined to do (anything else would be completely bonkers, and would make standard vector programming idioms very painful to write). If you want to actually get a conversion, you'll probably want to use an intrinsic of some sort, like _mm_cvtepi32_ps (this breaks the nice architectural independence of your vector code, of course, which is also annoying; a common approach is to use a translation header that defines a portable set of "intrinsics").
Why is this useful? A variety of reasons, but here's the biggest:
In vector code, you almost never want to branch. Instead, if you need to do something conditionally, you evaluate both sides of the condition, and use a mask to select the appropriate result lane by lane. These mask vectors "naturally" have integer type, whereas your data vectors are often floating-point; you want to combine the two using logical operations. This extremely common idiom is most natural if vector casts simply re-interpret the bits.
Granted, it's possible to work around this case, or any of a bag of other common vector idioms, but the "vector is a bag of bits" view is extremely common, and reflects the way most vector programmers think.
As a matter of fact, not a single vector instruction is generated in your case, and no typecast is performed at run time. It is all done at compile time because of the -O3 switch. The four MOVSD instructions are actually loading the preconverted arguments to printf. Indeed, according to the SysV AMD64 ABI, floating-point arguments are passed in the XMM registers. The section that you have disassembled is (assembly code obtained by compiling with -S):
movsd .LC6(%rip), %xmm3
movl $.LC5, %edi
movsd .LC7(%rip), %xmm2
movl $4, %eax
movsd .LC8(%rip), %xmm1
movsd .LC9(%rip), %xmm0
addq $8, %rsp
.cfi_def_cfa_offset 8
jmp printf
.cfi_endproc
.LC5 labels the format string:
.LC5:
.string "%f %f %f %f\n"
The pointer to the format string is of class INTEGER and thus is passed in the RDI register (since it lies somewhere in the first 4 GiB of the VA space, some code bytes are saved by issuing a 32-bit move to the lower part of RDI). Register RAX (EAX is used to save on code bytes) is loaded with the number of arguments passed in the XMM registers (again according to the SysV AMD64 ABI for calls to functions with a variable number of arguments). All four MOVSD (MOVe Scalar Double-precision) instructions move the corresponding arguments into the XMM registers. .LC9, for example, labels two doublewords:
.align 8
.LC9:
.long 0
.long 916455424
Those two form the 64-bit quadword 0x36A0000000000000, which happens to be 2^-149 in 64-bit IEEE 754 representation. As a denormalised 32-bit IEEE 754 value it looks like 0x00000001, so indeed no conversion of the integer 1 took place (but since printf expects double arguments, it is still preconverted to double precision). The second argument is:
.align 8
.LC8:
.long 0
.long 917504000
This is 0x36B0000000000000, or 2^-148 in 64-bit IEEE 754, and 0x00000002 in denormalised 32-bit IEEE 754. The same goes for the other two arguments.
Note that the above code doesn't use a single stack variable; it operates on precomputed constants only. This results from using a very high optimisation level (-O3). The cast is actually performed at run time if you compile with a lower optimisation level (-O2 or lower). The following code is then emitted to perform the typecast:
movaps -16(%rbp), %xmm0
movaps %xmm0, -32(%rbp)
This just moves the four integer values into the corresponding slots of the floating point vector, hence no conversion whatsoever. Then, for each element, some SSE mumbo-jumbo is performed in order to convert it from single precision to double precision (as expected by printf):
movss -20(%rbp), %xmm0
unpcklps %xmm0, %xmm0
cvtps2pd %xmm0, %xmm3
(why it doesn't just use CVTSS2SD is beyond my understanding of the SSE instruction set)