
How to cast SIMD int vectors to float in GCC?

I'm using the GCC SIMD vector extension for a project. Everything works quite well except casts: they seem to reset all the components of a vector.

The manual states:

It is possible to cast from one vector type to another, provided they are of the same size (in fact, you can also cast vectors to and from other datatypes of the same size).

Here's a simple example:

#include <stdio.h>

typedef int int4 __attribute__ (( vector_size( sizeof( int ) * 4 ) ));
typedef float float4 __attribute__ (( vector_size( sizeof( float ) * 4 ) ));

int main()
{
    int4 i = { 1 , 2 , 3 , 4 };
    float4 f = { 0.1 , 0.2 , 0.3 , 0.4 };

    printf( "%i %i %i %i\n" , i[0] , i[1] , i[2] , i[3] );
    printf( "%f %f %f %f\n" , f[0] , f[1] , f[2] , f[3] );

    f = ( float4 )i;

    printf( "%f %f %f %f\n" , f[0] , f[1] , f[2] , f[3] );
}

Compiling with gcc cast.c -O3 -o cast and running on my machine I get:

1 2 3 4
0.100000 0.200000 0.300000 0.400000
0.000000 0.000000 0.000000 0.000000 <-- no no no

I'm no assembly guru, but all I see here are some byte movements:

[...]
400454:       f2 0f 10 1d 1c 02 00    movsd  0x21c(%rip),%xmm3
40045b:       00 
40045c:       bf 49 06 40 00          mov    $0x400649,%edi
400461:       f2 0f 10 15 17 02 00    movsd  0x217(%rip),%xmm2
400468:       00 
400469:       b8 04 00 00 00          mov    $0x4,%eax
40046e:       f2 0f 10 0d 12 02 00    movsd  0x212(%rip),%xmm1
400475:       00 
400476:       f2 0f 10 05 12 02 00    movsd  0x212(%rip),%xmm0
40047d:       00 
40047e:       48 83 c4 08             add    $0x8,%rsp
400482:       e9 59 ff ff ff          jmpq   4003e0 

I suspect this is the vector equivalent of the scalar:

*( int * )&float_value = int_value;

How can you explain this behavior?

cYrus asked Sep 11 '12 17:09


2 Answers

That's what vector casts are defined to do (anything else would be completely bonkers, and would make standard vector programming idioms very painful to write). If you want to actually get a conversion, you'll probably want to use an intrinsic of some sort, like _mm_cvtepi32_ps (this breaks the nice architectural independence of your vector code, of course, which is also annoying; a common approach is to use a translation header that defines a portable set of "intrinsics").
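As a rough sketch of what such a conversion could look like with the question's own types: the helper name int4_to_float4 is hypothetical (it is not something GCC provides), and the code assumes a GCC-compatible compiler. On x86 with SSE2 it uses _mm_cvtepi32_ps; elsewhere it falls back to converting lane by lane.

```c
#include <stddef.h>

#if defined( __SSE2__ )
#include <emmintrin.h>   /* _mm_cvtepi32_ps, __m128i */
#endif

typedef int   int4   __attribute__ (( vector_size( sizeof( int ) * 4 ) ));
typedef float float4 __attribute__ (( vector_size( sizeof( float ) * 4 ) ));

/* Hypothetical helper (not part of GCC): converts each lane by value,
   so { 1 , 2 , 3 , 4 } becomes { 1.0f , 2.0f , 3.0f , 4.0f }. */
static float4 int4_to_float4( int4 v )
{
#if defined( __SSE2__ )
    /* The casts to and from the intrinsic types only reinterpret bits;
       _mm_cvtepi32_ps performs the actual numeric conversion. */
    return ( float4 )_mm_cvtepi32_ps( ( __m128i )v );
#else
    /* Portable fallback: ordinary scalar conversions, one lane at a time. */
    float4 r = { 0 };
    for ( size_t k = 0 ; k < 4 ; k++ )
        r[k] = ( float )v[k];
    return r;
#endif
}
```

With this in place, f = int4_to_float4( i ); in the question's program should give the converted values rather than the reinterpreted bits. Newer compilers (Clang, and GCC from version 9 as far as I know) also offer __builtin_convertvector for the same purpose.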

Why is this useful? A variety of reasons, but here's the biggest:

In vector code, you almost never want to branch. Instead, if you need to do something conditionally, you evaluate both sides of the condition, and use a mask to select the appropriate result lane by lane. These mask vectors "naturally" have integer type, whereas your data vectors are often floating-point; you want to combine the two using logical operations. This extremely common idiom is most natural if vector casts simply re-interpret the bits.

Granted, it's possible to work around this case, or any of a bag of other common vector idioms, but the "vector is a bag of bits" view is extremely common, and reflects the way most vector programmers think.
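To make the mask idiom concrete, here is a minimal sketch using the question's typedefs. The helper names select4 and min4 are made up for illustration, and the code assumes a GCC-compatible compiler in which comparing two float vectors yields an integer vector holding -1 (all bits set) in lanes where the comparison is true and 0 elsewhere.

```c
typedef int   int4   __attribute__ (( vector_size( sizeof( int ) * 4 ) ));
typedef float float4 __attribute__ (( vector_size( sizeof( float ) * 4 ) ));

/* Hypothetical helper: per lane, returns a where mask is all ones
   and b where mask is all zeros. The casts only reinterpret bits,
   which is exactly what makes the logical operations work. */
static float4 select4( int4 mask , float4 a , float4 b )
{
    return ( float4 )( ( mask & ( int4 )a ) | ( ~mask & ( int4 )b ) );
}

/* Branch-free lane-wise minimum: evaluate both candidates and let the
   comparison mask pick the result for each lane. */
static float4 min4( float4 a , float4 b )
{
    int4 mask = a < b;
    return select4( mask , a , b );
}
```

If the casts converted values instead of reinterpreting bits, select4 would need a separate "bitcast" operation for every operand, which is the pain the answer refers to.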

Stephen Canon answered Sep 23 '22 01:09


As a matter of fact, not a single vector instruction is generated in your case, and no typecast is performed at runtime. Everything is done at compile time because of the -O3 switch. The four MOVSD instructions are actually loading the preconverted arguments to printf. Indeed, according to the SysV AMD64 ABI, floating-point arguments are passed in the XMM registers. The section that you have disassembled is (assembly code obtained by compiling with -S):

    movsd   .LC6(%rip), %xmm3
    movl    $.LC5, %edi
    movsd   .LC7(%rip), %xmm2
    movl    $4, %eax
    movsd   .LC8(%rip), %xmm1
    movsd   .LC9(%rip), %xmm0
    addq    $8, %rsp
    .cfi_def_cfa_offset 8
    jmp     printf
    .cfi_endproc

.LC5 labels the format string:

.LC5:
    .string "%f %f %f %f\n"

The pointer to the format string is of class INTEGER and is thus passed in the RDI register (since it lies somewhere in the first 4 GiB of the VA space, some code bytes are saved by issuing a 32-bit move to the lower part of RDI). Register RAX (EAX is used to save code bytes) is loaded with the number of arguments passed in the XMM registers (again according to the SysV AMD64 ABI for calls to functions with a variable number of arguments). All four MOVSD (MOVe Scalar Double-precision) instructions load the corresponding arguments into the XMM registers. .LC9, for example, labels two doublewords:

    .align 8
.LC9:
    .long   0
    .long   916455424

Those two form the 64-bit quadword 0x36A0000000000000, which is 2^-149 in 64-bit IEEE 754 representation. As a denormalised 32-bit IEEE 754 value it is 0x00000001, so the integer 1 was indeed not converted (but since printf expects double arguments, the value is still preconverted to double precision). The second argument is:

    .align 8
.LC8:
    .long   0
    .long   917504000

This is 0x36B0000000000000, or 2^-148 in 64-bit IEEE 754, and 0x00000002 in denormalised 32-bit IEEE 754. The same goes for the other two arguments.
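The constants above can be checked with a small sketch. The two helper names are hypothetical; they use memcpy to look at raw bit patterns, and the code assumes the usual IEEE 754 float and double formats with denormals left enabled (the default on x86-64).

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reinterprets a 32-bit pattern as a float
   without converting it (the scalar analogue of the vector cast). */
static float float_from_bits( uint32_t bits )
{
    float f;
    memcpy( &f , &bits , sizeof f );
    return f;
}

/* Hypothetical helper: returns the raw 64-bit pattern of a double. */
static uint64_t bits_from_double( double d )
{
    uint64_t bits;
    memcpy( &bits , &d , sizeof bits );
    return bits;
}
```

Under those assumptions, bits_from_double( float_from_bits( 1 ) ) should give 0x36A0000000000000, whose upper 32 bits are the 916455424 seen in .LC9. It also suggests why the question's output looks like a reset: the value is not zero, just a denormal of roughly 1.4e-45, far too small for %f to show. Printing it with %g or %a instead should make it visible.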

Note that the above code doesn't use a single stack variable; it operates on precomputed constants only. This results from using a very high optimisation level (-O3). The cast is actually carried out at runtime if you compile with a lower optimisation level (-O2 or lower). The following code is then emitted to perform the typecast:

    movaps  -16(%rbp), %xmm0
    movaps  %xmm0, -32(%rbp)

This just moves the four integer values into the corresponding slots of the floating point vector, hence no conversion whatsoever. Then for each element some SSE mumbo-jumbo is performed in order to convert it from single precision to double precision (as expected by printf):

    movss   -20(%rbp), %xmm0
    unpcklps        %xmm0, %xmm0
    cvtps2pd        %xmm0, %xmm3

(why not just use CVTSS2SD is beyond my understanding of the SSE instruction set)
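In C terms, the MOVAPS pair shown earlier amounts to a plain 16-byte copy. The sketch below, with the hypothetical helper name reinterpret_as_float4, illustrates that the vector cast and a memcpy should produce the same bytes; it assumes the question's typedefs and a GCC-compatible compiler.

```c
#include <string.h>

typedef int   int4   __attribute__ (( vector_size( sizeof( int ) * 4 ) ));
typedef float float4 __attribute__ (( vector_size( sizeof( float ) * 4 ) ));

/* Hypothetical helper: copies the 16 bytes of v into a float4 unchanged,
   which is what the cast ( float4 )v is defined to do. */
static float4 reinterpret_as_float4( int4 v )
{
    float4 r;
    memcpy( &r , &v , sizeof r );
    return r;
}
```

Casting the result back with ( int4 ) recovers the original integers, since no information is lost in either direction.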

Hristo Iliev answered Sep 23 '22 01:09