float to double conversion: why so many instructions?

Question

I'm curious if someone can shed some light on this for me. I'm working on some numeric data conversion stuff, and I've got several functions that do data conversions, which I define using two macros:

#define CONV_VIA_CAST(name, dtype, vtype)                               \
    static inline void name(void *data, void *view, size_t len) {       \
        vtype *vptr = (vtype*)view;                                     \
        dtype *dptr = (dtype*)data;                                     \
        for (size_t ii=0; ii < len/sizeof(vtype); ii++) {               \
            *vptr++ = (vtype)*dptr++;                                   \
        }                                                               \
    } 


#define CONV_VIA_FUNC(name, dtype, vtype, via)                          \
    static inline void name(void *data, void *view, size_t len) {       \
        vtype *vptr = (vtype*)view;                                     \
        dtype *dptr = (dtype*)data;                                     \
        for (size_t ii=0; ii < len/sizeof(vtype); ii++) {               \
            *vptr++ = (vtype)via(*dptr++);                              \
        }                                                               \
    }

When I define a float to int conversion:

 CONV_VIA_FUNC(f_to_i, float, int16_t, lrintf);

I get a nice terse little piece of assemble with -O3 on:

   0x0000000000401fb0 <+0>:     shr    %rdx
   0x0000000000401fb3 <+3>:     je     0x401fd3 <f_to_i+35>
   0x0000000000401fb5 <+5>:     xor    %eax,%eax
   0x0000000000401fb7 <+7>:     nopw   0x0(%rax,%rax,1)
   0x0000000000401fc0 <+16>:    cvtss2si (%rdi,%rax,4),%rcx
   0x0000000000401fc6 <+22>:    mov    %cx,(%rsi,%rax,2)
   0x0000000000401fca <+26>:    add    $0x1,%rax
   0x0000000000401fce <+30>:    cmp    %rdx,%rax
   0x0000000000401fd1 <+33>:    jne    0x401fc0 <f_to_i+16>
   0x0000000000401fd3 <+35>:    repz retq

However, when I define a float->double (or double->float) function:

CONV_VIA_CAST(f_to_d, float,   double);

I get this monstrosity:

   0x0000000000402040 <+0>:     mov    %rdx,%r8
   0x0000000000402043 <+3>:     shr    $0x3,%r8
   0x0000000000402047 <+7>:     test   %r8,%r8
   0x000000000040204a <+10>:    je     0x402106 <f_to_d+198>
   0x0000000000402050 <+16>:    shr    $0x5,%rdx
   0x0000000000402054 <+20>:    lea    0x0(,%rdx,4),%r9
   0x000000000040205c <+28>:    test   %r9,%r9
   0x000000000040205f <+31>:    je     0x402108 <f_to_d+200>
   0x0000000000402065 <+37>:    lea    (%rdi,%r8,4),%rax
   0x0000000000402069 <+41>:    cmp    $0xb,%r8
   0x000000000040206d <+45>:    lea    (%rsi,%r8,8),%r10
   0x0000000000402071 <+49>:    seta   %cl
   0x0000000000402074 <+52>:    cmp    %rax,%rsi
   0x0000000000402077 <+55>:    seta   %al
   0x000000000040207a <+58>:    cmp    %r10,%rdi
   0x000000000040207d <+61>:    seta   %r10b
   0x0000000000402081 <+65>:    or     %r10d,%eax
   0x0000000000402084 <+68>:    test   %al,%cl
   0x0000000000402086 <+70>:    je     0x402108 <f_to_d+200>
   0x000000000040208c <+76>:    xorps  %xmm3,%xmm3
   0x000000000040208f <+79>:    xor    %eax,%eax
   0x0000000000402091 <+81>:    xor    %ecx,%ecx
   0x0000000000402093 <+83>:    nopl   0x0(%rax,%rax,1)
   0x0000000000402098 <+88>:    movaps %xmm3,%xmm0
   0x000000000040209b <+91>:    add    $0x1,%rcx
   0x000000000040209f <+95>:    movlps (%rdi,%rax,1),%xmm0
   0x00000000004020a3 <+99>:    movhps 0x8(%rdi,%rax,1),%xmm0
   0x00000000004020a8 <+104>:   movhlps %xmm0,%xmm1
   0x00000000004020ab <+107>:   cvtps2pd %xmm0,%xmm2
   0x00000000004020ae <+110>:   cvtps2pd %xmm1,%xmm0
   0x00000000004020b1 <+113>:   movlpd %xmm2,(%rsi,%rax,2)
   0x00000000004020b6 <+118>:   movhpd %xmm2,0x8(%rsi,%rax,2)
   0x00000000004020bc <+124>:   movlpd %xmm0,0x10(%rsi,%rax,2)
   0x00000000004020c2 <+130>:   movhpd %xmm0,0x18(%rsi,%rax,2)
   0x00000000004020c8 <+136>:   add    $0x10,%rax
   0x00000000004020cc <+140>:   cmp    %rcx,%rdx
   0x00000000004020cf <+143>:   ja     0x402098 <f_to_d+88>
   0x00000000004020d1 <+145>:   cmp    %r9,%r8
   0x00000000004020d4 <+148>:   lea    (%rsi,%r9,8),%rsi
   0x00000000004020d8 <+152>:   lea    (%rdi,%r9,4),%rdi
   0x00000000004020dc <+156>:   je     0x40210d <f_to_d+205>
   0x00000000004020de <+158>:   mov    %r9,%rdx
   0x00000000004020e1 <+161>:   mov    %r9,%rax
   0x00000000004020e4 <+164>:   neg    %rdx
   0x00000000004020e7 <+167>:   lea    (%rsi,%rdx,8),%rcx
   0x00000000004020eb <+171>:   lea    (%rdi,%rdx,4),%rdx
   0x00000000004020ef <+175>:   nop
   0x00000000004020f0 <+176>:   movss  (%rdx,%rax,4),%xmm0
   0x00000000004020f5 <+181>:   cvtps2pd %xmm0,%xmm0
   0x00000000004020f8 <+184>:   movsd  %xmm0,(%rcx,%rax,8)
   0x00000000004020fd <+189>:   add    $0x1,%rax
   0x0000000000402101 <+193>:   cmp    %rax,%r8
   0x0000000000402104 <+196>:   ja     0x4020f0 <f_to_d+176>
   0x0000000000402106 <+198>:   repz retq 
   0x0000000000402108 <+200>:   xor    %r9d,%r9d
   0x000000000040210b <+203>:   jmp    0x4020de <f_to_d+158>
   0x000000000040210d <+205>:   nopl   (%rax)
   0x0000000000402110 <+208>:   retq

Can anyone shed some light on what's going on under the hood here for the float->double conversion? And perhaps how it might be written to get more efficient assembly out? I'm using gcc 4.6.3 if that matters.

Pascal Cuoq · Accepted Answer

What you call “monstrosity” actually looks like automatically vectorized code. Something like 20 years of research have gone into this kind of technique before it started to work well and to be useful in general purpose compilers.

It may not be pretty, but GCC implementors think it will be faster for long arrays. If your arrays aren't actually long, or if you can't bear the idea of the compiled code looking like this, disable that particular optimization. Compiling with -O2 should do it (untried).

harold · Answer

There's several things going on here that I can see quickly (the code is a bit long, the time is a bit late, and I'm not a fan of AT&T syntax).

Firstly, the second loop was vectorized (but badly, see below). That inherently causes some code bloat - it now has to deal with a "tail end" that is shorter than a vector and such.

Second, float to double is a widening conversion. That doesn't matter for scalars, but with vectors that means that you can't just read some data, convert it and write it back - somewhere along the lines you'll end up with double as many bytes and they have to be dealt with. (hence the movhlps %xmm0,%xmm1)

The actual loop only spans from 402098h to 4020cfh, below that is the "tail handling", and above that is a monstrosity that tests whether it has the skip the main loop completely and some things I haven't quite figured out - it would make sense if it was for alignment, but I don't see any test rdi, 15's in there, nor anything obvious that would get rid of an unaligned beginning.

And thirdly, GCC is being lame. This is not unusual. It seems to think that xmm3 is somehow involved, which it isn't, and it seems to have forgotten that vectors can be loaded to an from memory in one piece - then again this could be because the monstrosity at the beginning really didn't test for alignment and this is its defense against unaligned pointers. In any case though, GCC did a bad job here.

float to double conversion: why so many instructions?

Tags:

c

compiler-optimization

assembly

gct

2 Answers

Pascal Cuoq

harold

Recent Activity

Donate For Us

float to double conversion: why so many instructions?

Tags:

c

compiler-optimization

assembly

gct

2 Answers

Pascal Cuoq

harold

Related questions

Recent Activity

Donate For Us