Stark differences in generated assembly for floating-point comparisons < and >=

I'm experimenting with generated assembly and found something interesting. Here are two functions that appear to perform an identical computation. The only difference between them is how the results are summed.

#include <cmath>

double func1(double x, double y)
{
  double result1;
  double result2;

  if (x*x < 0.0) result1 = 0.0;
  else
    result1 = x*x+x+y;

  if (y*y < 0.0) result2 = 0.0;
  else
    result2 = y*y+y+x;

  return (result1 + result2) * 40.0;
}

double func2(double x, double y)
{
  double result = 0.0;

  if (x*x >= 0.0)
    result += x*x+x+y;

  if (y*y >= 0.0)
    result += y*y+y+x;

  return result * 40.0;
}
Yet the assembly generated by x86 clang 3.7 with the -O2 switch on gcc.godbolt.org is very different and unexpected (compiling with gcc results in similar assembly):

.LCPI0_0:
    .quad   4630826316843712512     # double 40
func1(double, double):                             # @func1(double, double)
    movapd  %xmm0, %xmm2
    mulsd   %xmm2, %xmm2
    addsd   %xmm0, %xmm2
    addsd   %xmm1, %xmm2
    movapd  %xmm1, %xmm3
    mulsd   %xmm3, %xmm3
    addsd   %xmm1, %xmm3
    addsd   %xmm0, %xmm3
    addsd   %xmm3, %xmm2
    mulsd   .LCPI0_0(%rip), %xmm2
    movapd  %xmm2, %xmm0
    retq

.LCPI1_0:
    .quad   4630826316843712512     # double 40
func2(double, double):                             # @func2(double, double)
    movapd  %xmm0, %xmm2
    movapd  %xmm2, %xmm4
    mulsd   %xmm4, %xmm4
    xorps   %xmm3, %xmm3
    ucomisd %xmm3, %xmm4
    xorpd   %xmm0, %xmm0
    jb  .LBB1_2
    addsd   %xmm2, %xmm4
    addsd   %xmm1, %xmm4
    xorpd   %xmm0, %xmm0
    addsd   %xmm4, %xmm0
.LBB1_2:
    movapd  %xmm1, %xmm4
    mulsd   %xmm4, %xmm4
    ucomisd %xmm3, %xmm4
    jb  .LBB1_4
    addsd   %xmm1, %xmm4
    addsd   %xmm2, %xmm4
    addsd   %xmm4, %xmm0
.LBB1_4:
    mulsd   .LCPI1_0(%rip), %xmm0
    retq

func1 compiles to branchless assembly with far fewer instructions than func2, so func2 is expected to be much slower than func1.

Can someone explain this behavior?

The reason is that < and >= behave differently depending on whether your double is NaN. Every relational comparison (<, <=, >, >=) in which one of the operands is NaN evaluates to false. The square of a non-NaN double is never negative, and if x is NaN then x*x is NaN as well, so your x*x < 0.0 is false for every possible x and the compiler can safely optimize the branch away. The condition x*x >= 0.0, however, is true for every non-NaN x but false when x is NaN, so the compiler has to leave the conditional jumps in the assembly.
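To make this concrete, here is a minimal sketch that evaluates both conditions for ordinary values, infinities and NaN. The helper names (square_is_negative, square_is_non_negative, check_nan_comparisons) are made up for illustration, and the sketch assumes the code is built without -ffast-math or similar options that let the compiler assume NaN never occurs.

```cpp
#include <cassert>
#include <cmath>
#include <initializer_list>
#include <limits>

// Mirrors the condition in func1: x*x compared with < 0.0.
bool square_is_negative(double x) { return x * x < 0.0; }

// Mirrors the condition in func2: x*x compared with >= 0.0.
bool square_is_non_negative(double x) { return x * x >= 0.0; }

// Hypothetical self-check covering non-NaN inputs and NaN.
void check_nan_comparisons() {
    const double nan = std::numeric_limits<double>::quiet_NaN();
    const double inf = std::numeric_limits<double>::infinity();

    // For every non-NaN input the square is +0 or positive.
    for (double x : {-2.0, -0.0, 0.0, 3.5, inf, -inf}) {
        assert(!square_is_negative(x));
        assert(square_is_non_negative(x));
    }

    // NaN propagates through the multiplication, and both
    // comparisons against it evaluate to false.
    assert(std::isnan(nan * nan));
    assert(!square_is_negative(nan));
    assert(!square_is_non_negative(nan));
}
```

The first condition has the same value for every input, which is what lets the compiler delete the branch. The second condition changes value exactly at NaN, so it cannot be folded to a constant.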

This is what cppreference says about comparing with NaNs involved:

the values of the operands after conversion are compared in the usual mathematical sense (except that positive and negative zeroes compare equal and any comparison involving a NaN value returns zero)
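As a consequence, the two functions are not actually equivalent. The sketch below restates them from the question so that it compiles on its own (the if/else shape of func1 is assumed) and checks one input where they agree and one where they diverge. check_divergence is a hypothetical helper, and the same no -ffast-math assumption applies.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

double func1(double x, double y) {
    double result1;
    double result2;

    if (x * x < 0.0) result1 = 0.0;   // never taken, even for NaN
    else result1 = x * x + x + y;

    if (y * y < 0.0) result2 = 0.0;   // never taken, even for NaN
    else result2 = y * y + y + x;

    return (result1 + result2) * 40.0;
}

double func2(double x, double y) {
    double result = 0.0;

    if (x * x >= 0.0) result += x * x + x + y;  // skipped when x is NaN
    if (y * y >= 0.0) result += y * y + y + x;  // skipped when y is NaN

    return result * 40.0;
}

// Hypothetical self-check: same result for ordinary inputs,
// different results when both inputs are NaN.
void check_divergence() {
    const double nan = std::numeric_limits<double>::quiet_NaN();

    assert(func1(1.0, 2.0) == func2(1.0, 2.0));

    // func1 always computes the sums, so NaN propagates.
    assert(std::isnan(func1(nan, nan)));
    // func2 skips both additions and returns 0.0 * 40.0.
    assert(func2(nan, nan) == 0.0);
}
```

Because func2 must return 0.0 for this input while func1 returns NaN, the compiler is not allowed to turn func2 into the branchless code it emits for func1.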

