I have implemented the same function once in C++ and once in Fortran:
#include <math.h>

void dbl_sqrt_c(double *x, double *y){
    *y = sqrt(*x - 1.0);
    return;
}
pure subroutine my_dbl_sqrt(x, y) bind(c, name="dbl_sqrt_fort")
    USE, INTRINSIC :: ISO_C_BINDING
    implicit none
    real(kind=c_double), intent(in)  :: x
    real(kind=c_double), intent(out) :: y

    y = sqrt(x - 1d0)
end subroutine my_dbl_sqrt
I compared them in the compiler explorer:
Fortran: https://godbolt.org/z/froz4rx97
C++: https://godbolt.org/z/45aex99Yz
And the way I read the assembly, they do basically the same thing, except that C++ checks whether the argument of the sqrt is negative, which Fortran doesn't. I compared their performance using Google's Benchmark library (a sketch of the harness follows the table), and they are pretty evenly matched:
--------------------------------------------------------
Benchmark              Time             CPU   Iterations
--------------------------------------------------------
bm_dbl_c/8          2.07 ns         2.07 ns    335965892
bm_dbl_fort/8       2.06 ns         2.06 ns    338643106
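The harness follows the usual Google Benchmark pattern; a simplified sketch (not the exact code, but the names match the table above):

#include <benchmark/benchmark.h>

// Assumed linkage to the C function above (extern "C" in case the
// benchmark lives in a separate translation unit).
extern "C" void dbl_sqrt_c(double *x, double *y);

// state.range(0) supplies the /8 argument seen in the results table.
static void bm_dbl_c(benchmark::State &state) {
    double x = static_cast<double>(state.range(0));
    double y;
    for (auto _ : state) {
        dbl_sqrt_c(&x, &y);
        benchmark::DoNotOptimize(y);  // keep the result live
    }
}
BENCHMARK(bm_dbl_c)->Arg(8);
BENCHMARK_MAIN();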
Here is the interesting part. If I turn these into integer-based functions:
void int_sqrt_c(int *x, int *y){
    *y = floor(sqrt(*x - 1.0));
    return;
}
and
pure subroutine my_int_root(x, y) bind(c, name="int_sqrt_fort")
    USE, INTRINSIC :: ISO_C_BINDING
    implicit none
    integer(kind=c_int), intent(in)  :: x
    integer(kind=c_int), intent(out) :: y

    y = floor(sqrt(x - 1d0))
end subroutine my_int_root
Then this is where they start to diverge:
--------------------------------------------------------
Benchmark              Time             CPU   Iterations
--------------------------------------------------------
bm_int_c/8          3.05 ns         3.05 ns    229239198
bm_int_fort/8       2.13 ns         2.13 ns    328933185
The Fortran code is not slowed down significantly by this change, but the C++ code slowed down by about 50%. This seems quite large. These are the assemblies:
Fortran: https://godbolt.org/z/axqqrc5E1
C++: https://godbolt.org/z/h7K75oKbn
The Fortran assembly seems pretty straightforward: it just adds conversions between double and int and not much else. The C++ version, however, does a lot more, which I don't fully understand. Why is the C++ assembly so much more complicated? How could I improve the C++ code to achieve matching performance?
TL;DR: You're hobbled by bad defaults and by compatibility with obsolete machines. The bad default is gcc setting errno for floating-point computations (despite this not being required by the C language), and the compatibility target is x86 machines that don't have any better SSE instructions than SSE2. If you want decent code generation, add -fno-math-errno -msse4 to the compiler flags.
Modern processor architectures that contain floating-point hardware typically offer square root calculation as a primitive operation (instruction), which sets an error flag in the floating-point environment if the operand of the square root instruction was out of range (negative). On the other hand, old instruction set architectures may not have had floating-point instructions at all, or no hardware-accelerated square root, so the C language permits an implementation to set errno on an out-of-range argument instead. But errno being a thread-local memory location practically prevents any sane architecture from setting errno directly from the square root instruction. To get decent performance, gcc inlines the square root calculation by calling the hardware instruction (sqrtsd), but to set errno, it separately checks the sign of the argument and calls the library function only in case the argument was negative, so the library function can set errno. Yes, this is crazy, but that in turn is par for the course. You can avoid this brain damage that nobody ever needs or wants by setting -fno-math-errno in the compiler flags.
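A minimal illustration of that errno contract (my own example, compiled without -fno-math-errno):

#include <math.h>
#include <errno.h>
#include <stdio.h>

// Without -fno-math-errno, sqrt of a negative number returns NaN *and*
// sets errno to EDOM; this is the behaviour gcc preserves with its
// sign check and library-call fallback.
int main(void) {
    errno = 0;
    double r = sqrt(-1.0);
    if (errno == EDOM)
        printf("sqrt(-1.0) = %f, errno = EDOM\n", r);
    return 0;
}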
Reasonably recent x86-64 processors have more instructions than were present in the original x86-64 as first developed by AMD (which included only SSE2 vector/floating-point instructions). Among the added instructions are float/integer conversion instructions that allow controlled rounding/truncation, so this doesn't have to be implemented in software. You can get gcc to use these new instructions by specifying a target that supports them, for example with the -msse4 compiler flag. Note that this will cause the generated program to fault if it is run on a target that doesn't support these instructions, so the generated program will be less portable (though it obviously doesn't reduce the portability of the source code).
As @EOF explained, always use -fno-math-errno, and use -march=native if you can, or at least -msse4 if you don't need the binary to run on old machines like first-gen Core 2 or AMD Phenom II. (Preferably also other good stuff like popcnt and SSE4.2, e.g. -march=nehalem -mtune=haswell.)
And preferably -O3 to enable SIMD auto-vectorization when possible, although that's sometimes not possible for floating point, e.g. reductions like dot product or sum of an array, because FP math isn't quite associative.
(-ffast-math will let compilers pretend it is, or you can use #pragma omp simd reduction(+:my_var_name) to give it license to rearrange the order of operations in just one loop. But that's not safe in general, e.g. if your code does stuff like Kahan summation to compensate for FP rounding error, and -ffast-math enables other optimizations like treating denormals as 0.)
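A minimal sketch of that per-loop pragma (my own example with an assumed array-sum loop; compile with -fopenmp-simd):

// License reassociation for just this one reduction loop, instead of
// globally via -ffast-math.
double sum(const double *a, int n) {
    double total = 0.0;
    #pragma omp simd reduction(+:total)
    for (int i = 0; i < n; ++i)
        total += a[i];
    return total;
}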
You can get even better asm out of gfortran / gcc by omitting the floor(). Neither compiler seems to realize that any result of sqrt will be non-negative or NaN, and thus floor(sqrt) = trunc(sqrt).

(Related: if you want to round a double to the nearest int, rather than floor or trunc, use lrint(x) or (int)nearbyint(x), which can inline to an instruction that uses the current FP rounding mode; unless you changed it, that will be round-to-nearest with even as the tiebreak. See this answer for how it compiles for x86-64 and AArch64.)
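For instance, an illustrative sketch of that nearest-int idiom:

#include <math.h>

// Rounds using the current FP rounding mode (round-to-nearest-even by
// default); with -fno-math-errno this can inline to a single instruction.
int nearest_int(double x) {
    return (int)lrint(x);
}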
GCC's back-end doesn't optimize away floor (with either front-end language); -msse4 just makes it cheap enough for the throughput bottleneck of sqrtsd to hide its cost in your C benchmark. Specifically, we get:
# gcc -O2 -fno-math-errno -msse4 with floor() still in the source.
sqrtsd xmm0, xmm0
roundsd xmm0, xmm0, 9 # floor separately, result as a double
cvttsd2si eax, xmm0 # convert (with Truncation) to signed int
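In intrinsics, that roundsd with immediate 9 looks like the following sketch (my own illustration, not something you need to write by hand): 9 = _MM_FROUND_TO_NEG_INF (1, i.e. floor) | _MM_FROUND_NO_EXC (8, suppress the precision exception).

#include <smmintrin.h>  // SSE4.1

// Equivalent of "roundsd xmm0, xmm0, 9": floor without raising the
// inexact exception.
double floor_sse4(double x) {
    __m128d v = _mm_set_sd(x);
    v = _mm_round_sd(v, v, _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC);
    return _mm_cvtsd_f64(v);
}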
Even without SSE4, GFortran chooses to use a flooring conversion to int (which only has to work for the range of int, not for doubles outside that range, which is what GCC's code was doing manually without -msse4):
# GFortran -O2 -msse4 # with floor() # chooses not to use SSE4
sqrtsd xmm0, xmm0
cvttsd2si eax, xmm0 # eax = int trunc(sqrt result)
cvtsi2sd xmm1, eax # convert back to double
ucomisd xmm0, xmm1 # compare, setting CF if truncation rounded *up*
sbb eax, 0 # eax -= CF
Fun fact: this code avoids checking for "unordered" (PF=1) in the ucomisd FLAGS result. CF = 1 for that case, but apparently gfortran doesn't care that it will make the result 0x7FFFFFFF instead of 0x80000000.
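Rendered as C, that flooring-conversion idiom looks roughly like this (my own sketch of what the asm above computes):

// Truncate, then fix up when truncation rounded up (i.e. for negative
// inputs with a fractional part); valid for doubles in int's range.
int floor_to_int(double x) {
    int t = (int)x;        // cvttsd2si: truncate toward zero
    if (x < (double)t)     // ucomisd: did truncation round up?
        t -= 1;            // sbb eax, 0
    return t;
}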
gcc could have done this for C, since the behaviour is undefined if the double->int conversion result doesn't fit in an int. (Fun fact: negative double -> unsigned is also UB for that reason; the modular range-reduction rule only applies to wide integral types -> unsigned.) So this is a gcc missed-optimization bug for C without SSE4.
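To spell out those conversion rules (illustrative values of my own choosing):

#include <limits.h>

void conversion_rules(void) {
    unsigned a = (unsigned)-1;      // OK: integer -> unsigned uses modular
                                    // range reduction, so a == UINT_MAX
    // unsigned b = (unsigned)-1.5; // UB: negative double -> unsigned
    // int c = (int)1e30;           // UB: double -> int, value doesn't fit
    (void)a;
}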
When SSE4 is available, it's better to use roundsd before conversion (in the general case where floor can't just be optimized away). Round-trip conversion to integer and back is 12 cycles of latency on Skylake (maybe an extra 7 vs. just one convert), plus the ucomisd + sbb latency on top, vs. 8 cycles for roundsd. And roundsd is fewer total uops (https://uops.info/). So this is a missed optimization for gfortran -msse4, which continues to use its double->int->double compare / sbb trick for floor-conversion.
Removing floor()

C (and Fortran) FP -> int conversion truncates (rounds towards 0.0) by default. For non-negative values this is equivalent to floor, so the explicit floor is redundant. Since compilers don't realize that and/or don't take advantage of the fact that a sqrt result is non-negative (or NaN), we can remove it ourselves:
#include <math.h>

int int_sqrt_c(int x){
    return sqrt(x - 1.0);
    //return floor(sqrt(x - 1.0));
}
I also simplified your C to take/return an int, instead of via pointers. https://godbolt.org/z/4zWeaej9T
int_sqrt_c(int):
    pxor xmm0, xmm0
    cvtsi2sd xmm0, edi          # int -> double
    subsd xmm0, QWORD PTR .LC0[rip]
    sqrtsd xmm0, xmm0
    cvttsd2si eax, xmm0         # truncating double -> int
    ret
The exact same difference shows up for the Fortran code, https://godbolt.org/z/EhbTjhGen - removing floor removes the useless instructions, leaving just cvttsd2si.
roundsd costs 2 uops and has 8 cycle latency on Skylake (like a chain of two additions; https://uops.info/), so avoiding it takes out a decent fraction of the total latency from x -> retval, and of the front-end throughput cost.
(Your throughput benchmark will bottleneck on the back-end throughput of sqrtsd, e.g. one per 4.5 cycles on Skylake, and out-of-order exec hides the latency, so your current test setup probably won't be able to measure the difference.)
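If you did want to measure latency, one standard trick (a sketch of my own, reusing the question's pointer-based int_sqrt_c) is to feed each result back into the next input so iterations can't overlap:

#include <benchmark/benchmark.h>

extern "C" void int_sqrt_c(int *x, int *y);

// Serialize iterations: each input depends on the previous output, so
// this measures latency rather than sqrtsd throughput. The "+ 8" is
// arbitrary, just keeping the input positive and in range.
static void bm_int_latency(benchmark::State &state) {
    int x = static_cast<int>(state.range(0));
    int y = 0;
    for (auto _ : state) {
        int_sqrt_c(&x, &y);
        x = y + 8;
    }
    benchmark::DoNotOptimize(y);
}
BENCHMARK(bm_int_latency)->Arg(8);
BENCHMARK_MAIN();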