I have implemented the same function once in C++ and once in Fortran:
#include <math.h>

void dbl_sqrt_c(double *x, double *y){
    *y = sqrt(*x - 1.0);
    return;
}
pure subroutine my_dbl_sqrt(x, y) bind(c, name="dbl_sqrt_fort")
    USE, INTRINSIC :: ISO_C_BINDING
    implicit none
    real(kind=c_double), intent(in)  :: x
    real(kind=c_double), intent(out) :: y

    y = sqrt(x - 1d0)
end subroutine my_dbl_sqrt
I compared them in the compiler explorer:
Fortran: https://godbolt.org/z/froz4rx97
C++: https://godbolt.org/z/45aex99Yz
And the way I read the assembly, they do basically the same thing, except that C++ checks whether the argument of the sqrt is negative, which Fortran doesn't. I compared their performance using Google's Benchmark library (a sketch of the harness follows the table), and they are pretty evenly matched:
--------------------------------------------------------
Benchmark              Time             CPU   Iterations
--------------------------------------------------------
bm_dbl_c/8          2.07 ns         2.07 ns    335965892
bm_dbl_fort/8       2.06 ns         2.06 ns    338643106
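The harness follows the usual Google Benchmark pattern; a simplified sketch (not the exact code, but the names match the table above):

#include <benchmark/benchmark.h>

// Assumed linkage to the C function above (extern "C" in case the
// benchmark lives in a separate translation unit).
extern "C" void dbl_sqrt_c(double *x, double *y);

// state.range(0) supplies the /8 argument seen in the results table.
static void bm_dbl_c(benchmark::State &state) {
    double x = static_cast<double>(state.range(0));
    double y;
    for (auto _ : state) {
        dbl_sqrt_c(&x, &y);
        benchmark::DoNotOptimize(y);  // keep the result live
    }
}
BENCHMARK(bm_dbl_c)->Arg(8);
BENCHMARK_MAIN();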
Here is the interesting part. If I turn these into integer-based functions:
void int_sqrt_c(int *x, int *y){
    *y = floor(sqrt(*x - 1.0));
    return;
}
and
pure subroutine my_int_root(x, y) bind(c, name="int_sqrt_fort")
    USE, INTRINSIC :: ISO_C_BINDING
    implicit none
    integer(kind=c_int), intent(in)  :: x
    integer(kind=c_int), intent(out) :: y

    y = floor(sqrt(x - 1d0))
end subroutine my_int_root
Then this is where they start to diverge:
--------------------------------------------------------
Benchmark              Time             CPU   Iterations
--------------------------------------------------------
bm_int_c/8          3.05 ns         3.05 ns    229239198
bm_int_fort/8       2.13 ns         2.13 ns    328933185
The Fortran code is not slowed down significantly by this change, but the C++ code slowed down by about 50%. This seems quite large. These are the assemblies:
Fortran: https://godbolt.org/z/axqqrc5E1
C++: https://godbolt.org/z/h7K75oKbn
The Fortran assembly seems pretty straightforward: it just adds conversions between double and int and not much else. The C++ version, however, does a lot more, which I don't fully understand. Why is the C++ assembly so much more complicated? How could I improve the C++ code to achieve matching performance?
TL;DR: You're hobbled by bad defaults and by compatibility with obsolete machines. The bad default is gcc setting errno for floating-point computations (despite this not being required by the C language), and the compatibility target is x86 machines that don't have any better SSE instructions than SSE2. If you want decent code generation, add -fno-math-errno -msse4 to the compiler flags.
Modern processor architectures that contain floating-point hardware typically offer square root calculation as a primitive operation (instruction), which sets an error flag in the floating-point environment if the operand of the square root instruction was out of range (negative). On the other hand, old instruction set architectures may not have had floating-point instructions at all, or no hardware-accelerated square root, so the C language permits an implementation to set errno on an out-of-range argument instead. But errno being a thread-local memory location practically prevents any sane architecture from setting errno directly from the square root instruction. To get decent performance, gcc inlines the square root calculation by calling the hardware instruction (sqrtsd), but to set errno, it separately checks the sign of the argument and calls the library function only in case the argument was negative, so the library function can set errno. Yes, this is crazy, but that in turn is par for the course. You can avoid this brain damage that nobody ever needs or wants by setting -fno-math-errno in the compiler flags.
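A minimal illustration of that errno contract (my own example, compiled without -fno-math-errno):

#include <math.h>
#include <errno.h>
#include <stdio.h>

// Without -fno-math-errno, sqrt of a negative number returns NaN *and*
// sets errno to EDOM; this is the behaviour gcc preserves with its
// sign check and library-call fallback.
int main(void) {
    errno = 0;
    double r = sqrt(-1.0);
    if (errno == EDOM)
        printf("sqrt(-1.0) = %f, errno = EDOM\n", r);
    return 0;
}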
Reasonably recent x86-64 processors have more instructions than were present in the original x86-64 as first developed by AMD (which included only SSE2 vector/floating-point instructions). Among the added instructions are float/integer conversion instructions that allow controlled rounding/truncation, so this doesn't have to be implemented in software. You can get gcc to use these new instructions by specifying a target that supports them, for example with the -msse4 compiler flag. Note that this will cause the generated program to fault if it is run on a target that doesn't support these instructions, so the generated program will be less portable (though it obviously doesn't reduce the portability of the source code).
As @EOF explained, always use -fno-math-errno, and use -march=native if you can, or at least -msse4 if you don't need the binary to run on old machines like first-gen Core 2 or AMD Phenom II. (Preferably also other good stuff like popcnt and SSE4.2, e.g. -march=nehalem -mtune=haswell.)
And preferably -O3 to enable SIMD auto-vectorization when possible, although that's sometimes not possible for floating point, e.g. reductions like dot product or sum of an array, because FP math isn't quite associative.
(-ffast-math will let compilers pretend it is, or you can use #pragma omp simd reduction(+:my_var_name) to give it license to rearrange the order of operations in just one loop. But that's not safe in general, e.g. if your code does stuff like Kahan summation to compensate for FP rounding error, and -ffast-math enables other optimizations like treating denormals as 0.)
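A minimal sketch of that per-loop pragma (my own example with an assumed array-sum loop; compile with -fopenmp-simd):

// License reassociation for just this one reduction loop, instead of
// globally via -ffast-math.
double sum(const double *a, int n) {
    double total = 0.0;
    #pragma omp simd reduction(+:total)
    for (int i = 0; i < n; ++i)
        total += a[i];
    return total;
}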
You can get even better asm out of gfortran / gcc by omitting the floor(). Neither compiler seems to realize that any result of sqrt will be non-negative or NaN, and thus floor(sqrt) = trunc(sqrt).

(Related: if you want to round a double to the nearest int, rather than floor or trunc, use lrint(x) or (int)nearbyint(x), which can inline to an instruction that uses the current FP rounding mode; unless you changed it, that will be round-to-nearest with even as the tiebreak. See this answer for how it compiles for x86-64 and AArch64.)
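For instance, an illustrative sketch of that nearest-int idiom:

#include <math.h>

// Rounds using the current FP rounding mode (round-to-nearest-even by
// default); with -fno-math-errno this can inline to a single instruction.
int nearest_int(double x) {
    return (int)lrint(x);
}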
GCC's back-end doesn't optimize away floor (with either front-end language); -msse4 just makes it cheap enough for the throughput bottleneck of sqrtsd to hide its cost in your C benchmark. Specifically, we get:
# gcc -O2 -fno-math-errno -msse4 with floor() still in the source.
sqrtsd xmm0, xmm0
roundsd xmm0, xmm0, 9 # floor separately, result as a double
cvttsd2si eax, xmm0 # convert (with Truncation) to signed int
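In intrinsics, that roundsd with immediate 9 looks like the following sketch (my own illustration, not something you need to write by hand): 9 = _MM_FROUND_TO_NEG_INF (1, i.e. floor) | _MM_FROUND_NO_EXC (8, suppress the precision exception).

#include <smmintrin.h>  // SSE4.1

// Equivalent of "roundsd xmm0, xmm0, 9": floor without raising the
// inexact exception.
double floor_sse4(double x) {
    __m128d v = _mm_set_sd(x);
    v = _mm_round_sd(v, v, _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC);
    return _mm_cvtsd_f64(v);
}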
Even without SSE4, GFortran chooses to use a flooring conversion to int (which only has to work for the range of int, not for doubles outside that range, which is what GCC's code was doing manually without -msse4):
# GFortran -O2 -msse4 # with floor() # chooses not to use SSE4
sqrtsd xmm0, xmm0
cvttsd2si eax, xmm0 # eax = int trunc(sqrt result)
cvtsi2sd xmm1, eax # convert back to double
ucomisd xmm0, xmm1 # compare, setting CF if truncation rounded *up*
sbb eax, 0 # eax -= CF
Fun fact: this code avoids checking for "unordered" (PF=1) in the ucomisd FLAGS result. CF = 1 for that case, but apparently gfortran doesn't care that it will make the result 0x7FFFFFFF instead of 0x80000000.
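Rendered as C, that flooring-conversion idiom looks roughly like this (my own sketch of what the asm above computes):

// Truncate, then fix up when truncation rounded up (i.e. for negative
// inputs with a fractional part); valid for doubles in int's range.
int floor_to_int(double x) {
    int t = (int)x;        // cvttsd2si: truncate toward zero
    if (x < (double)t)     // ucomisd: did truncation round up?
        t -= 1;            // sbb eax, 0
    return t;
}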
gcc could have done this for C, since the behaviour is undefined if the double->int conversion result doesn't fit in an int. (Fun fact: negative double -> unsigned is also UB for that reason; the modular range-reduction rule only applies to wide integral types -> unsigned.) So this is a gcc missed-optimization bug for C without SSE4.
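To spell out those conversion rules (illustrative values of my own choosing):

#include <limits.h>

void conversion_rules(void) {
    unsigned a = (unsigned)-1;      // OK: integer -> unsigned uses modular
                                    // range reduction, so a == UINT_MAX
    // unsigned b = (unsigned)-1.5; // UB: negative double -> unsigned
    // int c = (int)1e30;           // UB: double -> int, value doesn't fit
    (void)a;
}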
When SSE4 is available, it's better to use roundsd before conversion (in the general case where floor can't just be optimized away). Round-trip conversion to integer and back is 12 cycles of latency on Skylake (maybe an extra 7 vs. just one convert), plus the ucomisd + sbb latency on top, vs. 8 cycles for roundsd. And roundsd is fewer total uops (https://uops.info/). So this is a missed optimization for gfortran -msse4, which continues to use its double->int->double compare / sbb trick for floor-conversion.
Removing floor()

C (and Fortran) FP -> int conversion truncates (rounds towards 0.0) by default. For non-negative values this is equivalent to floor, so the explicit floor is redundant. Since compilers don't realize that and/or don't take advantage of the fact that a sqrt result is non-negative (or NaN), we can remove it ourselves:
#include <math.h>

int int_sqrt_c(int x){
    return sqrt(x - 1.0);
    //return floor(sqrt(x - 1.0));
}
I also simplified your C to take/return an int, instead of via pointers. https://godbolt.org/z/4zWeaej9T
int_sqrt_c(int):
    pxor xmm0, xmm0
    cvtsi2sd xmm0, edi          # int -> double
    subsd xmm0, QWORD PTR .LC0[rip]
    sqrtsd xmm0, xmm0
    cvttsd2si eax, xmm0         # truncating double -> int
    ret
The exact same difference shows up for the Fortran code, https://godbolt.org/z/EhbTjhGen - removing floor removes the useless instructions, leaving just cvttsd2si.
roundsd costs 2 uops and has 8 cycle latency on Skylake (like a chain of two additions; https://uops.info/), so avoiding it takes out a decent fraction of the total latency from x -> retval, and of the front-end throughput cost.
(Your throughput benchmark will bottleneck on the back-end throughput of sqrtsd, e.g. one per 4.5 cycles on Skylake, and out-of-order exec hides the latency, so your current test setup probably won't be able to measure the difference.)
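If you did want to measure latency, one standard trick (a sketch of my own, reusing the question's pointer-based int_sqrt_c) is to feed each result back into the next input so iterations can't overlap:

#include <benchmark/benchmark.h>

extern "C" void int_sqrt_c(int *x, int *y);

// Serialize iterations: each input depends on the previous output, so
// this measures latency rather than sqrtsd throughput. The "+ 8" is
// arbitrary, just keeping the input positive and in range.
static void bm_int_latency(benchmark::State &state) {
    int x = static_cast<int>(state.range(0));
    int y = 0;
    for (auto _ : state) {
        int_sqrt_c(&x, &y);
        x = y + 8;
    }
    benchmark::DoNotOptimize(y);
}
BENCHMARK(bm_int_latency)->Arg(8);
BENCHMARK_MAIN();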