Intro
Kahan summation / compensated summation is technique that addresses compilers´ inability to respect the associative property of numbers. Truncation errors results in (a+b)+c not being exactly equal to a+(b+c) and thus accumulate an undesired relative error on longer series of sums, which is a common obstacle in scientific computing.
Task
I desire the optimal implementation of Kahan summation. I suspect that the best performance may be achieved with handcrafted assembly code.
Attempts
The code below calculates the sum of 1000 random numbers in range [0,1] with three approaches.
Standard summation: Naive implementation which accumulates a root mean square relative error that grows as O(sqrt(N))
Kahan summation [g++]: Compensated summation using the c/c++ function "csum". Explanation in comments. Note that some compilers may have default flags that invalidate this implementation (see output below).
Kahan summation [asm]: Compensated summation implemented as "csumasm" using the same algorithm as "csum". Cryptic explanation in comments.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
extern "C" void csumasm(double&, double, double&);
__asm__(
"csumasm:\n"
"movsd (%rcx), %xmm0\n" //xmm0 = a
"subsd (%r8), %xmm1\n" //xmm1 - r8 (c) | y = b-c
"movapd %xmm0, %xmm2\n"
"addsd %xmm1, %xmm2\n" //xmm2 + xmm1 (y) | b = a+y
"movapd %xmm2, %xmm3\n"
"subsd %xmm0, %xmm3\n" //xmm3 - xmm0 (a) | b - a
"movapd %xmm3, %xmm0\n"
"subsd %xmm1, %xmm0\n" //xmm0 - xmm1 (y) | - y
"movsd %xmm0, (%r8)\n" //xmm0 to c
"movsd %xmm2, (%rcx)\n" //b to a
"ret\n"
);
void csum(double &a,double b,double &c) { //this function adds a and b, and passes c as a compensation term
double y = b-c; //y is the correction of b argument
b = a+y; //add corrected b argument to a argument. The output of the current summation
c = (b-a)-y; //find new error to be passed as a compensation term
a = b;
}
double fun(double fMin, double fMax){
double f = (double)rand()/RAND_MAX;
return fMin + f*(fMax - fMin); //returns random value
}
int main(int argc, char** argv) {
int N = 1000;
srand(0); //use 0 seed for each method
double sum1 = 0;
for (int n = 0; n < N; ++n)
sum1 += fun(0,1);
srand(0);
double sum2 = 0;
double c = 0; //compensation term
for (int n = 0; n < N; ++n)
csum(sum2,fun(0,1),c);
srand(0);
double sum3 = 0;
c = 0;
for (int n = 0; n < N; ++n)
csumasm(sum3,fun(0,1),c);
printf("Standard summation:\n %.16e (error: %.16e)\n\n",sum1,sum1-sum3);
printf("Kahan compensated summation [g++]:\n %.16e (error: %.16e)\n\n",sum2,sum2-sum3);
printf("Kahan compensated summation [asm]:\n %.16e\n",sum3);
return 0;
}
The output with -O3 is:
Standard summation:
5.1991955320902093e+002 (error: -3.4106051316484809e-013)
Kahan compensated summation [g++]:
5.1991955320902127e+002 (error: 0.0000000000000000e+000)
Kahan compensated summation [asm]:
5.1991955320902127e+002
The output with -O3 -ffast-math
Standard summation:
5.1991955320902093e+002 (error: -3.4106051316484809e-013)
Kahan compensated summation [g++]:
5.1991955320902093e+002 (error: -3.4106051316484809e-013)
Kahan compensated summation [asm]:
5.1991955320902127e+002
It is clear that -ffast-math destroys the Kahan summation arithmetic, which is unfortunate because my program requires the use of -ffast-math.
Question
Is it possible to construct a better/faster asm x64 code for Kahan's compensated summation? Perhaps there is a clever way to skip some of the movapd instructions?
If no better asm codes are possible, is there a c++ way to implement Kahan summation that can be used with -ffast-math without devolving to the naive summation? Perhaps a c++ implementation is generally more flexible for the compiler to optimize.
Ideas or suggestions are appreciated.
Further information
Edit: Inlined csum (doesn't make much sense without the full code, but just for reference)
subsd xmm0, QWORD PTR [rsp+32]
movapd xmm1, xmm3
addsd xmm3, xmm0
movsd QWORD PTR [rsp+16], xmm3
subsd xmm3, xmm1
movapd xmm1, xmm3
subsd xmm1, xmm0
movsd QWORD PTR [rsp+32], xmm1
You can put functions that need to not use -ffast-math
(like a csum loop) in a separate file that gets compiled without -ffast-math
.
Possibly you could also use __attribute__((optimize("no-fast-math")))
, but https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html says that optimization-level pragmas and attributes aren't "suitable in production code", unfortunately.
update: apparently part of the question was based on a misunderstanding that -O3
wasn't safe, or something? It is; ISO C++ specifies FP math rules that are like GCC's -fno-fast-math
. Compiling everything with just -O3
apparently makes the OP's code run quickly and safely. See the bottom of this answer for workarounds like OpenMP to get some of the benefit of fast-math for some parts of your code without actually enabling -ffast-math
.
ICC defaults to fast-path so you have to specifically enable FP=strict for it to be safe with -O3, but gcc/clang default to fully strict FP regardless of other optimization settings. (except -Ofast
= -O3 -ffast-math
)
You should be able to vectorize Kahan summation by keeping a vector (or four) of totals and an equal number of vectors of compensations. You can do that with intrinsics (as long as you don't enable fast-math for that file).
e.g. use SSE2 __m128d
for 2 packed additions per instruction. Or AVX __m256d
. On modern x86, addpd
/ subpd
have the same performance as addsd
and subsd
(1 uop, 3 to 5 cycle latency depending on microarchitecture: https://agner.org/optimize/).
So you're effectively doing 8 compensated summations in parallel, each sum getting every 8th input element.
Generating random numbers on the fly with your fun()
is significantly slower than reading them from memory. If your normal use-case has data in memory, you should be benchmarking that. Otherwise I guess scalar is interesting.
If you're going to use inline asm, it would be much better to actually use it inline so you can get multiple inputs and multiple outputs in XMM registers with Extended asm, not stored/reloaded through memory.
Defining a stand-alone function that actually takes args by reference looks pretty performance-defeating. (Especially when it doesn't even return either of them as a return value to avoid one of the store/reload chains). Even just making a function call introduces a lot of overhead by clobbering many registers. (Not as bad in Windows x64 as in x86-64 System V where all the XMM regs are call-clobbered, and more of the integer regs.)
Also your stand-alone function is specific to the Windows x64 calling convention so it's less portable than inline asm inside a function would be.
And BTW, clang
managed to implement csum(double&, double, double&):
with only two movapd
instructions, instead of the 3 in your asm (which I assume you copied from GCC's asm output). https://godbolt.org/z/lw6tug. If you can assume AVX is available, you can avoid any.
And BTW, movaps
is 1 byte smaller and should be used instead. No CPUs have had separate data domains / forwarding networks for double
vs. float
, just vec-FP vs. vec-int (vs. GP integer)
But really by far your bet is to get GCC to compile a file or function without -ffast-math
. https://gcc.gnu.org/wiki/DontUseInlineAsm. That lets the compiler avoid the movaps
instructions when AVX is available, besides letting it optimize better when unrolling.
If you're willing to accept the overhead of a function-call for every element, you might as well let the compiler generate that asm by putting csum
in a separate file. (Hopefully link-time optimization respects -fno-fast-math
for one file, perhaps by not inlining that function.)
But it would be much better to disable fast-math for the whole function containing the summation loop by putting it in a separate file. You may be stuck choosing where non-inline function-call boundaries need to be, based on compiling some code with fast-math and others without.
Ideally compile all of your code with -O3 -march=native
, and profile-guided optimization. Also -flto
link-time optimization to enable cross-file inlining.
It's not surprising that -ffast-math
breaks Kahan summation: treating FP math as associative is one of the main reasons to use fast-math. If you need other parts of -ffast-math
like -fno-math-errno
and -fno-trapping-math
so math functions can inline better, then enable those manually. Those are basically always safe and a good idea; nobody checks errno
after calling sqrt
so that requirement to set errno for some inputs is just a terrible misdesign of C that burdens implementations unnecessarily. GCC's -ftrapping-math
is on by default even though it's broken (it doesn't always exactly reproduce the number of FP exceptions you'd get if you unmasked any) so it should really be off by default. Turning it off doesn't enable any optimizations that would break NaN propagation, it only tells GCC that the number of exceptions isn't a visible side-effect.
Or maybe try -ffast-math -fno-associative-math
for your Kahan summation file, but that's the main one that's needed to auto-vectorize FP loops that involve reductions, and helps in other cases. But still, there are several other valuable optimizations that you'd still get.
Another way to get optimizations that normally require fast-math is #pragma omp simd
to enable auto-vectorization with OpenMP even in files compiled without auto-vectorization. You can declare an accumulator variable for a reduction to let gcc reorder operations on it as if they were associative.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With