
Optimal implementation of iterative Kahan summation

Intro
Kahan summation / compensated summation is a technique that addresses the non-associativity of floating-point addition. Rounding errors result in (a+b)+c not being exactly equal to a+(b+c), and these errors accumulate into an undesired relative error over longer series of sums, which is a common obstacle in scientific computing.

Task
I desire the optimal implementation of Kahan summation. I suspect that the best performance may be achieved with handcrafted assembly code.

Attempts
The code below calculates the sum of 1000 random numbers in range [0,1] with three approaches.

  1. Standard summation: A naive implementation that accumulates a root-mean-square relative error growing as O(sqrt(N))

  2. Kahan summation [g++]: Compensated summation using the C++ function "csum". Explanation in comments. Note that some compilers may have default flags that invalidate this implementation (see output below).

  3. Kahan summation [asm]: Compensated summation implemented as "csumasm" using the same algorithm as "csum". Cryptic explanation in comments.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

extern "C" void csumasm(double&, double, double&);
__asm__( //Windows x64 calling convention: &a in rcx, b in xmm1, &c in r8
    "csumasm:\n"
    "movsd  (%rcx), %xmm0\n" //xmm0 = a
    "subsd  (%r8), %xmm1\n"  //xmm1 - r8 (c) | y = b-c
    "movapd %xmm0, %xmm2\n"  
    "addsd  %xmm1, %xmm2\n"  //xmm2 + xmm1 (y) | b = a+y
    "movapd %xmm2, %xmm3\n" 
    "subsd  %xmm0, %xmm3\n"  //xmm3 - xmm0 (a) | b - a
    "movapd %xmm3, %xmm0\n"  
    "subsd  %xmm1, %xmm0\n"  //xmm0 - xmm1 (y) | - y
    "movsd  %xmm0, (%r8)\n"  //xmm0 to c
    "movsd  %xmm2, (%rcx)\n" //b to a
    "ret\n"
);

void csum(double &a,double b,double &c) { //adds b into a, carrying c as the compensation term
    double y = b-c; //apply the stored correction to the incoming b
    b = a+y; //new rounded total
    c = (b-a)-y; //recover the rounding error of this addition for the next call
    a = b;
}

double fun(double fMin, double fMax){
    double f = (double)rand()/RAND_MAX;
    return fMin + f*(fMax - fMin); //returns random value
}

int main(int argc, char** argv) {
    int N = 1000;

    srand(0); //use 0 seed for each method
    double sum1 = 0;
    for (int n = 0; n < N; ++n)
        sum1 += fun(0,1);

    srand(0);
    double sum2 = 0;
    double c = 0; //compensation term
    for (int n = 0; n < N; ++n)
        csum(sum2,fun(0,1),c);

    srand(0);
    double sum3 = 0;
    c = 0;
    for (int n = 0; n < N; ++n)
        csumasm(sum3,fun(0,1),c);

    printf("Standard summation:\n %.16e (error: %.16e)\n\n",sum1,sum1-sum3);
    printf("Kahan compensated summation [g++]:\n %.16e (error: %.16e)\n\n",sum2,sum2-sum3);
    printf("Kahan compensated summation [asm]:\n %.16e\n",sum3);
    return 0;
}

The output with -O3 is:

Standard summation:
 5.1991955320902093e+002 (error: -3.4106051316484809e-013)

Kahan compensated summation [g++]:
 5.1991955320902127e+002 (error: 0.0000000000000000e+000)

Kahan compensated summation [asm]:
 5.1991955320902127e+002

The output with -O3 -ffast-math is:

Standard summation:
 5.1991955320902093e+002 (error: -3.4106051316484809e-013)

Kahan compensated summation [g++]:
 5.1991955320902093e+002 (error: -3.4106051316484809e-013)

Kahan compensated summation [asm]:
 5.1991955320902127e+002

It is clear that -ffast-math destroys the Kahan summation arithmetic, which is unfortunate because my program requires the use of -ffast-math.

Question

  1. Is it possible to construct better/faster x64 asm for Kahan's compensated summation? Perhaps there is a clever way to skip some of the movapd instructions?

  2. If no better asm is possible, is there a C++ way to implement Kahan summation that survives -ffast-math without devolving to the naive summation? Perhaps a C++ implementation gives the compiler more flexibility to optimize.

Ideas or suggestions are appreciated.

Further information

  • The contents of "fun" cannot be inlined, but the "csum" function could be.
  • The sum must be calculated as an iterative process (the corrected term must be applied on every single addition). This is because the intended summation function takes an input that depends on the previous sum.
  • The intended summation function is called indefinitely, several hundred million times per second, which motivates the pursuit of a high-performance low-level implementation.
  • Higher-precision arithmetic such as long double, float128 or arbitrary-precision libraries is ruled out for performance reasons.

Edit: Inlined csum (doesn't make much sense without the full code, but just for reference)

        subsd   xmm0, QWORD PTR [rsp+32]
        movapd  xmm1, xmm3
        addsd   xmm3, xmm0
        movsd   QWORD PTR [rsp+16], xmm3
        subsd   xmm3, xmm1
        movapd  xmm1, xmm3
        subsd   xmm1, xmm0
        movsd   QWORD PTR [rsp+32], xmm1
asked Aug 01 '19 by Egeris
1 Answer

You can put functions that need to not use -ffast-math (like a csum loop) in a separate file that gets compiled without -ffast-math.

Possibly you could also use __attribute__((optimize("no-fast-math"))), but https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html says that optimization-level pragmas and attributes aren't "suitable in production code", unfortunately.

update: apparently part of the question was based on a misunderstanding that -O3 wasn't safe, or something? It is; ISO C++ specifies FP math rules that are like GCC's -fno-fast-math. Compiling everything with just -O3 apparently makes the OP's code run quickly and safely. See the bottom of this answer for workarounds like OpenMP to get some of the benefit of fast-math for some parts of your code without actually enabling -ffast-math.

ICC defaults to fast-math, so you have to specifically enable FP=strict for it to be safe with -O3, but gcc/clang default to fully strict FP regardless of other optimization settings. (except -Ofast = -O3 -ffast-math)


You should be able to vectorize Kahan summation by keeping a vector (or four) of totals and an equal number of vectors of compensations. You can do that with intrinsics (as long as you don't enable fast-math for that file).

e.g. use SSE2 __m128d for 2 packed additions per instruction. Or AVX __m256d. On modern x86, addpd / subpd have the same performance as addsd and subsd (1 uop, 3 to 5 cycle latency depending on microarchitecture: https://agner.org/optimize/).

So you're effectively doing 8 compensated summations in parallel, each sum getting every 8th input element.

Generating random numbers on the fly with your fun() is significantly slower than reading them from memory. If your normal use-case has data in memory, you should be benchmarking that. Otherwise I guess scalar is interesting.


If you're going to use inline asm, it would be much better to actually use it inline so you can get multiple inputs and multiple outputs in XMM registers with Extended asm, not stored/reloaded through memory.

Defining a stand-alone function that actually takes args by reference looks pretty performance-defeating. (Especially when it doesn't even return either of them as a return value to avoid one of the store/reload chains). Even just making a function call introduces a lot of overhead by clobbering many registers. (Not as bad in Windows x64 as in x86-64 System V where all the XMM regs are call-clobbered, and more of the integer regs.)

Also your stand-alone function is specific to the Windows x64 calling convention so it's less portable than inline asm inside a function would be.

And BTW, clang managed to implement csum(double&, double, double&): with only two movapd instructions, instead of the 3 in your asm (which I assume you copied from GCC's asm output). https://godbolt.org/z/lw6tug. If you can assume AVX is available, you can avoid all of them: 3-operand VEX instructions like vsubsd write to a fresh destination register instead of destroying a source.

And BTW, movaps is 1 byte smaller and should be used instead. No CPUs have had separate data domains / forwarding networks for double vs. float, just vec-FP vs. vec-int (vs. GP integer)

But really by far your best bet is to get GCC to compile a file or function without -ffast-math. https://gcc.gnu.org/wiki/DontUseInlineAsm. That lets the compiler avoid the movaps instructions when AVX is available, besides letting it optimize better when unrolling.

If you're willing to accept the overhead of a function-call for every element, you might as well let the compiler generate that asm by putting csum in a separate file. (Hopefully link-time optimization respects -fno-fast-math for one file, perhaps by not inlining that function.)

But it would be much better to disable fast-math for the whole function containing the summation loop by putting it in a separate file. You may be stuck choosing where non-inline function-call boundaries need to be, based on compiling some code with fast-math and others without.

Ideally compile all of your code with -O3 -march=native, and profile-guided optimization. Also -flto link-time optimization to enable cross-file inlining.


It's not surprising that -ffast-math breaks Kahan summation: treating FP math as associative is one of the main reasons to use fast-math. If you need other parts of -ffast-math like -fno-math-errno and -fno-trapping-math so math functions can inline better, then enable those manually. Those are basically always safe and a good idea; nobody checks errno after calling sqrt so that requirement to set errno for some inputs is just a terrible misdesign of C that burdens implementations unnecessarily. GCC's -ftrapping-math is on by default even though it's broken (it doesn't always exactly reproduce the number of FP exceptions you'd get if you unmasked any) so it should really be off by default. Turning it off doesn't enable any optimizations that would break NaN propagation, it only tells GCC that the number of exceptions isn't a visible side-effect.

Or maybe try -ffast-math -fno-associative-math for your Kahan summation file, but that's the main one that's needed to auto-vectorize FP loops that involve reductions, and helps in other cases. But still, there are several other valuable optimizations that you'd still get.


Another way to get optimizations that normally require fast-math is #pragma omp simd to enable auto-vectorization with OpenMP even in files compiled without auto-vectorization. You can declare an accumulator variable for a reduction to let gcc reorder operations on it as if they were associative.

answered Oct 26 '22 by Peter Cordes