Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient computation of the average of three unsigned integers (without overflow)

There is an existing question "Average of 3 long integers" that is specifically concerned with the efficient computation of the average of three signed integers.

The use of unsigned integers however allows for additional optimizations not applicable to the scenario covered in the previous question. This question is about the efficient computation of the average of three unsigned integers, where the average is rounded towards zero, i.e. in mathematical terms I want to compute ⌊ (a + b + c) / 3 ⌋.

A straightforward way to compute this average is

 avg = a / 3 + b / 3 + c / 3 + (a % 3 + b % 3 + c % 3) / 3;

To first order, modern optimizing compilers will transform the divisions into multiplications with a reciprocal plus a shift, and the modulo operations into a back-multiply and a subtraction, where the back-multiply may use a scale_add idiom available on many architectures, e.g. lea on x86_64, add with lsl #n on ARM, iscadd on NVIDIA GPUs.

In trying to optimize the above in a generic fashion suitable for many common platforms, I observe that typically the cost of integer operations is in the relationship logical ≤ (add | sub) ≤ shiftscale_addmul. Cost here refers to all of latency, throughput limitations, and power consumption. Any such differences become more pronounced when the integer type processed is wider than the native register width, e.g. when processing uint64_t data on a 32-bit processor.

My optimization strategy was therefore to minimize instruction count and replace "expensive" with "cheap" operations where possible, while not increasing register pressure and retaining exploitable parallelism for wide out-of-order processors.

The first observation is that we can reduce a sum of three operands into a sum of two operands by first applying a CSA (carry save adder) that produces a sum value and a carry value, where the carry value has twice the weight of the sum value. The cost of a software-based CSA is five logicals on most processors. Some processors, like NVIDIA GPUs, have a LOP3 instruction that can compute an arbitrary logical expression of three operands in one fell swoop, in which case CSA condenses to two LOP3s (note: I have yet convince the CUDA compiler to emit those two LOP3s; it currently produces four LOP3s!).

The second observation is that because we are computing the modulo of division by 3, we don't need a back-multiply to compute it. We can instead use dividend % 3 = ((dividend / 3) + dividend) & 3, reducing the modulo to an add plus a logical since we already have the division result. This is an instance of the general algorithm: dividend % (2n-1) = ((dividend / (2n-1) + dividend) & (2n-1).

Finally for the division by 3 in the correction term (a % 3 + b % 3 + c % 3) / 3 we don't need the code for generic division by 3. Since the dividend is very small, in [0, 6], we can simplify x / 3 into (3 * x) / 8 which requires just a scale_add plus a shift.

The code below shows my current work-in-progress. Using Compiler Explorer to check the code generated for various platforms shows the tight code I would expect (when compiled with -O3).

However, in timing the code on my Ivy Bridge x86_64 machine using the Intel 13.x compiler, a flaw became apparent: while my code improves latency (from 18 cycles to 15 cycles for uint64_t data) compared to the simple version, throughput worsens (from one result every 6.8 cycles to one result every 8.5 cycles for uint64_t data). Looking at the assembly code more closely it is quite apparent why that is: I basically managed to take the code down from roughly three-way parallelism to roughly two-way parallelism.

Is there a generically applicable optimization technique, beneficial on common processors in particular all flavors of x86 and ARM as well as GPUs, that preserves more parallelism? Alternatively, is there an optimization technique that further reduces overall operation count to make up for reduced parallelism? The computation of the correction term (tail in the code below) seems like a good target. The simplification (carry_mod_3 + sum_mod_3) / 2 looked enticing but delivers an incorrect result for one of the nine possible combinations.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define BENCHMARK           (1)
#define SIMPLE_COMPUTATION  (0)

#if BENCHMARK
#define T uint64_t
#else // !BENCHMARK
#define T uint8_t
#endif // BENCHMARK

T average_of_3 (T a, T b, T c) 
{
    T avg;

#if SIMPLE_COMPUTATION
    avg = a / 3 + b / 3 + c / 3 + (a % 3 + b % 3 + c % 3) / 3;
#else // !SIMPLE_COMPUTATION
    /* carry save adder */
    T a_xor_b = a ^ b;
    T sum = a_xor_b ^ c;
    T carry = (a_xor_b & c) | (a & b);
    /* here 2 * carry + sum = a + b + c */
    T sum_div_3 = (sum / 3);                                   // {MUL|MULHI}, SHR
    T sum_mod_3 = (sum + sum_div_3) & 3;                       // ADD, AND

    if (sizeof (size_t) == sizeof (T)) { // "native precision" (well, not always)
        T two_carry_div_3 = (carry / 3) * 2;                   // MULHI, ANDN
        T two_carry_mod_3 = (2 * carry + two_carry_div_3) & 6; // SCALE_ADD, AND
        T head = two_carry_div_3 + sum_div_3;                  // ADD
        T tail = (3 * (two_carry_mod_3 + sum_mod_3)) / 8;      // ADD, SCALE_ADD, SHR
        avg = head + tail;                                     // ADD
    } else {
        T carry_div_3 = (carry / 3);                           // MUL, SHR
        T carry_mod_3 = (carry + carry_div_3) & 3;             // ADD, AND
        T head = (2 * carry_div_3 + sum_div_3);                // SCALE_ADD
        T tail = (3 * (2 * carry_mod_3 + sum_mod_3)) / 8;      // SCALE_ADD, SCALE_ADD, SHR
        avg = head + tail;                                     // ADD
    }
#endif // SIMPLE_COMPUTATION
    return avg;
}

#if !BENCHMARK
/* Test correctness on 8-bit data exhaustively. Should catch most errors */
int main (void)
{
    T a, b, c, res, ref;
    a = 0;
    do {
        b = 0;
        do {
            c = 0;
            do {
                res = average_of_3 (a, b, c);
                ref = ((uint64_t)a + (uint64_t)b + (uint64_t)c) / 3;
                if (res != ref) {
                    printf ("a=%08x  b=%08x  c=%08x  res=%08x  ref=%08x\n", 
                            a, b, c, res, ref);
                    return EXIT_FAILURE;
                }
                c++;
            } while (c);
            b++;
        } while (b);
        a++;
    } while (a);
    return EXIT_SUCCESS;
}

#else // BENCHMARK

#include <math.h>

// A routine to give access to a high precision timer on most systems.
#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif

#define N  (3000000)
int main (void)
{
    double start, stop, elapsed = INFINITY;
    int i, k;
    T a, b;
    T avg0  = 0xffffffff,  avg1 = 0xfffffffe;
    T avg2  = 0xfffffffd,  avg3 = 0xfffffffc;
    T avg4  = 0xfffffffb,  avg5 = 0xfffffffa;
    T avg6  = 0xfffffff9,  avg7 = 0xfffffff8;
    T avg8  = 0xfffffff7,  avg9 = 0xfffffff6;
    T avg10 = 0xfffffff5, avg11 = 0xfffffff4;
    T avg12 = 0xfffffff2, avg13 = 0xfffffff2;
    T avg14 = 0xfffffff1, avg15 = 0xfffffff0;

    a = 0x31415926;
    b = 0x27182818;
    avg0 = average_of_3 (a, b, avg0);
    for (k = 0; k < 5; k++) {
        start = second();
        for (i = 0; i < N; i++) {
            avg0 = average_of_3 (a, b, avg0);
            avg0 = average_of_3 (a, b, avg0);
            avg0 = average_of_3 (a, b, avg0);
            avg0 = average_of_3 (a, b, avg0);
            avg0 = average_of_3 (a, b, avg0);
            avg0 = average_of_3 (a, b, avg0);
            avg0 = average_of_3 (a, b, avg0);
            avg0 = average_of_3 (a, b, avg0);
            avg0 = average_of_3 (a, b, avg0);
            avg0 = average_of_3 (a, b, avg0);
            avg0 = average_of_3 (a, b, avg0);
            avg0 = average_of_3 (a, b, avg0);
            avg0 = average_of_3 (a, b, avg0);
            avg0 = average_of_3 (a, b, avg0);
            avg0 = average_of_3 (a, b, avg0);
            avg0 = average_of_3 (a, b, avg0);
            b = (b + avg0) ^ a;
            a = (a ^ b) + avg0;
        }
        stop = second();
        elapsed = fmin (stop - start, elapsed);
    }
    printf ("a=%016llx b=%016llx avg=%016llx", 
            (uint64_t)a, (uint64_t)b, (uint64_t)avg0);
    printf ("\rlatency:    each average_of_3() took  %.6e seconds\n", 
            elapsed / 16 / N);


    a = 0x31415926;
    b = 0x27182818;
    avg0 = average_of_3 (a, b, avg0);
    for (k = 0; k < 5; k++) {
        start = second();
        for (i = 0; i < N; i++) {
            avg0  = average_of_3 (a, b, avg0);
            avg1  = average_of_3 (a, b, avg1);
            avg2  = average_of_3 (a, b, avg2);
            avg3  = average_of_3 (a, b, avg3);
            avg4  = average_of_3 (a, b, avg4);
            avg5  = average_of_3 (a, b, avg5);
            avg6  = average_of_3 (a, b, avg6);
            avg7  = average_of_3 (a, b, avg7);
            avg8  = average_of_3 (a, b, avg8);
            avg9  = average_of_3 (a, b, avg9);
            avg10 = average_of_3 (a, b, avg10);
            avg11 = average_of_3 (a, b, avg11);
            avg12 = average_of_3 (a, b, avg12);
            avg13 = average_of_3 (a, b, avg13);
            avg14 = average_of_3 (a, b, avg14);
            avg15 = average_of_3 (a, b, avg15);
            b = (b + avg0) ^ a;
            a = (a ^ b) + avg0;
        }
        stop = second();
        elapsed = fmin (stop - start, elapsed);
    }
    printf ("a=%016llx b=%016llx avg=%016llx", (uint64_t)a, (uint64_t)b, 
            (uint64_t)(avg0 + avg1 + avg2 + avg3 + avg4 + avg5 + avg6 + avg7 + 
                       avg8 + avg9 +avg10 +avg11 +avg12 +avg13 +avg14 +avg15));
    printf ("\rthroughput: each average_of_3() took  %.6e seconds\n", 
            elapsed / 16 / N);

    return EXIT_SUCCESS;
}

#endif // BENCHMARK
like image 283
njuffa Avatar asked Oct 27 '20 21:10

njuffa


People also ask

What is the average of unsigned integers 32 bits wide?

unsigned average (unsigned a, unsigned b) { return (a + b) / 2; } However, this gives the wrong answer in the face of integer overflow: For example, if unsigned integers are 32 bits wide, then it says that average (0x80000000U, 0x80000000U) is zero.

How do you find the average of two integers?

Finding the average of two unsigned integers, rounding toward zero, sounds easy: unsigned average (unsigned a, unsigned b) { return (a + b) / 2; } However, this gives the wrong answer in the face of integer overflow: For example, if unsigned integers are 32 bits wide, then it says that average (0x80000000U, 0x80000000U) is zero.

What is the average of the two numbers by standard method?

The average of the two numbers by standard method is (sum / 2). Since the sum of the two numbers exceed INT_MAX, the obtained output by standard method is incorrect. Approach: The given problem can be solved based on the following observations: The average of N array elements can be obtained by dividing the sum of the array elements by N.

How to cast an unsigned integer to a larger number?

unsigned average (unsigned a, unsigned b) { return (a & b) + (a ^ b) / 2; } If your compiler supports integers larger than the size of an unsigned, say because unsigned is a 32-bit value but the native register size is 64-bit, or because the compiler supports multiword arithmetic, then you can cast to the larger data type:


Video Answer


3 Answers

Let me throw my hat in the ring. Not doing anything too tricky here, I think.

#include <stdint.h>

uint64_t average_of_three(uint64_t a, uint64_t b, uint64_t c) {
  uint64_t hi = (a >> 32) + (b >> 32) + (c >> 32);
  uint64_t lo = hi + (a & 0xffffffff) + (b & 0xffffffff) + (c & 0xffffffff);
  return 0x55555555 * hi + lo / 3;
}

Following discussion below about different splits, here's a version that saves a multiply at the expense of three bitwise-ANDs:

T hi = (a >> 2) + (b >> 2) + (c >> 2);
T lo = (a & 3) + (b & 3) + (c & 3);
avg = hi + (hi + lo) / 3;
like image 130
David Eisenstat Avatar answered Oct 23 '22 00:10

David Eisenstat


I'm not sure if it fits your requirements, but maybe it works to just calculate the result and then fixup the error from the overflow:

T average_of_3 (T a, T b, T c)
{
    T r = ((T) (a + b + c)) / 3;
    T o = (a > (T) ~b) + ((T) (a + b) > (T) (~c));
    if (o) r += ((T) 0x5555555555555555) << (o - 1);
    T rem = ((T) (a + b + c)) % 3;
    if (rem >= (3 - o)) ++r;
    return r;
}

[EDIT] Here is the best branch-and-compare-less version I can come up with. On my machine, this version actually has slightly higher throughput than njuffa's code. __builtin_add_overflow(x, y, r) is supported by gcc and clang and returns 1 if the sum x + y overflows the type of *r and 0 otherwise, so the calculation of o is equivalent to the portable code in the first version, but at least gcc produces better code with the builtin.

T average_of_3 (T a, T b, T c)
{
    T r = ((T) (a + b + c)) / 3;
    T rem = ((T) (a + b + c)) % 3;
    T dummy;
    T o = __builtin_add_overflow(a, b, &dummy) + __builtin_add_overflow((T) (a + b), c, &dummy);
    r += -((o - 1) & 0xaaaaaaaaaaaaaaab) ^ 0x5555555555555555;
    r += (rem + o + 1) >> 2;
    return r;
}
like image 6
Falk Hüffner Avatar answered Oct 23 '22 00:10

Falk Hüffner


I answered the question you linked to already, so I am only answering the part that is different about this one: performance.

If you really cared about performance, then the answer is:

( a + b + c ) / 3

Since you cared about performance, you should have an intuition about the size of data you are working with. You should not have worried about overflow on addition (multiplication is another matter) of only 3 values, because if your data is already big enough to use the high bits of your chosen data type, you are in danger of overflow anyway and should have used a larger integer type. If you are overflowing on uint64_t, then you should really ask yourself why exactly do you need to count accurately up to 18 quintillion, and perhaps consider using float or double.

Now, having said all that, I will give you my actual reply: It doesn't matter. The question doesn't come up in real life and when it does, perf doesn't matter.

It could be a real performance question if you are doing it a million times in SIMD, because there, you are really incentivized to use integers of smaller width and you may need that last bit of headroom, but that wasn't your question.

like image 5
KevinZ Avatar answered Oct 23 '22 02:10

KevinZ