
Efficient modulo-255 computation

I am trying to find the most efficient way to compute modulo 255 of a 32-bit unsigned integer. My primary focus is to find an algorithm that works well across x86 and ARM platforms, with an eye towards applicability beyond that. To first order, I am trying to avoid memory operations (which could be expensive), so I am looking for bit-twiddly approaches while avoiding tables. I am also trying to avoid potentially expensive operations such as branches and multiplies, and to minimize the number of operations and registers used.

The ISO-C99 code below captures the eight variants I have tried so far. It includes a framework for exhaustive testing, onto which I bolted some crude execution-time measurement that seems to work well enough to get a first impression of performance. On the few platforms I tried (all with fast integer multiplies), the variants WARREN_MUL_SHR_2, WARREN_MUL_SHR_1, and DIGIT_SUM_CARRY_OUT_1 seem to be the most performant. My experiments show that the x86, ARM, PowerPC and MIPS compilers I tried at Compiler Explorer all make very good use of platform-specific features such as three-input LEA, byte-expansion instructions, multiply-accumulate, and instruction predication.

The variant NAIVE_USING_DIV uses an integer division, a back-multiply with the divisor, and a subtraction. This is the baseline case. Modern compilers know how to implement the unsigned integer division by 255 efficiently (via multiplication) and will use a discrete replacement for the back-multiply where appropriate.

To compute a remainder modulo (base - 1), one can sum the base digits, then fold the result. For example, 3334 mod 9: sum 3+3+3+4 = 13, fold 1+3 = 4. If the result after folding is base - 1, we need to generate 0 instead. DIGIT_SUM_THEN_FOLD uses this method.
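As a minimal sketch of the sum-then-fold idea for base 256, the following spells out the steps with a loop rather than the straight-line code of DIGIT_SUM_THEN_FOLD. The names digit_sum_mod255 and check_digit_sum_mod255 are hypothetical and not part of the test program below.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: sum the base-256 digits of x, fold the sum until a
   single digit remains, then map base - 1 = 255 to 0. */
uint32_t digit_sum_mod255 (uint32_t x)
{
    /* Sum of the four base-256 digits; at most 4 * 255 = 1020. */
    uint32_t s = (x & 0xff) + ((x >> 8) & 0xff) + ((x >> 16) & 0xff) + (x >> 24);
    /* Fold: 256 is congruent to 1 modulo 255, so the high digit can simply
       be added to the low digit. */
    while (s > 255) {
        s = (s & 0xff) + (s >> 8);
    }
    return (s == 255) ? 0 : s;
}

/* Spot checks against the % operator. */
void check_digit_sum_mod255 (void)
{
    assert (digit_sum_mod255 (0u) == 0u);
    assert (digit_sum_mod255 (255u) == 0u);
    assert (digit_sum_mod255 (0x12345678u) == 0x12345678u % 255u);
    assert (digit_sum_mod255 (0xffffffffu) == 0u);
}
```

Calling check_digit_sum_mod255() from a main function exercises the helper on a few inputs; the loop and the final comparison are exactly the branches the question is trying to avoid, so this is meant as an illustration only.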

A. Cockburn, "Efficient implementation of the OSI transport protocol checksum algorithm using 8/16-bit arithmetic", ACM SIGCOMM Computer Communication Review, Vol. 17, No. 3, July/Aug. 1987, pp. 13-20

showed a different way of adding digits modulo base - 1 efficiently in the context of a checksum computation modulo 255. Compute a byte-wise sum of the digits, and after each addition, add any carry-out from the addition as well. So this would be an "ADD a, b; ADC a, 0" sequence. Writing out the addition chain for this using base-256 digits, it becomes clear that the computation is basically a multiply with 0x0101 ... 0101. The result will be in the most significant digit position, except that one needs to capture the carry-out from the addition in that position separately. This method only works when a base digit comprises 2^k bits. Here we have k=3. I tried three different ways of remapping a result of base - 1 to 0, resulting in variants DIGIT_SUM_CARRY_OUT_1, DIGIT_SUM_CARRY_OUT_2, DIGIT_SUM_CARRY_OUT_3.
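The ADD / ADC pairing described above can be sketched in portable C as follows. This is my own byte-at-a-time illustration under the assumption that the carry-out is emulated with a wider intermediate; add_end_around, byte_sum_mod255 and check_byte_sum_mod255 are hypothetical names, and this is not the code from Cockburn's paper.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: 8-bit add with the carry-out added back in,
   mimicking ADD a, b followed by ADC a, 0. */
uint8_t add_end_around (uint8_t a, uint8_t b)
{
    uint32_t s = (uint32_t)a + b;              /* 9-bit sum, at most 510   */
    return (uint8_t)((s & 0xff) + (s >> 8));   /* at most 254 + 1 = 255    */
}

/* Sum the four base-256 digits with end-around carry, then map 255 to 0. */
uint32_t byte_sum_mod255 (uint32_t x)
{
    uint8_t t = (uint8_t)(x & 0xff);
    t = add_end_around (t, (uint8_t)((x >> 8) & 0xff));
    t = add_end_around (t, (uint8_t)((x >> 16) & 0xff));
    t = add_end_around (t, (uint8_t)(x >> 24));
    return (t == 255) ? 0u : t;
}

/* Spot checks against the % operator. */
void check_byte_sum_mod255 (void)
{
    assert (add_end_around (200, 100) == 45);  /* 300 mod 255 = 45 */
    assert (byte_sum_mod255 (0xffffffffu) == 0u);
    assert (byte_sum_mod255 (0x12345678u) == 0x12345678u % 255u);
}
```

The DIGIT_SUM_CARRY_OUT variants in the program below obtain the same digit sum with a single multiply by 0x01010101 and recover the final carry-out with the comparison (t < x).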

An intriguing approach to computing modulo-63 efficiently was demonstrated by Joe Keane in the newsgroup comp.lang.c on 1995/07/09. While thread participant Peter L. Montgomery proved the algorithm correct, unfortunately Mr. Keane did not respond to requests to explain its derivation. This algorithm is also reproduced in H. Warren's Hacker's Delight 2nd ed. I was able to extend it, in purely mechanical fashion, to modulo-127 and modulo-255. This is the (appropriately named) KEANE_MAGIC variant. Update: Since I originally posted this question, I have worked out that Keane's approach is basically a clever fixed-point implementation of the following: return (uint32_t)(fmod (x * 256.0 / 255.0 + 0.5, 256.0) * (255.0 / 256.0));. This makes it a close relative of the next variant.

Henry S. Warren, Hacker's Delight 2nd ed., p. 272 shows a "multiply-shift-right" algorithm, presumably devised by the author themself, that is based on the mathematical property that n mod (2^k - 1) = floor (2^k / (2^k - 1) * n) mod 2^k. Fixed-point computation is used to multiply with the factor 2^k / (2^k - 1). I constructed two variants of this that differ in how they handle the mapping of a preliminary result of base - 1 to 0. These are variants WARREN_MUL_SHR_1 and WARREN_MUL_SHR_2.
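The property itself can be checked for k = 8 with exact integer arithmetic. The sketch below evaluates floor (256 / 255 * n) mod 256 using a 64-bit intermediate and a real division, so it only demonstrates the identity and is not a fast implementation; mod255_by_identity and check_mod255_by_identity are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: exact evaluation of floor (2^8 / (2^8 - 1) * n) mod 2^8. */
uint32_t mod255_by_identity (uint32_t n)
{
    uint64_t scaled = ((uint64_t)n * 256u) / 255u;  /* floor (256 * n / 255) */
    return (uint32_t)(scaled & 0xff);               /* reduce modulo 2^8     */
}

/* Spot checks against the % operator. */
void check_mod255_by_identity (void)
{
    assert (mod255_by_identity (0u) == 0u);
    assert (mod255_by_identity (254u) == 254u);
    assert (mod255_by_identity (255u) == 0u);
    assert (mod255_by_identity (0xffffffffu) == 0xffffffffu % 255u);
}
```

Because the factor is exact here, no remapping of 255 to 0 is needed; that special case only appears once 256 / 255 is replaced by a truncated fixed-point approximation, as in the WARREN_MUL_SHR variants.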

Are there algorithms for modulo-255 computation that are even more efficient than the three top contenders I have identified so far, in particular for platforms with slow integer multiplies? An efficient modification of Keane's multiplication-free algorithm for the summing of four base 256 digits would seem to be of particular interest in this context.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define NAIVE_USING_DIV       (1)
#define DIGIT_SUM_THEN_FOLD   (2)
#define DIGIT_SUM_CARRY_OUT_1 (3)
#define DIGIT_SUM_CARRY_OUT_2 (4)
#define DIGIT_SUM_CARRY_OUT_3 (5)
#define KEANE_MAGIC           (6)  // Joe Keane, comp.lang.c, 1995/07/09
#define WARREN_MUL_SHR_1      (7)  // Hacker's Delight, 2nd ed., p. 272
#define WARREN_MUL_SHR_2      (8)  // Hacker's Delight, 2nd ed., p. 272

#define VARIANT (WARREN_MUL_SHR_2)

uint32_t mod255 (uint32_t x)
{
#if VARIANT == NAIVE_USING_DIV
    return x - 255 * (x / 255);
#elif VARIANT == DIGIT_SUM_THEN_FOLD
    x = (x & 0xffff) + (x >> 16);
    x = (x & 0xff) + (x >> 8);
    x = (x & 0xff) + (x >> 8) + 1;
    x = (x & 0xff) + (x >> 8) - 1;
    return x;
#elif VARIANT == DIGIT_SUM_CARRY_OUT_1
    uint32_t t;
    t = 0x01010101 * x;
    t = (t >> 24) + (t < x);
    if (t == 255) t = 0;
    return t;
#elif VARIANT == DIGIT_SUM_CARRY_OUT_2
    uint32_t t;
    t = 0x01010101 * x;
    t = (t >> 24) + (t < x) + 1;
    t = (t & 0xff) + (t >> 8) - 1;
    return t;
#elif VARIANT == DIGIT_SUM_CARRY_OUT_3
    uint32_t t;
    t = 0x01010101 * x;
    t = (t >> 24) + (t < x);
    t = t & ((t - 255) >> 8);
    return t;
#elif VARIANT == KEANE_MAGIC
    x = (((x >> 16) + x) >> 14) + (x << 2);
    x = ((x >> 8) + x + 2) & 0x3ff;
    x = (x - (x >> 8)) >> 2;
    return x;
#elif VARIANT == WARREN_MUL_SHR_1
    x = (0x01010101 * x + (x >> 8)) >> 24;
    x = x & ((x - 255) >> 8);
    return x;
#elif VARIANT == WARREN_MUL_SHR_2
    x = (0x01010101 * x + (x >> 8)) >> 24;
    if (x == 255) x = 0;
    return x;
#else
#error unknown VARIANT
#endif
}

uint32_t ref_mod255 (uint32_t x)
{
    volatile uint32_t t = x;
    t = t % 255;
    return t;
}

// timing with microsecond resolution
#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif

int main (void)
{
    double start, stop;
    uint32_t res, ref, x = 0;

    printf ("Testing VARIANT = %d\n", VARIANT);
    start = second();
    do {
        res = mod255 (x);
        ref = ref_mod255 (x);
        if (res != ref) {
            printf ("error @ %08x: res=%08x ref=%08x\n", x, res, ref);
            return EXIT_FAILURE;
        }        
        x++;
    } while (x);
    stop = second();
    printf ("test passed\n");
    printf ("elapsed = %.6f seconds\n", stop - start);
    return EXIT_SUCCESS;
}
njuffa asked Jun 21 '21 20:06




2 Answers

For arbitrary unsigned integers, x and n, evaluating the modulo expression x % n involves (conceptually, at least), three operations: division, multiplication and subtraction:

quotient = x / n;
product = quotient * n;
modulus = x - product;

However, when n is a power of 2 (n = 2^p), the modulo can be determined much more rapidly, simply by masking out all but the lower p bits.
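As a small sketch of the masking shortcut, assuming a 32-bit operand and an exponent p below 32 (the names mod_pow2 and check_mod_pow2 are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: x mod 2^p by keeping only the low p bits.
   Requires p < 32 so that the shift is well defined. */
uint32_t mod_pow2 (uint32_t x, unsigned p)
{
    assert (p < 32);
    return x & ((UINT32_C(1) << p) - 1);
}

/* Spot checks against the % operator. */
void check_mod_pow2 (void)
{
    assert (mod_pow2 (1000u, 8) == 1000u % 256u);
    assert (mod_pow2 (0xdeadbeefu, 16) == 0xdeadbeefu % 65536u);
    assert (mod_pow2 (12345u, 0) == 0u);
}
```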

On most CPUs, addition, subtraction and bit-masking are very 'cheap' (rapid) operations, multiplication is more 'expensive' and division is very expensive – but note that most optimizing compilers will convert division by a compile-time constant into a multiplication (by a different constant) and a bit-shift (vide infra).

Thus, if we can convert our modulo 255 into a modulo 256, without too much overhead, we can likely speed up the process. We can do just this by noting that x % n is equivalent to (x + x / n) % (n + 1). Thus, our conceptual operations are now: division, addition and masking.

In the specific case of masking the lower 8 bits, x86/x64-based CPUs (and others?) will likely be able to perform a further optimization, as they can access 8-bit versions of (most) registers.

Here's what the clang-cl compiler generates for a naïve modulo 255 function (argument passed in ecx and returned in eax):

unsigned Naive255(unsigned x)
{
    return x % 255;
}
    mov     edx, ecx
    mov     eax, 2155905153 ;
    imul    rax, rdx        ; Replacing the DIV with IMUL and SHR
    shr     rax, 39         ;
    mov     edx, eax
    shl     edx, 8
    sub     eax, edx
    add     eax, ecx

And here's the (clearly faster) code generated using the 'trick' described above:

unsigned Trick255(unsigned x)
{
    return (x + x / 255) & 0xFF;
}
    mov     eax, ecx
    mov     edx, 2155905153
    imul    rdx, rax
    shr     rdx, 39
    add     edx, ecx
    movzx   eax, dl         ; Faster than an explicit AND mask?

Testing this code on a Windows-10 (64-bit) platform (Intel® Core™ i7-8550U CPU) shows that it significantly (but not hugely) out-performs the other algorithms presented in the question.


The answer given by David Eisenstat explains how/why this equivalence is valid.

Adrian Mole answered Oct 03 '22 02:10


Here’s my sense of how the fastest answers work. I don’t know yet whether Keane can be improved or easily generalized.

Given an integer x ≥ 0, let q = ⌊x/255⌋ (in C, q = x / 255;) and r = x − 255 q (in C, r = x % 255;) so that q ≥ 0 and 0 ≤ r < 255 are integers and x = 255 q + r.

Adrian Mole’s method

This method evaluates (x + ⌊x/255⌋) mod 2^8 (in C, (x + x / 255) & 0xff), which equals (255 q + r + q) mod 2^8 = (2^8 q + r) mod 2^8 = r.

Henry S. Warren’s method

Note that x + ⌊x/255⌋ = ⌊x + x/255⌋ = ⌊(2^8/255) x⌋, where the first step follows from x being an integer. This method uses the multiplier (2^0 + 2^−8 + 2^−16 + 2^−24 + 2^−32) instead of 2^8/255, which is the sum of the infinite series 2^0 + 2^−8 + 2^−16 + 2^−24 + 2^−32 + …. Since the approximation is slightly under, this method must detect the residue 2^8 − 1 = 255.
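To see the slight under-approximation at work, the sketch below isolates the first statement of the WARREN_MUL_SHR variants from the question and evaluates it on a few multiples of 255. The function names are hypothetical; the expressions themselves are taken from the question's code.

```c
#include <assert.h>
#include <stdint.h>

/* Preliminary result of the WARREN_MUL_SHR variants, before 255 is remapped. */
uint32_t warren_prelim (uint32_t x)
{
    return (UINT32_C(0x01010101) * x + (x >> 8)) >> 24;
}

/* Full computation with the remapping used by WARREN_MUL_SHR_2. */
uint32_t warren_mod255 (uint32_t x)
{
    uint32_t t = warren_prelim (x);
    return (t == 255) ? 0u : t;
}

/* For these nonzero multiples of 255 the preliminary result is 255, not 0. */
void check_warren_prelim (void)
{
    assert (warren_prelim (0u) == 0u);
    assert (warren_prelim (255u) == 255u);
    assert (warren_prelim (510u) == 255u);
    assert (warren_prelim (0xffffffffu) == 255u);
    assert (warren_prelim (256u) == 1u);
    assert (warren_mod255 (510u) == 0u);
}
```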

Joe Keane’s method

The intuition for this method is to compute y = (2^8/255) x mod 2^8, which equals (2^8/255) (255 q + r) mod 2^8 = (2^8 q + (2^8/255) r) mod 2^8 = (2^8/255) r, and return y − y/2^8, which equals r.

Since these formulas don’t use the fact that ⌊(2^8/255) r⌋ = r, Keane can switch from 2^8 to 2^10 for two guard bits. Ideally, these would always be zero, but due to fixed-point truncation and an approximation for 2^10/255, they’re not. Keane adds 2 to switch from truncation to rounding, which also avoids the special case in Warren.

This method sort of uses the multiplier 2^2 (2^0 + 2^−8 + 2^−16 + 2^−24 + 2^−32 + 2^−40) = 2^2 (2^0 + 2^−16 + 2^−32) (2^0 + 2^−8). The C statement x = (((x >> 16) + x) >> 14) + (x << 2); computes x′ = ⌊2^2 (2^0 + 2^−16 + 2^−32) x⌋ mod 2^32. Then ((x >> 8) + x) & 0x3ff is x′′ = ⌊(2^0 + 2^−8) x′⌋ mod 2^10.
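The three statements of the KEANE_MAGIC variant can be annotated with the quantities named above. The code is the question's; the comments are my reading of this answer and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* KEANE_MAGIC from the question, one statement per step. */
uint32_t keane_mod255 (uint32_t x)
{
    /* Step 1: roughly 2^2 * (1 + 2^-16 + 2^-32) * x, reduced modulo 2^32. */
    x = (((x >> 16) + x) >> 14) + (x << 2);
    /* Step 2: multiply by roughly (1 + 2^-8), add 2 for rounding, and keep
       10 bits: 8 result bits plus 2 guard bits. */
    x = ((x >> 8) + x + 2) & 0x3ff;
    /* Step 3: y - y / 2^8, then drop the two guard bits. */
    x = (x - (x >> 8)) >> 2;
    return x;
}

/* Spot checks against the % operator, including hand-traced inputs. */
void check_keane_mod255 (void)
{
    assert (keane_mod255 (1u) == 1u);
    assert (keane_mod255 (254u) == 254u);
    assert (keane_mod255 (255u) == 0u);
    assert (keane_mod255 (0xffffffffu) == 0u);
}
```

For x = 255, for instance, step 1 yields 1020, step 2 yields (3 + 1020 + 2) & 0x3ff = 1, and step 3 yields (1 - 0) >> 2 = 0, so no separate test for the residue 255 is needed.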

I don’t have time right now to do the error analysis formally. Informally, the error interval of the first computation has width < 1; the second, width < 2 + 2^−8; the third, width < ((2 − 2^−8) + 1)/2^2 < 1, which allows correct rounding.

Regarding improvements, the 2^−40 term of the approximation seems not necessary (?), but we might as well have it unless we can drop the 2^−32 term. Dropping 2^−32 pushes the approximation quality out of spec.

David Eisenstat answered Oct 03 '22 04:10