 

Optimize multiply and add

Tags:

c

optimization

I'm using C and I have two non-negative integers n and m (both >= 0, n < 500). I need to compute the expression

n*(n+1)/2 + m

and this will be needed hundreds of millions of times, so I want to optimize it as much as possible. My current implementation is:

inline int func(const int n, const int m) { return ( (n*(n+1) >> 1) + m); }

using inline and the >> 1 to do the divide by 2. Is there any other way to speed up this calculation?

asked Aug 30 '19 by vibe



2 Answers

Given that n will be less than 500, you can precompute all possible values of n*(n+1)/2 and put them in a table, then use that table to perform the computation:

int n_sum[500];

// call this once at program start
void init_sum()
{
    int i;
    for (i=0;i<500;i++) {
        n_sum[i] = i*(i+1)/2;
    }
}

inline int func(const int n, const int m)
{ 
    return n_sum[n] + m;
}
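A minimal usage sketch, assuming the code above lives in the same translation unit (under C99 inline semantics you may want static inline on func, or a reasonable optimization level, so the call resolves):

#include <stdio.h>

int main(void)
{
    init_sum();                      // fill the table once at startup

    long total = 0;
    for (int n = 0; n < 500; n++) {
        for (int m = 0; m < 500; m++) {
            total += func(n, m);     // now just a load and an add
        }
    }

    printf("%ld\n", total);
    return 0;
}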
answered Sep 27 '22 by dbush


In practice, what you want to do is write a loop that the compiler can easily and efficiently vectorize and parallelize. If you have two arrays n[i] and m[i], any modern compiler can probably figure out how to optimize n[i]*(n[i]+1)/2 + m[i] if given the proper flags. Trying to force the compiler to do optimizations on one word at a time will, in general, be counterproductive. Modern hardware is fastest when you parallelize your critical loops. If you don’t want to use non-portable intrinsics or libraries designed for the purpose, you can best achieve that by minimizing data dependencies and writing code that is easy to statically analyze.

You might or might not be able to improve the generated code with (n*n + n)/2 + m, that is, expanding n*(n+1) into n*n + n. This can help because it lets the code generator keep just one vector register as the accumulator, maximizing the number of registers available for SIMD. You should use restrict and alignas as appropriate to enable maximum optimization; a sketch follows.
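As a sketch of what that might look like (the 64-byte alignment, the array size and the names here are illustrative choices, not anything from the question):

#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define COUNT (1L << 20)

/* Over-aligning the arrays gives the compiler the option of using
 * aligned vector loads and stores. */
alignas(64) uint32_t n_vals[COUNT];
alignas(64) uint32_t m_vals[COUNT];
alignas(64) uint32_t results[COUNT];

/* restrict promises the three arrays never overlap, so the loop can be
 * vectorized without runtime aliasing checks. */
void triangular_add(ptrdiff_t count,
                    const uint32_t *restrict n,
                    const uint32_t *restrict m,
                    uint32_t *restrict out)
{
    for (ptrdiff_t i = 0; i < count; ++i)
        out[i] = (n[i]*n[i] + n[i])/2 + m[i];
}

int main(void)
{
    triangular_add(COUNT, n_vals, m_vals, results);
    return 0;
}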

(Edit: Right-shift of a negative number is implementation-defined, because it might be logical or arithmetic. The code I wrote does unsigned math, which lets the compiler optimize /2 to >>1 for you. In a comment, robthebloke brings up that, if you use signed rather than unsigned variables, and you know that they will always be non-negative, the compiler might not be able to statically deduce this and therefore might not optimize /2 to >>1. In that case, you can either write >>1 or cast (uint32_t)n[i] to do better-defined unsigned math. An unsafe-math optimization flag might also re-enable this.)
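For example, if the inputs are plain signed int but known to be non-negative, either of these (hypothetical helper functions, shown only to illustrate the two options) lets the division become a shift:

#include <stdint.h>

/* Option 1: shift explicitly; safe here only because n*(n+1) is never
 * negative for the stated inputs. */
int32_t tri_shift(int n, int m)
{
    return ((n*n + n) >> 1) + m;
}

/* Option 2: do the arithmetic in an unsigned type, so the compiler is
 * free to turn /2 into >>1 on its own. */
int32_t tri_unsigned(int n, int m)
{
    return (int32_t)(((uint32_t)n*(uint32_t)n + (uint32_t)n)/2u + (uint32_t)m);
}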

This kind of vectorization is likely to be faster than individual table lookups on each element.

With n < 500, the n*(n+1)/2 term alone can be as large as 499*500/2 = 124,750, which is too large for an unsigned short, so the smallest standard type that can hold the result is int32_t or uint32_t. (Or uint_least32_t if you prefer.) Using an array of the smallest type that fits allows maximum vectorization.

If you want to help the optimizer out, you can enable OpenMP and add a #pragma omp simd to tell the compiler explicitly to vectorize this loop. You could also use OpenMP to enable multi-threading.
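For the kind of loop shown in the program further down, that hint looks like this (a sketch; the pragma is ignored unless you compile with -fopenmp, or with -fopenmp-simd on GCC and Clang if you only want the vectorization hints without the OpenMP runtime):

#pragma omp simd
for (ptrdiff_t i = 0; i < N; ++i) {
    a[i] = (n[i]*n[i] + n[i])/2 + m[i];
}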

In C++, you have the options of std::valarray<uint32_t> or expression templates, which are very elegant ways to express an embarrassingly-parallel computation such as this one.

The following program compiles to vectorized code on GCC, Clang or ICC when given the appropriate optimization flags. Clang compiles to a loop that computes 256 elements per iteration.

#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define N (1L<<20)
typedef uint_least32_t elem_t;

const elem_t n[N];
const elem_t m[N];
elem_t a[N];

int main(void)
{
    for (ptrdiff_t i = 0; i < N; ++i) {
        a[i] = (n[i]*n[i] + n[i])/2 + m[i];
    }

    return EXIT_SUCCESS;
}

You can attempt to add alignas specifiers to the arrays, but this will not actually cause GCC, Clang or ICC to perform aligned loads or stores. (There is a GCC extension to enable this optimization.)

If you enable the OpenMP library (-fopenmp in GCC or Clang), you can add the line

#pragma omp parallel for

immediately before the for loop, or a more complex version, and get a loop that is both multithreaded and vectorized. If there’s a way to significantly improve on that with standard, portable C, I’d love to know about it myself.
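One such "more complex version" is the combined construct below (standard OpenMP, shown only as a sketch; whether the extra threads pay off depends on the array sizes and on memory bandwidth):

#pragma omp parallel for simd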

I wrote my MWE to be simple. In real-world code, you likely want to move the entire loop, of which this inner loop is part, out of main() and into a function such as

elem_t* func( const ptrdiff_t nelems,
              const elem_t n[nelems],
              const elem_t m[nelems],
              elem_t a[nelems]
            )
{
    for (ptrdiff_t i = 0; i < nelems; ++i) {
        a[i] = (n[i]*n[i] + n[i])/2 + m[i];
    }

    return a;
}

If you compare the generated assembly, you will see that it is not as efficient unless you inline it, largely because the compiler no longer knows the number of iterations at compile-time or has any information about the alignment of n, m or a.

You could also save some memory, but probably not computation time, by storing the input elements as uint16_t. The input arrays then use half as much memory, but the loop cannot process more elements at a time than before, because the intermediate computation still works on 32-bit elements. Be careful to cast the temporary values used in the calculation to a type that will not overflow!

#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define N (1L<<20)

const uint16_t n[N];
const uint16_t m[N];
uint32_t a[N];

int main(void)
{
    for (ptrdiff_t i = 0; i < N; ++i) {
        a[i] = ((uint32_t)n[i]*n[i] + n[i])/2 + m[i];
    }

    return EXIT_SUCCESS;
}
answered Sep 27 '22 by Davislor