 

Optimize multiply and add

Tags:

c

optimization

I'm using C and I have two non-negative integers n and m (both >= 0, n < 500). I need to compute the expression

n*(n+1)/2 + m

and this will be needed hundreds of millions of times, so I want to optimize it as much as possible. My current implementation is:

inline int func(const int n, const int m) { return ( (n*(n+1) >> 1) + m); }

using inline and the >> 1 to do the divide by 2. Is there any other way to speed up this calculation?

asked Aug 30 '19 by vibe



2 Answers

Given that n will be less than 500, you can precompute all possible values of n*(n+1)/2 and put them in a table, then use that table to perform the computation:

int n_sum[500];

// call this once at program start
void init_sum()
{
    int i;
    for (i=0;i<500;i++) {
        n_sum[i] = i*(i+1)/2;
    }
}

inline int func(const int n, const int m)
{ 
    return n_sum[n] + m;
}
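A minimal usage sketch, assuming the code above lives in the same translation unit (under C99 inline semantics you may want static inline on func, or a reasonable optimization level, so the call resolves):

#include <stdio.h>

int main(void)
{
    init_sum();                      // fill the table once at startup

    long total = 0;
    for (int n = 0; n < 500; n++) {
        for (int m = 0; m < 500; m++) {
            total += func(n, m);     // now just a load and an add
        }
    }

    printf("%ld\n", total);
    return 0;
}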
answered Sep 27 '22 by dbush


In practice, what you want to do is write a loop that the compiler can easily and efficiently vectorize and parallelize. If you have two arrays n[i] and m[i], any modern compiler can probably figure out how to optimize n[i]*(n[i]+1)/2 + m[i] if given the proper flags. Trying to force the compiler to do optimizations on one word at a time will, in general, be counterproductive. Modern hardware is fastest when you parallelize your critical loops. If you don’t want to use non-portable intrinsics or libraries designed for the purpose, you can best achieve that by minimizing data dependencies and writing code that is easy to statically analyze.

You might or might not be able to improve the generated code with (n*n + n)/2 + m, that is, expanding n*(n+1) into n*n + n. This can help because it lets the code generator keep just one vector register as the accumulator, maximizing the number of registers available for SIMD. You should use restrict and alignas as appropriate to enable maximum optimization; a sketch follows.
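As a sketch of what that might look like (the 64-byte alignment, the array size and the names here are illustrative choices, not anything from the question):

#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define COUNT (1L << 20)

/* Over-aligning the arrays gives the compiler the option of using
 * aligned vector loads and stores. */
alignas(64) uint32_t n_vals[COUNT];
alignas(64) uint32_t m_vals[COUNT];
alignas(64) uint32_t results[COUNT];

/* restrict promises the three arrays never overlap, so the loop can be
 * vectorized without runtime aliasing checks. */
void triangular_add(ptrdiff_t count,
                    const uint32_t *restrict n,
                    const uint32_t *restrict m,
                    uint32_t *restrict out)
{
    for (ptrdiff_t i = 0; i < count; ++i)
        out[i] = (n[i]*n[i] + n[i])/2 + m[i];
}

int main(void)
{
    triangular_add(COUNT, n_vals, m_vals, results);
    return 0;
}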

(Edit: Right-shift of a negative number is implementation-defined, because it might be logical or arithmetic. The code I wrote does unsigned math, which lets the compiler optimize /2 to >>1 for you. In a comment, robthebloke brings up that, if you use signed rather than unsigned variables, and you know that they will always be non-negative, the compiler might not be able to statically deduce this and therefore might not optimize /2 to >>1. In that case, you can either write >>1 or cast (uint32_t)n[i] to do better-defined unsigned math. An unsafe-math optimization flag might also re-enable this.)
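For example, if the inputs are plain signed int but known to be non-negative, either of these (hypothetical helper functions, shown only to illustrate the two options) lets the division become a shift:

#include <stdint.h>

/* Option 1: shift explicitly; safe here only because n*(n+1) is never
 * negative for the stated inputs. */
int32_t tri_shift(int n, int m)
{
    return ((n*n + n) >> 1) + m;
}

/* Option 2: do the arithmetic in an unsigned type, so the compiler is
 * free to turn /2 into >>1 on its own. */
int32_t tri_unsigned(int n, int m)
{
    return (int32_t)(((uint32_t)n*(uint32_t)n + (uint32_t)n)/2u + (uint32_t)m);
}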

This kind of vectorization is likely to be faster than individual table lookups on each element.

With n < 500, the n*(n+1)/2 term alone can be as large as 499*500/2 = 124,750, which is too large for an unsigned short, so the smallest standard type that can hold the result is int32_t or uint32_t. (Or uint_least32_t if you prefer.) Using an array of the smallest type that fits allows maximum vectorization.

If you want to help the optimizer out, you can enable OpenMP and add a #pragma omp simd to tell the compiler explicitly to vectorize this loop. You could also use OpenMP to enable multi-threading.
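For the kind of loop shown in the program further down, that hint looks like this (a sketch; the pragma is ignored unless you compile with -fopenmp, or with -fopenmp-simd on GCC and Clang if you only want the vectorization hints without the OpenMP runtime):

#pragma omp simd
for (ptrdiff_t i = 0; i < N; ++i) {
    a[i] = (n[i]*n[i] + n[i])/2 + m[i];
}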

In C++, you have the options of std::valarray<uint32_t> or expression templates, which are very elegant ways to express an embarrassingly-parallel computation such as this one.

The following program compiles to vectorized code on GCC, Clang or ICC when given the appropriate optimization flags. Clang compiles to a loop that computes 256 elements per iteration.

#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define N (1L<<20)
typedef uint_least32_t elem_t;

const elem_t n[N];
const elem_t m[N];
elem_t a[N];

int main(void)
{
    for (ptrdiff_t i = 0; i < N; ++i) {
        a[i] = (n[i]*n[i] + n[i])/2 + m[i];
    }

    return EXIT_SUCCESS;
}

You can attempt to add alignas specifiers to the arrays, but this will not actually cause GCC, Clang or ICC to perform aligned loads or stores. (There is a GCC extension to enable this optimization.)

If you enable the OpenMP library (-fopenmp in GCC or Clang), you can add the line

#pragma omp parallel for

immediately before the for loop, or a more complex version, and get a loop that is both multithreaded and vectorized. If there’s a way to significantly improve on that with standard, portable C, I’d love to know about it myself.
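One such "more complex version" is the combined construct below (standard OpenMP, shown only as a sketch; whether the extra threads pay off depends on the array sizes and on memory bandwidth):

#pragma omp parallel for simd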

I wrote my MWE to be simple. In real-world code, you likely want to move the entire loop, of which this inner loop is part, out of main() and into a function such as

elem_t* func( const ptrdiff_t nelems,
              const elem_t n[nelems],
              const elem_t m[nelems],
              elem_t a[nelems]
            )
{
    for (ptrdiff_t i = 0; i < nelems; ++i) {
        a[i] = (n[i]*n[i] + n[i])/2 + m[i];
    }

    return a;
}

If you compare the generated assembly, you will see that it is not as efficient unless you inline it, largely because the compiler no longer knows the number of iterations at compile-time or has any information about the alignment of n, m or a.

You could also save some memory, but probably not computation time, by storing the input elements as uint16_t. The input arrays then use half as much memory, but the loop cannot process more elements at a time than before, because the intermediate computation still works on 32-bit elements. Be careful to cast the temporary values used in the calculation to a type that will not overflow!

#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define N (1L<<20)

const uint16_t n[N];
const uint16_t m[N];
uint32_t a[N];

int main(void)
{
    for (ptrdiff_t i = 0; i < N; ++i) {
        a[i] = ((uint32_t)n[i]*n[i] + n[i])/2 + m[i];
    }

    return EXIT_SUCCESS;
}
answered Sep 27 '22 by Davislor