C++ Adding 2 arrays together quickly

Tags: c++, performance, arrays, micro-optimization, simd

Given the arrays:

int canvas[10][10];
int addon[10][10];

Where all the values range from 0 - 100, what is the fastest way in C++ to add those two arrays so each cell in canvas equals itself plus the corresponding cell value in addon?

That is, I want to achieve something like:

canvas += addon;

So if canvas[0][0] = 3 and addon[0][0] = 2, then afterwards canvas[0][0] = 5.

Speed is essential here as I am writing a very simple program to brute force a knapsack type problem and there will be tens of millions of combinations.

And as a small extra question (thanks if you can help!) what would be the fastest way of checking if any of the values in canvas exceed 100? Loops are slow!

asked Jun 02 '10 16:06

Tom Gullen

2 Answers

Here is an SSE4 implementation that should perform pretty well on Nehalem (Core i7):

#include <limits.h>
#include <emmintrin.h>  // SSE2
#include <smmintrin.h>  // SSE4.1

// Adds addon to canvas in place; returns 1 if any resulting cell exceeds 100.
static inline int canvas_add(int canvas[10][10], int addon[10][10])
{
    __m128i * cp = (__m128i *)&canvas[0][0];
    const __m128i * ap = (const __m128i *)&addon[0][0];
    const __m128i vlimit = _mm_set1_epi32(100);
    __m128i vmax = _mm_set1_epi32(INT_MIN);
    __m128i vcmp;
    int cmp;
    int i;

    // 100 ints, 4 per vector: 25 iterations with no remainder
    for (i = 0; i < 10 * 10; i += 4)
    {
        __m128i vc = _mm_loadu_si128(cp);
        __m128i va = _mm_loadu_si128(ap);

        vc = _mm_add_epi32(vc, va);
        vmax = _mm_max_epi32(vmax, vc);   // SSE4 *

        _mm_storeu_si128(cp, vc);

        cp++;
        ap++;
    }
    vcmp = _mm_cmpgt_epi32(vmax, vlimit); // SSE2
    cmp = _mm_testz_si128(vcmp, vcmp);    // SSE4 * (1 if no lane exceeded the limit)
    return cmp == 0;
}

Compile with gcc -msse4.1 ... or equivalent for your particular development environment.

For older CPUs without SSE4 (and with much more expensive misaligned loads/stores) you'll need to (a) use a suitable combination of SSE2/SSE3 intrinsics to replace the SSE4 operations (marked with an * above) and ideally (b) make sure your data is 16-byte aligned and use aligned loads/stores (_mm_load_si128/_mm_store_si128) in place of _mm_loadu_si128/_mm_storeu_si128.
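
As a rough sketch of what such a fallback might look like, the variant below uses only SSE2 intrinsics and aligned loads/stores. The name canvas_add_sse2 is hypothetical, and the code assumes both arrays are 16-byte aligned (for example, declared with alignas(16)). Instead of tracking a running maximum with _mm_max_epi32, it ORs together the per-lane comparison masks and tests the result with _mm_movemask_epi8, which is a different technique from the SSE4 version but gives the same yes/no answer.

```cpp
#include <emmintrin.h>  // SSE2 only

// Hypothetical SSE2-only variant of canvas_add.
// Assumption: canvas and addon are both 16-byte aligned.
// Returns nonzero if any resulting cell exceeds 100.
static inline int canvas_add_sse2(int canvas[10][10], int addon[10][10])
{
    __m128i *cp = reinterpret_cast<__m128i *>(&canvas[0][0]);
    const __m128i *ap = reinterpret_cast<const __m128i *>(&addon[0][0]);
    const __m128i vlimit = _mm_set1_epi32(100);
    __m128i vover = _mm_setzero_si128();

    for (int i = 0; i < 10 * 10; i += 4)
    {
        // Aligned loads: these fault if the data is not 16-byte aligned
        __m128i vc = _mm_add_epi32(_mm_load_si128(cp), _mm_load_si128(ap));

        // Accumulate the "greater than 100" lane masks across iterations
        vover = _mm_or_si128(vover, _mm_cmpgt_epi32(vc, vlimit));

        _mm_store_si128(cp, vc);
        ++cp;
        ++ap;
    }
    // Any set byte in the accumulated mask means some cell exceeded the limit
    return _mm_movemask_epi8(vover) != 0;
}
```

A caller would declare the arrays as, say, alignas(16) int canvas[10][10]; so that the aligned loads and stores are valid.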

answered Oct 06 '22 00:10

Paul R

You can't do anything faster than loops in just C++. You would need to use some platform-specific vector instructions; that is, you would need to drop down to compiler intrinsics or assembly language. However, there are some C++ libraries that try to do this for you, so you can write at a high level and have the library take care of the low-level SIMD work that is appropriate for whatever architecture you are targeting with your compiler.

MacSTL is a library that you might want to look at. It was originally a Macintosh-specific library, but it is cross-platform now. See their home page for more info.
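
For comparison, a plain portable loop might look like the sketch below. The function name add_and_check is hypothetical. The overflow check is folded into the same pass and written without a branch. Many optimizing compilers can auto-vectorize loops of this shape at higher optimization levels, but whether yours does is an assumption worth checking in the generated assembly.

```cpp
#include <cstddef>

// Hypothetical helper: adds addon into canvas in place and reports
// whether any resulting cell exceeds 100.
inline bool add_and_check(int (&canvas)[10][10], const int (&addon)[10][10])
{
    int over = 0;
    for (std::size_t r = 0; r < 10; ++r)
    {
        for (std::size_t c = 0; c < 10; ++c)
        {
            canvas[r][c] += addon[r][c];
            // Branch-free accumulation of the "exceeds 100" condition
            over |= (canvas[r][c] > 100);
        }
    }
    return over != 0;
}
```

Measuring this against a hand-written SIMD version on the target machine is the only reliable way to tell which is faster for a 10x10 array.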

answered Oct 06 '22 01:10

A. Levy
