SIMD the following code

Tags: c, x86, simd, sse

How do I SIMDize the following code in C (using SIMD intrinsics, of course)? I am having trouble understanding SIMD intrinsics, and this would help a lot:

int sum_naive( int n, int *a )
{
    int sum = 0;
    for( int i = 0; i < n; i++ )
        sum += a[i];
    return sum;
}
user1585869 asked Aug 08 '12 20:08



1 Answer

Here's a fairly straightforward implementation (warning: untested code):

#include <emmintrin.h>  // SSE2 intrinsics
#include <stdint.h>

int32_t sum_array(const int32_t a[], const int n)
{
    __m128i vsum = _mm_set1_epi32(0);       // initialise vector of four partial 32 bit sums
    int32_t sum;
    int i;

    for (i = 0; i < n; i += 4)
    {
        // load vector of 4 x 32 bit values (aligned load; the intrinsic takes a __m128i pointer)
        __m128i v = _mm_load_si128((const __m128i *)&a[i]);
        vsum = _mm_add_epi32(vsum, v);      // accumulate to 32 bit partial sum vector
    }
    // horizontal add of four 32 bit partial sums and return result
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 8));
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 4));
    sum = _mm_cvtsi128_si32(vsum);
    return sum;
}

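Note that the input array, a[], needs to be 16-byte aligned (a requirement of _mm_load_si128), and n should be a multiple of 4, because the loop has no scalar tail and would otherwise read past the end of the array.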
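If the alignment and length restrictions noted above are inconvenient, one possible variant is sketched below. It is not part of the original answer: the name sum_array_any is hypothetical, and the sketch assumes an x86 target with SSE2. It swaps the aligned load for the unaligned _mm_loadu_si128 and adds a scalar loop for the last n % 4 elements.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical variant: no alignment requirement, n may be any non-negative value. */
int32_t sum_array_any(const int32_t *a, int n)
{
    __m128i vsum = _mm_setzero_si128();  /* four partial 32-bit sums, all zero */
    int32_t sum;
    int i = 0;

    /* Vector loop: only runs while a full group of 4 elements remains. */
    for (; i + 4 <= n; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)&a[i]);  /* unaligned load */
        vsum = _mm_add_epi32(vsum, v);
    }

    /* Horizontal add of the four partial sums. */
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 8));
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 4));
    sum = _mm_cvtsi128_si32(vsum);

    /* Scalar tail: the remaining 0 to 3 elements. */
    for (; i < n; i++)
        sum += a[i];

    return sum;
}
```

On older CPUs the unaligned load may be slower than the aligned one, so the original version can still be preferable when the caller can guarantee 16-byte alignment and a length that is a multiple of 4.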

Paul R answered Sep 25 '22 02:09