parallel prefix (cumulative) sum with SSE

Tags:

c

sum

sse

openmp

I'm looking for some advice on how to do a parallel prefix sum with SSE. I'm interested in doing this on an array of ints, floats, or doubles.

I have come up with two solutions. A special case and a general case. In both cases the solution runs over the array in two passes in parallel with OpenMP. For the special case I use SSE on both passes. For the general case I use it only on the second pass.

My main question is how I can use SSE in the first pass of the general case. The following link, simd-prefix-sum-on-intel-cpu, shows an improvement for bytes but not for 32-bit data types.

The reason the special case is called special is that it requires the array to be in a special format. For example, let's assume there were only 16 elements in an array a of floats. Then if the array was rearranged like this (array of structs to struct of arrays):

a[0] a[1] ...a[15] -> a[0] a[4] a[8] a[12] a[1] a[5] a[9] a[13]...a[3] a[7] a[11] a[15]

SSE vertical sums could be used on both passes. However, this would only be efficient if the array was already in the special format and the output could be used in the special format. Otherwise, an expensive rearrangement would have to be done on both the input and the output, which would make it much slower than the general case.

Maybe I should consider a different algorithm for the prefix sum (e.g. a binary tree)?
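As a side illustration (not part of the question itself), the rearrangement into the special format is just a 4-lane transpose and can be sketched in plain C; `rearrange4` is a hypothetical helper name:

```c
/* Hypothetical helper: rearrange an n-element array into the 4-lane
 * interleaved layout a[0] a[4] a[8] a[12] a[1] a[5] ... described above,
 * so that 4-wide vertical SIMD sums scan all 4 lanes at once.
 * Assumes n is a multiple of 4. */
void rearrange4(const float *a, float *b, int n) {
    const int lanes = 4;
    const int per_lane = n / lanes;
    for (int lane = 0; lane < lanes; lane++)
        for (int j = 0; j < per_lane; j++)
            b[lane * per_lane + j] = a[j * lanes + lane];
}
```

The inverse rearrangement on the output costs the same, which is why the question notes the special case only pays off when the data already lives in this layout.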

Code for the general case:

#include <omp.h>
#include <emmintrin.h>

// Two-pass parallel prefix sum. Assumes n is a multiple of 4 and that
// both schedule(static) loops assign each thread the same element range,
// so that suma[ithread] is the correct carry for that thread's chunk.
void prefix_sum_omp_sse(double a[], double s[], int n) {
    double *suma;
    #pragma omp parallel
    {
        const int ithread = omp_get_thread_num();
        const int nthreads = omp_get_num_threads();
        #pragma omp single
        {
            suma = new double[nthreads + 1];
            suma[0] = 0;
        }
        double sum = 0;
        #pragma omp for schedule(static) nowait // first parallel pass
        for (int i = 0; i < n; i++) {
            sum += a[i];
            s[i] = sum;
        }
        suma[ithread + 1] = sum;
        #pragma omp barrier
        #pragma omp single
        {
            // serial scan of the per-thread totals to get per-thread offsets
            double tmp = 0;
            for (int i = 0; i < (nthreads + 1); i++) {
                tmp += suma[i];
                suma[i] = tmp;
            }
        }
        __m128d offset = _mm_set1_pd(suma[ithread]);
        #pragma omp for schedule(static) // second parallel pass with SSE as well
        for (int i = 0; i < n/4; i++) {
            __m128d tmp1 = _mm_load_pd(&s[4*i]);
            tmp1 = _mm_add_pd(tmp1, offset);
            __m128d tmp2 = _mm_load_pd(&s[4*i+2]);
            tmp2 = _mm_add_pd(tmp2, offset);
            _mm_store_pd(&s[4*i], tmp1);
            _mm_store_pd(&s[4*i+2], tmp2);
        }
    }
    delete[] suma;
}
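For reference, the two-pass partial-sum scheme that this function parallelizes can be sketched without OpenMP or SSE; each "chunk" below plays the role of one thread, and the names are illustrative only:

```c
/* Sketch of the two-pass prefix sum. Pass 1 scans each chunk
 * independently and records its total; a small serial scan turns the
 * totals into per-chunk offsets; pass 2 adds each chunk's offset to
 * its elements. Assumes nchunks <= 64. */
void prefix_sum_chunked(const double *a, double *s, int n, int nchunks) {
    double suma[64 + 1];
    suma[0] = 0;
    int chunk = (n + nchunks - 1) / nchunks;
    for (int c = 0; c < nchunks; c++) {             /* pass 1 */
        double sum = 0;
        int lo = c * chunk, hi = lo + chunk < n ? lo + chunk : n;
        for (int i = lo; i < hi; i++) { sum += a[i]; s[i] = sum; }
        suma[c + 1] = sum;
    }
    double tmp = 0;                                 /* serial scan of totals */
    for (int c = 0; c <= nchunks; c++) { tmp += suma[c]; suma[c] = tmp; }
    for (int c = 0; c < nchunks; c++) {             /* pass 2 */
        int lo = c * chunk, hi = lo + chunk < n ? lo + chunk : n;
        for (int i = lo; i < hi; i++) s[i] += suma[c];
    }
}
```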
asked Oct 21 '13 by Z boson
1 Answer

This is the first time I'm answering my own question, but it seems appropriate. Based on hirschhornsalz's answer for a prefix sum on 16 bytes (simd-prefix-sum-on-intel-cpu), I have come up with a solution for using SIMD in the first pass for 4, 8, and 16 32-bit words.

The general theory goes as follows. A sequential scan of n words takes n additions (n-1 to scan the n words themselves, plus one more to carry the sum from the previous set of words scanned). However, using SIMD, n words can be scanned in log2(n) additions and an equal number of shifts, plus one more addition and a broadcast to carry from the previous SIMD scan. So for some value of n the SIMD method will win.
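The log2(n)-step idea above can be modeled in scalar C for n = 4: each step adds a copy of the vector shifted up by a power-of-two number of lanes, with zeros shifting in at the low end, exactly as `_mm_slli_si128` does in the SSE version below.

```c
/* Scalar model of the log2(n)-step inclusive scan for n = 4 words:
 * step 1 adds a copy shifted by 1 lane, step 2 adds a copy shifted by
 * 2 lanes; zeros shift in at the low end. */
void scan4_logstep(float x[4]) {
    for (int shift = 1; shift < 4; shift *= 2) {
        float t[4];
        for (int i = 0; i < 4; i++)
            t[i] = (i >= shift) ? x[i - shift] : 0.0f;  /* shifted copy */
        for (int i = 0; i < 4; i++)
            x[i] += t[i];
    }
}
```

Starting from {1, 2, 3, 4}, the first step gives the pairwise partials and the second step completes the inclusive scan {1, 3, 6, 10}.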

Let's look at 32-bit words with SSE, AVX, and AVX-512:

4 32-bit words (SSE):      2 shifts, 3 adds, 1 broadcast       sequential: 4 adds
8 32-bit words (AVX):      3 shifts, 4 adds, 1 broadcast       sequential: 8 adds
16 32-bit words (AVX-512): 4 shifts, 5 adds, 1 broadcast       sequential: 16 adds

Based on that it appears SIMD won't be useful for a scan for 32-bit words until AVX-512. This also assumes that the shifts and broadcast can be done in only 1 instruction. This is true for SSE but not for AVX and maybe not even for AVX2.

In any case I put together some working and tested code which does a prefix sum using SSE.

inline __m128 scan_SSE(__m128 x) {
    x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4))); 
    x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8)));
    return x;
}

void prefix_sum_SSE(float *a, float *s, const int n) {
    __m128 offset = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4) {    // assumes n is a multiple of 4
        __m128 x = _mm_load_ps(&a[i]);  // a and s must be 16-byte aligned
        __m128 out = scan_SSE(x);
        out = _mm_add_ps(out, offset);
        _mm_store_ps(&s[i], out);
        // broadcast the last element as the carry for the next iteration
        offset = _mm_shuffle_ps(out, out, _MM_SHUFFLE(3, 3, 3, 3));
    }
}

Notice that the scan_SSE function has two additions (_mm_add_ps) and two shifts (_mm_slli_si128). The casts are only used to make the compiler happy and don't get converted to instructions. Then inside the main loop over the array in prefix_sum_SSE another addition and one shuffle is used. That's 6 operations total compared to only 4 additions with the sequential sum.

Here is a working solution for AVX:

inline __m256 scan_AVX(__m256 x) {
    __m256 t0, t1;
    //shift1_AVX + add
    t0 = _mm256_permute_ps(x, _MM_SHUFFLE(2, 1, 0, 3));
    t1 = _mm256_permute2f128_ps(t0, t0, 41);
    x = _mm256_add_ps(x, _mm256_blend_ps(t0, t1, 0x11));
    //shift2_AVX + add
    t0 = _mm256_permute_ps(x, _MM_SHUFFLE(1, 0, 3, 2));
    t1 = _mm256_permute2f128_ps(t0, t0, 41);
    x = _mm256_add_ps(x, _mm256_blend_ps(t0, t1, 0x33));
    //shift3_AVX + add
    x = _mm256_add_ps(x,_mm256_permute2f128_ps(x, x, 41));
    return x;
}

void prefix_sum_AVX(float *a, float *s, const int n) {
    __m256 offset = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        __m256 x = _mm256_loadu_ps(&a[i]);
        __m256 out = scan_AVX(x);
        out = _mm256_add_ps(out, offset);
        _mm256_storeu_ps(&s[i], out);
        //broadcast last element
        __m256 t0 = _mm256_permute2f128_ps(out, out, 0x11);
        offset = _mm256_permute_ps(t0, 0xff);
    }   
}

The three shifts need 7 intrinsics. The broadcast needs 2 intrinsics. So with the 4 additions that's 13 intrinsics. For AVX2 only 5 intrinsics are needed for the shifts, so 11 intrinsics total. The sequential sum only needs 8 additions. Therefore it's likely that neither AVX nor AVX2 will be useful for the first pass.

Edit:

So I finally benchmarked this and the results are unexpected. The SSE and AVX code are both about twice as fast as the following sequential code:

void scan(float a[], float s[], int n) {
    float sum = 0;
    for (int i = 0; i<n; i++) {
        sum += a[i];
        s[i] = sum;
    }
}

I guess this is due to instruction-level parallelism.

So that answers my own question. I succeeded in using SIMD for pass 1 in the general case. When I combine this with OpenMP on my 4-core Ivy Bridge system, the total speedup is about seven for 512k floats.

answered Sep 30 '22 by Z boson