
SIMD prefix sum on Intel CPU


I need to implement a prefix sum algorithm, and it needs to be as fast as possible.
Ex:

[3, 1,  7,  0,  4,  1,  6,  3] 

should give:

[3, 4, 11, 11, 15, 16, 22, 25] 

Is there a way to do this using SSE SIMD CPU instructions?

My first idea is to sum each pair in parallel recursively until all sums have been computed, as below:

// in parallel do
for (int i = 0; i < z.length; i++) {
    z[i] = x[i << 1] + x[(i << 1) + 1];
}

To make the algorithm a little clearer: z is not the final output, but is instead used to compute the output.

int[] w = computePrefixSum(z);
for (int i = 1; i < output.length; i++) {
    output[i] = (i % 2 == 0) ? (x[i] + output[i - 1]) : w[(i - 1) >> 1];
}
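Written out as plain scalar C++ (just a sketch of the idea above, assuming the length is a power of two), the whole scheme would be something like:

#include <vector>

// Sketch of the pairwise scheme: halve the array with pairwise sums,
// recursively scan the half-sized array, then expand back to full length.
std::vector<int> computePrefixSum(const std::vector<int> &x) {
    if (x.size() == 1) return x;                  // prefix sum of a single element is itself
    std::vector<int> z(x.size() / 2);
    for (size_t i = 0; i < z.size(); i++)         // the pairwise sums (the loop I would like to vectorize)
        z[i] = x[2 * i] + x[2 * i + 1];
    std::vector<int> w = computePrefixSum(z);     // prefix sums of the pair sums
    std::vector<int> output(x.size());
    output[0] = x[0];
    for (size_t i = 1; i < x.size(); i++)
        output[i] = (i % 2 == 0) ? (x[i] + output[i - 1]) : w[(i - 1) / 2];
    return output;
}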
asked May 14 '12 by skyde



2 Answers

The fastest parallel prefix sum algorithm I know of runs over the data in two passes in parallel and uses SSE in the second pass as well.

In the first pass you calculate partial prefix sums in parallel (one block per thread) and store each block's total. In the second pass you add the accumulated total of the preceding blocks to every element of the next block. You can run both passes in parallel using multiple threads (e.g. with OpenMP). In the second pass you can also use SIMD, since a constant value is being added to each element of a block.
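In outline, ignoring SIMD for a moment, the two passes look roughly like this (a simplified sketch, not the tuned code below):

#include <omp.h>
#include <vector>

// Simplified two-pass scan: pass 1 does a local prefix sum per thread block
// and records the block total; pass 2 adds the totals of all preceding blocks.
void scan_two_pass(const float *a, float *s, int n) {
    const int nthreads = omp_get_max_threads();
    std::vector<float> total(nthreads + 1, 0.0f);   // total[t+1] = sum of thread t's block
    #pragma omp parallel num_threads(nthreads)
    {
        const int t  = omp_get_thread_num();
        const int lo = (int)((long long)n * t / nthreads);
        const int hi = (int)((long long)n * (t + 1) / nthreads);
        float sum = 0.0f;
        for (int i = lo; i < hi; i++) {             // pass 1: local inclusive prefix sum
            sum += a[i];
            s[i] = sum;
        }
        total[t + 1] = sum;
        #pragma omp barrier
        #pragma omp single
        for (int k = 1; k <= nthreads; k++)         // running sum of the block totals
            total[k] += total[k - 1];
        const float offset = total[t];              // total of all blocks before this one
        for (int i = lo; i < hi; i++)               // pass 2: shift this block by that total
            s[i] += offset;
    }
}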

Assuming n elements of an array, m cores, and a SIMD width of w the time cost should be

n/m + n/(m*w) = (n/m)*(1+1/w) 

Since the first pass does not use SIMD, the time cost will always be greater than n/m.

For example, for four cores with a SIMD width of 4 (four 32-bit floats with SSE) the cost would be 5n/16, or about 3.2 times faster than sequential code, which has a time cost of n. With hyper-threading the speedup will be greater still.

In special cases it's possible to use SIMD on the first pass as well. Then the time cost is simply

2*n/(m*w) 

I posted the code for the general case (which uses OpenMP for the threading and intrinsics for the SSE code) and discuss details of the special case at the following link: parallel-prefix-cumulative-sum-with-sse

Edit: I managed to find a SIMD version for the first pass which is about twice as fast as sequential code. Now I get a total speedup of about 7 on my four-core Ivy Bridge system.

Edit: For larger arrays one problem is that after the first pass most values have been evicted from the cache. I came up with a solution which runs in parallel inside a chunk but runs each chunk serially. The chunk_size is a value that should be tuned. For example I set it to 1MB = 256K floats. Now the second pass is done while the values are still inside the level-2 cache. Doing this gives a big improvement for large arrays.

Here is the code for SSE. The AVX code is about the same speed so I did not post it here. The function which does the prefix sum is scan_omp_SSEp2_SSEp1_chunk. Pass it an array a of floats and it fills the array s with the cumulative sum.

#include <immintrin.h>
#include <omp.h>

__m128 scan_SSE(__m128 x) {
    // in-register inclusive scan of 4 floats
    x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4)));
    x = _mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40));
    return x;
}

float pass1_SSE(float *a, float *s, const int n) {
    // each thread scans its share of the vectors, carrying its running total
    // in 'offset'; the return value is this thread's total
    __m128 offset = _mm_setzero_ps();
    #pragma omp for schedule(static) nowait
    for (int i = 0; i < n / 4; i++) {
        __m128 x = _mm_load_ps(&a[4 * i]);
        __m128 out = scan_SSE(x);
        out = _mm_add_ps(out, offset);
        _mm_store_ps(&s[4 * i], out);
        offset = _mm_shuffle_ps(out, out, _MM_SHUFFLE(3, 3, 3, 3));
    }
    float tmp[4];
    _mm_store_ps(tmp, offset);
    return tmp[3];
}

void pass2_SSE(float *s, __m128 offset, const int n) {
    #pragma omp for schedule(static)
    for (int i = 0; i < n / 4; i++) {
        __m128 tmp1 = _mm_load_ps(&s[4 * i]);
        tmp1 = _mm_add_ps(tmp1, offset);
        _mm_store_ps(&s[4 * i], tmp1);
    }
}

void scan_omp_SSEp2_SSEp1_chunk(float a[], float s[], int n) {
    float *suma;
    const int chunk_size = 1 << 18;
    const int nchunks = n % chunk_size == 0 ? n / chunk_size : n / chunk_size + 1;
    //printf("nchunks %d\n", nchunks);
    #pragma omp parallel
    {
        const int ithread = omp_get_thread_num();
        const int nthreads = omp_get_num_threads();

        #pragma omp single
        {
            suma = new float[nthreads + 1];
            suma[0] = 0;
        }

        float offset2 = 0.0f;
        for (int c = 0; c < nchunks; c++) {
            const int start = c * chunk_size;
            const int chunk = (c + 1) * chunk_size < n ? chunk_size : n - c * chunk_size;
            suma[ithread + 1] = pass1_SSE(&a[start], &s[start], chunk);
            #pragma omp barrier
            #pragma omp single
            {
                // prefix sum of the per-thread totals: suma[t] becomes the
                // sum of the totals of all threads before thread t
                float tmp = 0;
                for (int i = 0; i < (nthreads + 1); i++) {
                    tmp += suma[i];
                    suma[i] = tmp;
                }
            }
            __m128 offset = _mm_set1_ps(suma[ithread] + offset2);
            pass2_SSE(&s[start], offset, chunk);
            #pragma omp barrier
            offset2 = s[start + chunk - 1];   // carry the total of all previous chunks forward
        }
    }
    delete[] suma;
}
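For reference, a minimal test driver might look like this (not part of the original code; it assumes the functions above are in the same file, that n is a multiple of 4, and that the buffers are 16-byte aligned since the passes use aligned loads/stores):

#include <cstdio>

int main() {
    const int n = 1 << 22;                                  // 4M floats, a multiple of chunk_size
    float *a = (float*)_mm_malloc(n * sizeof(float), 16);   // 16-byte aligned buffers
    float *s = (float*)_mm_malloc(n * sizeof(float), 16);
    for (int i = 0; i < n; i++) a[i] = 1.0f;                // prefix sum of all ones: s[i] == i + 1
    scan_omp_SSEp2_SSEp1_chunk(a, s, n);
    printf("%.0f %.0f\n", s[0], s[n - 1]);                  // expect "1 4194304"
    _mm_free(a);
    _mm_free(s);
    return 0;
}

Compile with something like g++ -O3 -fopenmp.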
answered Nov 08 '22 by Z boson



You can exploit some minor parallelism for large register lengths and small sums. For instance, adding up 16 values of 1 byte (which happen to fit into one SSE register) requires only log2(16) = 4 additions and an equal number of shifts.
Not much, but faster than 15 dependent additions and the additional memory accesses.

__m128i x = _mm_set_epi8(3,1,7,0,4,1,6,3,3,1,7,0,4,1,6,3);
x = _mm_add_epi8(x, _mm_srli_si128(x, 1));
x = _mm_add_epi8(x, _mm_srli_si128(x, 2));
x = _mm_add_epi8(x, _mm_srli_si128(x, 4));
x = _mm_add_epi8(x, _mm_srli_si128(x, 8));

// x == 3, 4, 11, 11, 15, 16, 22, 25, 28, 29, 36, 36, 40, 41, 47, 50

If you have longer sums, the dependencies could be hidden by exploiting instruction level parallelism and taking advantage of instruction reordering.

Edit: something like

__m128i x0 = _mm_set_epi8(3,1,7,0,4,1,6,3,3,1,7,0,4,1,6,3);
__m128i x1 = _mm_set_epi8(3,1,7,0,4,1,6,3,3,1,7,0,4,1,6,3);
__m128i x2 = _mm_set_epi8(3,1,7,0,4,1,6,3,3,1,7,0,4,1,6,3);
__m128i x3 = _mm_set_epi8(3,1,7,0,4,1,6,3,3,1,7,0,4,1,6,3);

// all-zero shuffle mask: _mm_shuffle_epi8 broadcasts byte 0, which holds
// a register's running total after the shift-adds below
__m128i mask = _mm_set_epi8(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0);

x0 = _mm_add_epi8(x0, _mm_srli_si128(x0, 1));
x1 = _mm_add_epi8(x1, _mm_srli_si128(x1, 1));
x2 = _mm_add_epi8(x2, _mm_srli_si128(x2, 1));
x3 = _mm_add_epi8(x3, _mm_srli_si128(x3, 1));

x0 = _mm_add_epi8(x0, _mm_srli_si128(x0, 2));
x1 = _mm_add_epi8(x1, _mm_srli_si128(x1, 2));
x2 = _mm_add_epi8(x2, _mm_srli_si128(x2, 2));
x3 = _mm_add_epi8(x3, _mm_srli_si128(x3, 2));

x0 = _mm_add_epi8(x0, _mm_srli_si128(x0, 4));
x1 = _mm_add_epi8(x1, _mm_srli_si128(x1, 4));
x2 = _mm_add_epi8(x2, _mm_srli_si128(x2, 4));
x3 = _mm_add_epi8(x3, _mm_srli_si128(x3, 4));

x0 = _mm_add_epi8(x0, _mm_srli_si128(x0, 8));
x1 = _mm_add_epi8(x1, _mm_srli_si128(x1, 8));
x2 = _mm_add_epi8(x2, _mm_srli_si128(x2, 8));
x3 = _mm_add_epi8(x3, _mm_srli_si128(x3, 8));

// carry each register's total into the next register
x1 = _mm_add_epi8(_mm_shuffle_epi8(x0, mask), x1);
x2 = _mm_add_epi8(_mm_shuffle_epi8(x1, mask), x2);
x3 = _mm_add_epi8(_mm_shuffle_epi8(x2, mask), x3);
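To check the result you can store the four registers back to memory and print the bytes, highest index first, so they come out in the same order as the _mm_set_epi8 arguments (just a verification snippet continuing the code above):

#include <stdio.h>

unsigned char out[64];
_mm_storeu_si128((__m128i*)(out +  0), x0);
_mm_storeu_si128((__m128i*)(out + 16), x1);
_mm_storeu_si128((__m128i*)(out + 32), x2);
_mm_storeu_si128((__m128i*)(out + 48), x3);

for (int r = 0; r < 4; r++)              // registers in order x0, x1, x2, x3
    for (int j = 15; j >= 0; j--)        // bytes from highest to lowest
        printf("%d ", out[16 * r + j]);
printf("\n");                            // 3 4 11 11 ... the running sum continues across registers, ending at 200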
answered Nov 08 '22 by Gunther Piez
