 

How to efficiently add two vectors in C++

Tags: c++, x86, simd, sse, sse2

Suppose I have two vectors a and b, each stored as a std::vector. I want to compute a += b or a += b * k, where k is a number.

I can for sure do the following:

while (size--) {
    (*a++) += (*b++) * k;
}

But what are the possible ways to easily leverage SIMD instructions such as SSE2?
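
For reference, one direct route is to write the SIMD loop by hand with SSE intrinsics. The sketch below is only illustrative and is not from the original question (the helper name axpy_sse and the float element type are assumptions); the answers below show that a compiler can usually generate equivalent code on its own.

#include <immintrin.h>   // SSE intrinsics (_mm_*_ps)
#include <cstddef>

// Sketch: a[i] += b[i] * k, four floats at a time, with unaligned loads/stores.
void axpy_sse(float *a, const float *b, float k, std::size_t size) {
    const __m128 vk = _mm_set1_ps(k);        // broadcast k into all four lanes
    std::size_t i = 0;
    for (; i + 4 <= size; i += 4) {
        __m128 vb = _mm_loadu_ps(b + i);     // load 4 floats from b
        __m128 va = _mm_loadu_ps(a + i);     // load 4 floats from a
        _mm_storeu_ps(a + i, _mm_add_ps(va, _mm_mul_ps(vb, vk)));
    }
    for (; i < size; ++i)                    // scalar tail for the remainder
        a[i] += b[i] * k;
}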

Asked by Nan Hua

2 Answers

The only thing you should need is to enable auto-vectorization with your compiler.
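
Concretely, the code being compiled here is presumably just the question's loop wrapped in a function along these lines (a sketch assuming float, with the same signature that appears later in this answer):

// e.g. g++ -O3 -S foo.cpp
void foo(float *a, float *b, float k, int size) {
    while (size--) {
        (*a++) += (*b++) * k;
    }
}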

For example, compiling your code (assuming float) with GCC 5.2.0 and -O3 produces this main loop:

.L8:
    movups  (%rsi,%rax), %xmm1
    addl    $1, %r11d
    mulps   %xmm2, %xmm1
    addps   (%rdi,%rax), %xmm1
    movaps  %xmm1, (%rdi,%rax)
    addq    $16, %rax
    cmpl    %r11d, %r10d
    ja  .L8

Clang also vectorizes the loop, and additionally unrolls it four times. Unrolling may help on some processors, especially Haswell, even though there is no dependency chain. In fact, you can get GCC to unroll by adding -funroll-loops; in this case GCC unrolls to eight independent operations, unlike the case where there is a dependency chain.

One problem you may encounter is that your compiler may need to add some code to determine whether the arrays overlap, and then emit two branches: one without vectorization for when they do overlap and one with vectorization for when they don't. GCC and Clang both do this. ICC, however, does not vectorize the loop.

ICC 13.0.01 with -O3 produces:

..B1.4:                         # Preds ..B1.2 ..B1.4
        movss     (%rsi), %xmm1                                 #3.21
        incl      %ecx                                          #2.5
        mulss     %xmm0, %xmm1                                  #3.28
        addss     (%rdi), %xmm1                                 #3.11
        movss     %xmm1, (%rdi)                                 #3.11
        movss     4(%rsi), %xmm2                                #3.21
        addq      $8, %rsi                                      #3.21
        mulss     %xmm0, %xmm2                                  #3.28
        addss     4(%rdi), %xmm2                                #3.11
        movss     %xmm2, 4(%rdi)                                #3.11
        addq      $8, %rdi                                      #3.11
        cmpl      %eax, %ecx                                    #2.5
        jb        ..B1.4        # Prob 63%                      #2.5

To fix this, you need to tell the compiler that the arrays don't overlap, using the __restrict keyword:

void foo(float * __restrict a, float * __restrict b, float k, int size) {
    while (size--) {
        (*a++) += (*b++) * k;
    }
}

In this case ICC produces two branches: one for when the arrays are 16-byte aligned and one for when they are not. Here is the aligned branch:

..B1.16:                        # Preds ..B1.16 ..B1.15
        movaps    (%rsi), %xmm2                                 #3.21
        addl      $8, %r8d                                      #2.5
        movaps    16(%rsi), %xmm3                               #3.21
        addq      $32, %rsi                                     #1.6
        mulps     %xmm1, %xmm2                                  #3.28
        mulps     %xmm1, %xmm3                                  #3.28
        addps     (%rdi), %xmm2                                 #3.11
        addps     16(%rdi), %xmm3                               #3.11
        movaps    %xmm2, (%rdi)                                 #3.11
        movaps    %xmm3, 16(%rdi)                               #3.11
        addq      $32, %rdi                                     #1.6
        cmpl      %ecx, %r8d                                    #2.5
        jb        ..B1.16       # Prob 82%                      #2.5

ICC unrolls twice in both cases. Even though GCC and Clang produce both a vectorized and an unvectorized branch without __restrict, you may want to use __restrict anyway to remove the overhead of the code that determines which branch to use.

The last thing you can try is to tell the compiler that the arrays are aligned. This works with GCC and Clang (3.6):

void foo(float * __restrict a, float * __restrict b, float k, int size) {
    a = (float*)__builtin_assume_aligned (a, 32);
    b = (float*)__builtin_assume_aligned (b, 32);
    while (size--) {
        (*a++) += (*b++) * k;
    }
}

In this case GCC produces:

.L4:
    movaps  (%rsi,%r8), %xmm1
    addl    $1, %r10d
    mulps   %xmm2, %xmm1
    addps   (%rdi,%r8), %xmm1
    movaps  %xmm1, (%rdi,%r8)
    addq    $16, %r8
    cmpl    %r10d, %eax
    ja  .L4

Lastly, if your compiler supports OpenMP 4.0, you can use OpenMP like this:

void foo(float * __restrict a, float * __restrict b, float k, int size) {
    #pragma omp simd aligned(a:32) aligned(b:32)
    for(int i=0; i<size; i++) {
        a[i] += k*b[i];
    }
}

GCC produces the same code in this case as when using __builtin_assume_aligned. This should work for a more recent version of ICC (which I don't have).

I did not check MSVC. I expect it vectorizes this loop as well.

For more details about restrict, and about the compiler producing different branches for the overlapping and non-overlapping cases and for the aligned and unaligned cases, see sum-of-overlapping-arrays-auto-vectorization-and-restrict.
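
One caveat worth adding (an editorial note, not from the linked answer): once the parameters are declared __restrict, calling foo with overlapping ranges is undefined behavior, so the caller must guarantee disjoint arrays. A hypothetical caller:

#include <vector>

void foo(float * __restrict a, float * __restrict b, float k, int size);  // as defined above

void caller(float k) {
    std::vector<float> a(1024, 1.0f), b(1024, 2.0f);           // separate buffers, so they cannot overlap
    foo(a.data(), b.data(), k, static_cast<int>(a.size()));    // disjoint: OK
    // foo(a.data(), a.data() + 1, k, 1023);                   // overlapping: undefined behavior
}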


Here is one more suggestion to consider. If you know that the loop count is a multiple of the SIMD width, the compiler will not have to emit cleanup code. The following code

// gcc -O3
// n = size/8
void foo(float * __restrict a, float * __restrict b, float k, int n) {
    a = (float*)__builtin_assume_aligned (a, 32);
    b = (float*)__builtin_assume_aligned (b, 32);
    //#pragma omp simd aligned(a:32) aligned(b:32)
    for(int i=0; i<n*8; i++) {
        a[i] += k*b[i];
    }
}

produces the simplest assembly so far.

foo(float*, float*, float, int):
    sall    $2, %edx
    testl   %edx, %edx
    jle .L1
    subl    $4, %edx
    shufps  $0, %xmm0, %xmm0
    shrl    $2, %edx
    xorl    %eax, %eax
    xorl    %ecx, %ecx
    addl    $1, %edx
.L4:
    movaps  (%rsi,%rax), %xmm1
    addl    $1, %ecx
    mulps   %xmm0, %xmm1
    addps   (%rdi,%rax), %xmm1
    movaps  %xmm1, (%rdi,%rax)
    addq    $16, %rax
    cmpl    %edx, %ecx
    jb  .L4
.L1:
    rep ret

I used a multiple of eight and 32-byte alignment because then, just by adding the compiler switch -mavx, the compiler produces nice AVX vectorization:

foo(float*, float*, float, int):
    sall    $3, %edx
    testl   %edx, %edx
    jle .L5
    vshufps $0, %xmm0, %xmm0, %xmm0
    subl    $8, %edx
    xorl    %eax, %eax
    shrl    $3, %edx
    xorl    %ecx, %ecx
    addl    $1, %edx
    vinsertf128 $1, %xmm0, %ymm0, %ymm0
.L4:
    vmulps  (%rsi,%rax), %ymm0, %ymm1
    addl    $1, %ecx
    vaddps  (%rdi,%rax), %ymm1, %ymm1
    vmovaps %ymm1, (%rdi,%rax)
    addq    $32, %rax
    cmpl    %edx, %ecx
    jb  .L4
    vzeroupper
.L5:
    rep ret

I am not sure how the preamble could be made simpler, but the only improvement I see left is to remove one of the iterators and a compare. Namely, the addl $1, %ecx instruction should not be necessary, and neither should the cmpl %edx, %ecx. I'm not sure how to get GCC to fix this. I had a similar problem with GCC before (see Produce loops without cmp instruction in GCC).
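
To actually satisfy the assumptions above (32-byte alignment and a length that is a multiple of eight floats), the buffers have to be allocated accordingly. A minimal sketch using C++17's std::aligned_alloc (my addition, not part of the original answer; the helper name is made up):

#include <cstdlib>    // std::aligned_alloc, std::free
#include <cstddef>

// Allocate at least n floats, 32-byte aligned, with the count rounded up to a
// multiple of 8 so the vectorized loop needs no cleanup code. Free with std::free.
float *alloc_aligned_floats(std::size_t n) {
    std::size_t padded = (n + 7) / 8 * 8;
    // std::aligned_alloc requires the size to be a multiple of the alignment;
    // padded * sizeof(float) is a multiple of 32 here.
    return static_cast<float *>(std::aligned_alloc(32, padded * sizeof(float)));
}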

Answered by Z boson

The functions SAXPY (single-precision), DAXPY (double-precision), CAXPY (complex single-precision), and ZAXPY (complex double-precision) compute exactly the expression you want:

Y = a * X + Y

where a is a scalar constant, and X and Y are vectors.

These functions are provided by BLAS libraries and are optimized for all practical platforms. For CPUs, the best BLAS implementations are OpenBLAS, Intel MKL (optimized only for Intel x86 processors and Xeon Phi co-processors), BLIS, and Apple Accelerate (OS X only); for nVidia GPUs look at cuBLAS (part of the CUDA SDK), and for any GPU, ArrayFire.

These libraries are well-optimized and deliver better performance than whatever implementation you can quickly hack up.
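
For example, with a CBLAS-style interface such as the one shipped by OpenBLAS or MKL, the single-precision case looks roughly like this (a sketch; link against your BLAS of choice, e.g. with -lopenblas):

#include <cblas.h>
#include <vector>

// Computes a = k * b + a, i.e. Y = alpha * X + Y with Y = a, X = b, alpha = k.
void axpy_blas(std::vector<float> &a, const std::vector<float> &b, float k) {
    cblas_saxpy(static_cast<int>(a.size()), k, b.data(), 1, a.data(), 1);
}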

Answered by Marat Dukhan


