Adding two vectors in x86-64 assembly with AVX2, plus technical clarifications

What am I doing wrong here? I'm getting 4 zeros instead of:

2
4
6
8

I would also love to modify my .asm function so it can run through longer vectors; to simplify things here I've just used a vector with four elements, so that I can sum it without a loop using the 256-bit SIMD registers.

.cpp

#include <iostream>
#include <chrono>

extern "C" double *addVec(double *C, double *A, double *B, size_t &N);

int main()
{
    size_t N = 1 << 2;
    size_t reductions = N / 4;

    double *A = (double*)_aligned_malloc(N*sizeof(double), 32);
    double *B = (double*)_aligned_malloc(N*sizeof(double), 32);
    double *C = (double*)_aligned_malloc(N*sizeof(double), 32);

    for (size_t i = 0; i < N; i++)
    {
        A[i] = double(i + 1);
        B[i] = double(i + 1);
    }

    auto start = std::chrono::high_resolution_clock::now();

        double *out = addVec(C, A, B, reductions);

    auto finish = std::chrono::high_resolution_clock::now();

    for (size_t i = 0; i < N; i++)
    {
        std::cout << out[i] << std::endl;
    }

    std::cout << "\n\n";

    std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(finish - start).count() << " ns\n";

    std::cin.get();

    _aligned_free(A);
    _aligned_free(B);
    _aligned_free(C);

    return 0;
}

.asm

.data
; C -> RCX
; A -> RDX
; B -> r8
; N -> r9
.code
    addVec proc
        ;xor rbx, rbx
        align 16
        ;aIn:
            vmovapd ymm0, ymmword ptr [rdx]
            ;vmovapd ymm1, ymmword ptr [rdx + rbx + 4]
            vmovapd ymm2, ymmword ptr [r8]
            ;vmovapd ymm3, ymmword ptr [r8 + rbx + 4]

            vaddpd ymm0, ymm2, ymm3

            vmovapd ymmword ptr [rcx], ymm3
        ;inc rbx
        ;cmp rbx, qword ptr [r9]
        ;jl aIn
        mov rax, rcx    ; return the address of the output vector
    ret
    addVec endp
end

Also I would like to have some other clarifications:

  1. Are there eight 256-bit registers (ymm0-ymm7) for each core of my CPU, or are there eight in total?
  2. Are all the other registers, like rax, rbx, etc., per core or shared in total?
  3. Since I can handle 4 doubles per cycle with just the SIMD unit on one core, can I execute another instruction per cycle with the rest of the CPU? So, for example, can I add 5 doubles per cycle with one core (4 with SIMD + 1)?
  4. What if I do something like the following without putting a loop inside my assembly function?:

    #pragma omp parallel for
    for (size_t i = 0; i < reductions; i++)
        addVec(C + 4*i, A + 4*i, B + 4*i);

    Is this going to fork coreNumber + hyperThreading threads, each of them performing a SIMD add on four doubles? So in total 4 * coreNumber doubles per cycle? I can't count the hyper-threading here, right?


Update: can I do this?

.data
;// C -> RCX
;// A -> RDX
;// B -> r8
.code
    addVec proc
        ; One cycle 8 micro-op
            vmovapd ymm0, ymmword ptr [rdx]     ; 1 port
            vmovapd ymm1, ymmword ptr [rdx + 32]; 1 port
            vmovapd ymm2, ymmword ptr [r8]      ; 1 port
            vmovapd ymm3, ymmword ptr [r8 + 32] ; 1 port
            vfmadd231pd ymm0, ymm2, ymm4        ; 1 port
            vfmadd231pd ymm1, ymm3, ymm4        ; 1 port
            vmovapd ymmword ptr [rcx], ymm0     ; 1 port
            vmovapd ymmword ptr [rcx + 32], ymm1; 1 port

        ; Return the address of the output vector
        mov rax, rcx                            ; 1 port ?
    ret
    addVec endp
end

Or just this, since with the version above I would exceed the six ports that you told me about?

.data
;// C -> RCX
;// A -> RDX
;// B -> r8
.code
    addVec proc
        ;align 16
        ; One cycle 5 micro-op ?
        vmovapd ymm0, ymmword ptr [rdx]     ; 1 port
        vmovapd ymm1, ymmword ptr [r8]      ; 1 port
        vfmadd231pd ymm0, ymm1, ymm2        ; 1 port
        vmovapd ymmword ptr [rcx], ymm0     ; 1 port

        ; Return the address of the output vector
        mov rax, rcx                        ; 1 port ?
    ret
    addVec endp
end
asked Nov 08 '14 by Nerva

1 Answer

The reason your code gets the wrong result is that you have the operand order in your assembly backwards.

You're using Intel syntax, in which the destination comes before the source. So in your original .asm code you should change

vaddpd ymm0, ymm2, ymm3

to

vaddpd ymm3, ymm2, ymm0
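
With that one-line change (plus, as a precaution, the vzeroupper that GCC also emits in the disassembly below), your function body becomes:

vmovapd ymm0, ymmword ptr [rdx]   ; ymm0 = A[0..3]
vmovapd ymm2, ymmword ptr [r8]    ; ymm2 = B[0..3]
vaddpd  ymm3, ymm2, ymm0          ; ymm3 = B + A (destination first)
vmovapd ymmword ptr [rcx], ymm3   ; C[0..3] = A + B
mov rax, rcx                      ; return the address of the output vector
vzeroupper                        ; avoid AVX/SSE transition penalties
ret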

One way to see this is to use intrinsics and then look at the disassembly.

extern "C" double *addVec(double * __restrict C, double * __restrict A, double * __restrict B, size_t &N) {
    __m256d x = _mm256_load_pd((const double*)A);
    __m256d y = _mm256_load_pd((const double*)B);
    __m256d z = _mm256_add_pd(x,y);
    _mm256_store_pd((double*)C, z);
    return C;
}

The disassembly from GCC on Linux, using g++ -S -O3 -mavx -masm=intel -mabi=ms foo.cpp, gives:

vmovapd ymm0, YMMWORD PTR [rdx]
mov     rax, rcx
vaddpd  ymm0, ymm0, YMMWORD PTR [r8]
vmovapd YMMWORD PTR [rcx], ymm0
vzeroupper
ret

The vaddpd ymm0, ymm0, YMMWORD PTR [r8] instruction fuses the load and the addition into one fused micro-op (and the trailing vzeroupper avoids AVX/SSE transition penalties). When I use that function with your code it gets 2, 4, 6, 8.

You can find source code that sums two arrays x and y and writes out to an array z at l1-memory-bandwidth-50-drop-in-efficiency-using-addresses-which-differ-by-4096. This uses intrinsics and unrolls eight times. Disassemble the code with gcc -S or objdump -d. Another source which does almost the same thing, written directly in assembly, is at obtaining-peak-bandwidth-on-haswell-in-the-l1-cache-only-getting-62. In the file triad_fma_asm.asm, change the line pi: dd 3.14159 to pi: dd 1.0. Both examples use single-precision floats, so if you want doubles you will have to make the necessary changes.
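
As for running through longer vectors: rather than writing the loop in assembly, you can loop over the intrinsics and let the compiler generate the code. Here is a minimal sketch (the name addVecLoop is mine; it assumes N is the element count, that N is a multiple of 4, and that the pointers are 32-byte aligned, which your _aligned_malloc(..., 32) calls guarantee):

#include <cstddef>
#include <immintrin.h>

extern "C" double *addVecLoop(double * __restrict C, double * __restrict A,
                              double * __restrict B, size_t N)
{
    for (size_t i = 0; i < N; i += 4)                // four doubles per iteration
    {
        __m256d x = _mm256_load_pd(A + i);           // aligned 256-bit load
        __m256d y = _mm256_load_pd(B + i);
        _mm256_store_pd(C + i, _mm256_add_pd(x, y)); // C[i..i+3] = A[i..i+3] + B[i..i+3]
    }
    return C;
}

For N not a multiple of 4, add a scalar cleanup loop at the end; for unaligned data, switch to _mm256_loadu_pd and _mm256_storeu_pd.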

The answers to your other questions are:

  1. Each core of your processor is a physically distinct unit with its own set of registers. Each core has 16 general-purpose registers (e.g. rax, rbx, r8, r9, ...) and several special-purpose registers (e.g. RFLAGS). In 32-bit mode each core has eight 256-bit registers; in 64-bit mode it has sixteen. When AVX-512 arrives there will be thirty-two 512-bit registers (but still only eight in 32-bit mode).

Note that each core has far more physical registers than the architectural ones you can program directly.

  2. See point 1 above.

  3. Processors from Core2 (2006) through Haswell can process at most four µops per clock cycle. However, using two techniques called micro-op fusion and macro-op fusion, it's possible to reach six micro-ops per clock cycle on Haswell.

Micro-op fusion can fuse, e.g., a load and an addition into one so-called fused micro-op, but each micro-op still needs its own port. Macro-op fusion can fuse, e.g., a scalar add and a jump into one micro-op that needs only one port. Macro-op fusion is essentially two for one.

Haswell has eight ports. You can get six micro-ops in one clock cycle using seven ports, like this:

256-load + 256-FMA    //one fused µop using two ports
256-load + 256-FMA    //one fused µop using two ports
256-store             //one µop using two ports
64-bit add + jump     //one µop using one port

So in fact each core of Haswell can process sixteen double-precision floating point operations (four multiplications and four additions per FMA), two 256-bit loads, one 256-bit store, and one 64-bit addition and branch in a single clock cycle. In this question, obtaining-peak-bandwidth-on-haswell-in-the-l1-cache-only-getting-62, I obtained (in theory) five micro-ops in one clock cycle using six ports. In practice, however, even that is difficult to achieve on Haswell.
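
For reference, the inner loop of that linked triad code (x[i] = y[i] + k*x[i]) looks roughly like this; this is a reconstruction of mine, assuming ymm0 holds the broadcast constant k, rdx and rcx point past the ends of y and x, and rax is a negative byte offset counting up to zero:

loopTop:
    vmovapd     ymm1, ymmword ptr [rdx+rax]        ; load y[i]               (1 port)
    vfmadd231pd ymm1, ymm0, ymmword ptr [rcx+rax]  ; load x[i] + FMA, fused  (2 ports)
    vmovapd     ymmword ptr [rcx+rax], ymm1        ; store x[i]              (2 ports)
    add         rax, 32                            ; add + jump macro-fuse   (1 port)
    jl          loopTop

That is the five-µop, six-port pattern mentioned above.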

For your particular operation, which reads two arrays and writes one, the limit is the two 256-bit reads per clock cycle: each FMA needs a fresh load from both A and B, so only one FMA can issue per clock cycle. The best it can do is therefore four doubles per clock cycle.

  4. If you properly parallelize your code and your processor has four physical cores, then you could achieve 64 double-precision floating point operations per clock cycle (2 FMAs × 8 flops per FMA × 4 cores). That would be the theoretical best for some operations, but not for the one in your question.

But let me tell you the little secret that Intel does not want people talking much about: most operations are memory-bandwidth bound and can't benefit much from parallelization. This includes the operation in your question. So although Intel keeps coming out with new technology every few years (e.g. AVX, FMA, AVX-512, doubling the number of cores), each of which doubles the peak performance so that Moore's Law appears to hold, in practice the average benefit has been linear rather than exponential for several years now.
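
To put rough numbers on that for C = A + B (a back-of-envelope sketch; the bandwidth and clock figures below are assumptions for illustration, not measurements):

#include <cstdio>

int main()
{
    // Per element of C[i] = A[i] + B[i] on doubles:
    const double bytes_per_elem = 24.0;   // 2 x 8 B read + 8 B written
    const double flops_per_elem = 1.0;    // one addition
    const double dram_bw   = 25.6e9;      // assumed ~25.6 GB/s memory bandwidth
    const double core_peak = 3.5e9 * 4.0; // assumed 3.5 GHz x 4 adds/clock (one 256-bit add)

    const double bw_bound = dram_bw / bytes_per_elem * flops_per_elem;
    printf("bandwidth-bound: %.2e flop/s\n", bw_bound);   // ~1.1e9 flop/s
    printf("one-core peak:   %.2e flop/s\n", core_peak);  // ~1.4e10 flop/s
    // A single core's compute peak is already >10x the bandwidth bound,
    // so once the arrays spill out of cache, extra cores cannot help.
    return 0;
}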

answered Oct 29 '22 by Z boson