What am I doing wrong here? I'm getting 4 zeros instead of:
2
4
6
8
I would also like to modify my .asm function so it can run through longer vectors; to simplify things here I've used a vector of just four elements, so that I can sum it without a loop using 256-bit SIMD registers.
#include <iostream>
#include <chrono>
#include <malloc.h>   // _aligned_malloc / _aligned_free (MSVC)

extern "C" double *addVec(double *C, double *A, double *B, size_t &N);

int main()
{
    size_t N = 1 << 2;
    size_t reductions = N / 4;

    double *A = (double*)_aligned_malloc(N * sizeof(double), 32);
    double *B = (double*)_aligned_malloc(N * sizeof(double), 32);
    double *C = (double*)_aligned_malloc(N * sizeof(double), 32);

    for (size_t i = 0; i < N; i++)
    {
        A[i] = double(i + 1);
        B[i] = double(i + 1);
    }

    auto start = std::chrono::high_resolution_clock::now();
    double *out = addVec(C, A, B, reductions);
    auto finish = std::chrono::high_resolution_clock::now();

    for (size_t i = 0; i < N; i++)
    {
        std::cout << out[i] << std::endl;
    }
    std::cout << "\n\n";
    std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(finish - start).count() << " ns\n";

    std::cin.get();

    _aligned_free(A);
    _aligned_free(B);
    _aligned_free(C);
    return 0;
}
.data
; C -> RCX
; A -> RDX
; B -> r8
; N -> r9
.code
addVec proc
;xor rbx, rbx
align 16
;aIn:
vmovapd ymm0, ymmword ptr [rdx]
;vmovapd ymm1, ymmword ptr [rdx + rbx + 4]
vmovapd ymm2, ymmword ptr [r8]
;vmovapd ymm3, ymmword ptr [r8 + rbx + 4]
vaddpd ymm0, ymm2, ymm3
vmovapd ymmword ptr [rcx], ymm3
;inc rbx
;cmp rbx, qword ptr [r9]
;jl aIn
mov rax, rcx ; return the address of the output vector
ret
addVec endp
end
Also I would like to have some other clarifications:
What if I do something like the following without putting a loop inside my assembly function?:
#pragma omp parallel for
for (size_t i = 0; i < reductions; i++)
    addVec(C + 4*i, A + 4*i, B + 4*i);
Is this going to fork coreNumber + hyperThreading threads, each of them performing a SIMD add on four doubles? So in total 4 * coreNumber doubles per cycle? I can't add the hyper-threading here, right?
Update: can I do this?
.data
;// C -> RCX
;// A -> RDX
;// B -> r8
.code
addVec proc
; One cycle 8 micro-op
vmovapd ymm0, ymmword ptr [rdx] ; 1 port
vmovapd ymm1, ymmword ptr [rdx + 32]; 1 port
vmovapd ymm2, ymmword ptr [r8] ; 1 port
vmovapd ymm3, ymmword ptr [r8 + 32] ; 1 port
vfmadd231pd ymm0, ymm2, ymm4 ; 1 port
vfmadd231pd ymm1, ymm3, ymm4 ; 1 port
vmovapd ymmword ptr [rcx], ymm0 ; 1 port
vmovapd ymmword ptr [rcx + 32], ymm1; 1 port
; Return the address of the output vector
mov rax, rcx ; 1 port ?
ret
addVec endp
end
Or just this, since otherwise I would exceed the six ports you told me about?
.data
;// C -> RCX
;// A -> RDX
;// B -> r8
.code
addVec proc
;align 16
; One cycle 5 micro-op ?
vmovapd ymm0, ymmword ptr [rdx] ; 1 port
vmovapd ymm1, ymmword ptr [r8] ; 1 port
vfmadd231pd ymm0, ymm1, ymm2 ; 1 port
vmovapd ymmword ptr [rcx], ymm0 ; 1 port
; Return the address of the output vector
mov rax, rcx ; 1 port ?
ret
addVec endp
end
The reason your code gets the wrong result is that you have the operand order in your assembly backwards.
You're using Intel syntax, in which the destination comes before the sources. So in your original .asm code you should change
vaddpd ymm0, ymm2, ymm3
to
vaddpd ymm3, ymm2, ymm0
One way to see this is to use intrinsics and then look at the disassembly.
extern "C" double *addVec(double * __restrict C, double * __restrict A, double * __restrict B, size_t &N) {
__m256d x = _mm256_load_pd((const double*)A);
__m256d y = _mm256_load_pd((const double*)B);
__m256d z = _mm256_add_pd(x,y);
_mm256_store_pd((double*)C, z);
return C;
}
The disassembly from GCC on Linux, using g++ -S -O3 -mavx -masm=intel -mabi=ms foo.cpp, gives:
vmovapd ymm0, YMMWORD PTR [rdx]
mov rax, rcx
vaddpd ymm0, ymm0, YMMWORD PTR [r8]
vmovapd YMMWORD PTR [rcx], ymm0
vzeroupper
ret
The vaddpd ymm0, ymm0, YMMWORD PTR [r8]
instruction fuses a load and an addition into one fused micro-op. When I use that function with your code it prints 2, 4, 6, 8.
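If you also want to handle longer vectors without writing the loop in assembly, a minimal looped sketch using the same intrinsics could look like the following (addVecLoop is just an illustrative name; it assumes N is the element count, a multiple of 4, and that A, B, and C are 32-byte aligned):

#include <immintrin.h>
#include <cstddef>

// Hypothetical looped variant: C[i] = A[i] + B[i] for N doubles.
// Assumes N is a multiple of 4 and A, B, C are 32-byte aligned.
extern "C" double *addVecLoop(double * __restrict C, const double * __restrict A,
                              const double * __restrict B, size_t N) {
    for (size_t i = 0; i < N; i += 4) {
        __m256d x = _mm256_load_pd(&A[i]);           // aligned load of 4 doubles
        __m256d y = _mm256_load_pd(&B[i]);
        _mm256_store_pd(&C[i], _mm256_add_pd(x, y)); // add and store 4 doubles
    }
    return C;
}

Compiled with -O3 -mavx this will typically turn into the same vmovapd/vaddpd/vmovapd pattern per iteration as the hand-written version.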
You can find source code that sums two arrays x and y and writes the result to an array z at l1-memory-bandwidth-50-drop-in-efficiency-using-addresses-which-differ-by-4096. That code uses intrinsics and unrolls eight times. Disassemble it with gcc -S or objdump -d. Another source which does almost the same thing and is written in assembly is at obtaining-peak-bandwidth-on-haswell-in-the-l1-cache-only-getting-62. In the file triad_fma_asm.asm, change the line pi: dd 3.14159 to pi: dd 1.0. Both of these examples use single-precision floats, so if you want doubles you will have to make the necessary changes.
The answers to your other questions are:
Note that each core has far more physical registers than the logical ones you can program directly (the hardware renames registers).
See the first point above.
Core2 processors (since 2006) through Haswell can all process at most four µops per clock. However, using two techniques called micro-op fusion and macro-op fusion, it's possible to achieve six micro-ops per clock cycle on Haswell.
Micro-op fusion can fuse e.g. a load and an addition into one so-called fused micro-op, but each micro-op still needs its own port. Macro-op fusion can fuse e.g. a scalar add and a jump into one micro-op that only needs one port. Macro-op fusion is essentially two for one.
Haswell has eight ports. You can get six micro-ops in one clock cycle using seven ports like this.
256-load + 256-FMA //one fused µop using two ports
256-load + 256-FMA //one fused µop using two ports
256-store //one µop using two ports
64-bit add + jump //one µop using one port
So in fact each core of Haswell can process sixteen double-precision operations (four multiplications and four additions for each FMA), two 256-bit loads, one 256-bit store, and one 64-bit addition and branch in one clock cycle. In this question, obtaining-peak-bandwidth-on-haswell-in-the-l1-cache-only-getting-62, I obtained (in theory) five micro-ops in one clock cycle using six ports. However, in practice on Haswell this is difficult to achieve.
For your particular operation, which reads two arrays and writes one, it's bound by the two loads per clock cycle, so it can only issue one FMA per clock cycle. The best it can do is therefore four doubles per clock cycle.
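To make that bound concrete, here is a hedged intrinsics sketch of the read-two-arrays/write-one pattern using FMA (the function name triad and the variables x, y, z, k are illustrative, not from your code). Each iteration issues two 256-bit loads, one FMA, and one 256-bit store, so the two loads per clock cycle are the limiting factor:

#include <immintrin.h>
#include <cstddef>

// z[i] = k*x[i] + y[i]; assumes n is a multiple of 4 and 32-byte aligned pointers.
void triad(double * __restrict z, const double * __restrict x,
           const double * __restrict y, double k, size_t n) {
    __m256d kv = _mm256_set1_pd(k);                 // broadcast the scalar once
    for (size_t i = 0; i < n; i += 4) {
        __m256d xv = _mm256_load_pd(&x[i]);         // load 1
        __m256d yv = _mm256_load_pd(&y[i]);         // load 2
        _mm256_store_pd(&z[i], _mm256_fmadd_pd(kv, xv, yv)); // one FMA, one store
    }
}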
But let me tell you the little secret that Intel does not want people talking about much: most operations are memory-bandwidth bound and can't benefit much from parallelization. This includes the operation in your question. So although Intel keeps coming out with new technology every few years (e.g. AVX, FMA, AVX512, doubling the number of cores), each of which doubles the peak performance so that Moore's Law appears to hold, in practice the average benefit is linear rather than exponential, and it has been for several years now.