In Agner Fog's excellent microarchitecture.pdf (section 9.14) I read that:
Store forwarding works in the following cases: [...] When a write of 128 or 256 bits is followed by a read of the same size and the same address, aligned by 16.
On the other hand, in Intel's Architecture Optimization Reference Manual (2.2.5.2 Intel Sandy Bridge, L1 DCache) I read that:
Stores cannot forward to loads in the following cases: [...] Any load that crosses a 16-byte boundary of a 32-byte store.
"Any load" would seem to include 32-byte loads as well. I wrote the following simple code to test this, and it seems that 32-byte stores are not forwarded to subsequent 32-byte loads on the Sandy Bridge architecture. Here is the code:
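#include <stdlib.h>
#include <malloc.h>

int main(){
    long i;
    // 4 KiB-aligned buffer, large enough for one 32-byte vector
    double *tempa = (double*)memalign(4096, sizeof(double)*4);
    for(i=0; i<4; i++) tempa[i] = 1.0;
    for(i=0; i<1000000000; i++){ // 1e9 iterations
#ifdef TEST_AVX
        // 32-byte store followed by a 32-byte load from the same address
        __asm__("vmovapd %%ymm12, (%0)\n\t"
                "vmovapd (%0), %%ymm12\n\t"
                :
                :"r"(tempa)
                :"xmm12", "memory"); // tell the compiler what the asm touches
#else
        // 16-byte store followed by a 16-byte load from the same address
        __asm__("movapd %%xmm12, (%0)\n\t"
                "movapd (%0), %%xmm12\n\t"
                :
                :"r"(tempa)
                :"xmm12", "memory");
#endif
    }
    return 0;
}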
The only thing done in the loop is a store from a vector register to a 4 KiB-aligned memory location, followed by a load from the same location back into the register. When compiled to use the AVX instructions (gcc -O3 -DTEST_AVX), the execution time is 3.1 s on my 2.7 GHz i7-2620M. With the SSE2 instructions, the time is 2.5 s. I have also looked at the performance counters. In the AVX case I count one store-forwarding block event per iteration (event 03H, umask 02H, LD_BLOCKS.STORE_FORWARD). The counter reads 0 in the SSE2 case.
Can anybody shed some light on this? Does SB indeed not support forwarding of 32-byte stores to 32-byte loads? If so, spilling ymm registers seems a rather expensive thing to do.
It seems that there is no store-to-load blocking with 32-byte loads on Sandy Bridge after all. Consider the following modified loop body:
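#ifdef TEST_AVX
        // store from ymm12, load into a different register (ymm13)
        __asm__("vmovapd %%ymm12, (%0)\n\t"
                "vmovapd (%0), %%ymm13\n\t"
                :
                :"r"(tempa)
                :"xmm13", "memory"); // tell the compiler what the asm touches
#else
        // store from xmm12, load into a different register (xmm13)
        __asm__("movapd %%xmm12, (%0)\n\t"
                "movapd (%0), %%xmm13\n\t"
                :
                :"r"(tempa)
                :"xmm13", "memory");
#endif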
The change is the destination register: I now use two different registers for the store and the load, so that there is no register dependence between the two instructions or between subsequent iterations. In this case the SSE version takes 1 cycle per iteration, while the AVX version takes 2 cycles. This is consistent with the fact that SB has a capacity of two 16-byte loads per cycle. Hence, loading 32 bytes takes two cycles, with no stall.
The problem must lie in the counter logic. Clearly, in the AVX case LD_BLOCKS.STORE_FORWARD is incremented although no blocking takes place. This should be taken into account when analyzing performance using the counters.