
Is it worth bothering to align AVX-256 memory stores?

According to the Intel® 64 and IA-32 Architectures Optimization Reference Manual, section B.4 ("Performance Tuning Techniques for Intel® Microarchitecture Code Name Sandy Bridge"), subsection B.4.5.2 ("Assists"):

32-byte AVX store instructions that span two pages require an assist that costs roughly 150 cycles.

I'm using YMM registers to copy small fixed-size memory blocks, from 32 to 128 bytes, in a heap manager; the blocks are aligned by 16 bytes. That heap manager previously used XMM registers with movdqa, and I would like to "upgrade" it to YMM without changing the alignment from 16 to 32 bytes. So I'm using vmovdqu ymm0, ymmword ptr [rcx], then vmovdqu ymmword ptr [rdx], ymm0, etc.
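
For illustration, this is roughly what such a copy looks like in C intrinsics (a sketch, not the actual heap-manager code; the helper name and the multiple-of-32 size assumption are mine):

#include <immintrin.h>
#include <stddef.h>

/* Copy `size` bytes (assumed to be a multiple of 32, between 32 and 128)
   from 16-byte-aligned `src` to 16-byte-aligned `dst` using unaligned
   32-byte loads and stores (vmovdqu). */
static void copy_small_block(void *dst, const void *src, size_t size)
{
    const char *s = (const char *)src;
    char *d = (char *)dst;
    for (size_t i = 0; i < size; i += 32) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(s + i)); /* vmovdqu load  */
        _mm256_storeu_si256((__m256i *)(d + i), v);               /* vmovdqu store */
    }
}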

If I understood Intel's document correctly, whenever I do a 32-byte store across a 4K-page boundary, I will get a penalty of roughly 150 cycles.

But since the blocks are already aligned by 16 bytes, the chance of hitting a cross-page store is 16/4096 = 1/256. Extrapolating that statistically, each 32-byte store costs 1/256 × 150 (= 0.5859375) cycles of penalty on Sandy Bridge.
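
As a sanity check of that arithmetic, a small standalone C program (hypothetical, not part of the heap manager) can enumerate the 256 possible 16-byte-aligned start offsets within a 4 KiB page and count how many 32-byte stores cross into the next page:

#include <stdio.h>

int main(void)
{
    int crossing = 0, total = 0;
    for (int offset = 0; offset < 4096; offset += 16) {   /* 16-byte-aligned start offsets */
        total++;
        if (offset + 32 > 4096)                           /* the 32-byte store spans the page end */
            crossing++;
    }
    printf("crossing starts: %d of %d (p = %f)\n",
           crossing, total, (double)crossing / total);
    printf("expected penalty per store: %f cycles\n",
           150.0 * crossing / total);
    return 0;
}

It reports 1 crossing offset out of 256 and an expected penalty of about 0.586 cycles per store, matching the figure above.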

This is not that much, and it is definitely cheaper than branching to check the alignment, or than wasting memory by changing the alignment from 16 to 32 bytes.

I have the following questions:

  1. Are my calculations correct?

  2. Is it worth bothering to align AVX-256 memory stores for small fixed-size memory copy routines (32-128 bytes), given that the chance of hitting the penalty is so low?

  3. Are there processors that have higher unaligned 32-byte store penalties than Sandy Bridge—e.g., AMD or other Intel microarchitectures?

Asked Jun 16 '17 by Maxim Masiutin


1 Answer

Is it worth bothering to align [...] ?

Yes, it is definitely worth it, and it's also very cheap.

You can do aligned writes to an unaligned block easily without needing jumps.
For example:

//assume rcx = length of block, assume length > 8.
//assume rdx = pointer to block
xor rax,rax
lea r8,[rdx+rcx-8] //r8 = address of the unaligned tail write (last 8 bytes)
mov r9,rdx         //remember the original pointer for later
mov [rdx],rax      //start with an unaligned head write (first 8 bytes)
and rdx,not(7)     //force alignment...
add rdx,8          //...and step to the first aligned slot after the head write
sub r9,rdx         //r9 = -(bytes already covered up to the aligned start)
add rcx,r9         //rcx = bytes remaining from the aligned start
sub rcx,8
jl @tail           //jl fuses with the preceding sub; head+tail writes cover short blocks
@loop:
  mov [rdx],rax    //all writes in this block are aligned.
  lea rdx,[rdx+8]
  sub rcx,8
  jns @loop
@tail:
mov [r8],rax       //unaligned tail write (may overlap the last aligned write)

I'm sure you can extrapolate from this simple, non-unrolled example to an optimized AVX2 version.

Aligning is a simple matter of alignedstart = start and not(alignmentsize - 1).
You can then compute misalignmentcount = start xor alignedstart to get the number of misaligned bytes.

None of this requires jumps.
I'm sure you can translate this to AVX.
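
For instance, a minimal C-intrinsics sketch of the same head/body/tail pattern for a 32-byte-vector fill (the function name and the count >= 32 assumption are mine; this is not the FillChar code below):

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Fill `count` bytes (count >= 32) at `dest` with `value`.
   Head and tail are unaligned 32-byte stores, the body uses only
   aligned 32-byte stores, and there are no branches besides the loop. */
static void fill32_avx(void *dest, size_t count, uint8_t value)
{
    __m256i v = _mm256_set1_epi8((char)value);
    char *start = (char *)dest;
    char *end   = start + count;

    _mm256_storeu_si256((__m256i *)start, v);          /* unaligned head write */
    _mm256_storeu_si256((__m256i *)(end - 32), v);     /* unaligned tail write */

    /* alignedstart = start and not(alignmentsize - 1), then step past the head */
    char *p = (char *)(((uintptr_t)start & ~(uintptr_t)31) + 32);
    while ((size_t)(end - p) >= 32) {                  /* body: aligned stores only */
        _mm256_store_si256((__m256i *)p, v);
        p += 32;
    }
}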

The code below for FillChar is about 3x faster than the standard libs.
Note that I've used jumps here; testing showed it was faster to do so.

{$ifdef CPUX64}
procedure FillChar(var Dest; Count: NativeInt; Value: Byte);
//rcx = dest
//rdx=count
//r8b=value
asm
              .noframe
              .align 16
              movzx r8,r8b           //There's no need to optimize for count <= 3
              mov rax,$0101010101010101
              mov r9d,edx
              imul rax,r8            //fill rax with value.
              cmp edx,59             //Use simple code for small blocks.
              jl  @Below32
@Above32:     mov r11,rcx
              rep mov r8b,7          //code shrink to help alignment.
              lea r9,[rcx+rdx]       //r9=end of array
              sub rdx,8
              rep mov [rcx],rax      //unaligned write to start of block
              add rcx,8              //progress 8 bytes 
              and r11,r8             //is count > 8? 
              jz @tail
@NotAligned:  xor rcx,r11            //align dest
              lea rdx,[rdx+r11]
@tail:        test r9,r8             //and 7 is tail aligned?
              jz @alignOK
@tailwrite:   mov [r9-8],rax         //no, we need to do a tail write
              and r9,r8              //and 7
              sub rdx,r9             //dec(count, tailcount)
@alignOK:     mov r10,rdx
              and edx,(32+16+8)      //count the partial iterations of the loop
              mov r8b,64             //code shrink to help alignment.
              mov r9,rdx
              jz @Initloop64
@partialloop: shr r9,1              //every instruction is 4 bytes
              lea r11,[rip + @partial +(4*7)] //start at the end of the loop
              sub r11,r9            //step back as needed
              add rcx,rdx            //add the partial loop count to dest
              cmp r10,r8             //do we need to do more loops?
              jmp r11                //do a partial loop
@Initloop64:  shr r10,6              //any work left?
              jz @done               //no, return
              mov rdx,r10
              shr r10,(19-6)         //use non-temporal move for > 512kb
              jnz @InitFillHuge
@Doloop64:    add rcx,r8
              dec edx
              mov [rcx-64+00H],rax
              mov [rcx-64+08H],rax
              mov [rcx-64+10H],rax
              mov [rcx-64+18H],rax
              mov [rcx-64+20H],rax
              mov [rcx-64+28H],rax
              mov [rcx-64+30H],rax
              mov [rcx-64+38H],rax
              jnz @DoLoop64
@done:        rep ret
              //db $66,$66,$0f,$1f,$44,$00,$00 //nop7
@partial:     mov [rcx-64+08H],rax
              mov [rcx-64+10H],rax
              mov [rcx-64+18H],rax
              mov [rcx-64+20H],rax
              mov [rcx-64+28H],rax
              mov [rcx-64+30H],rax
              mov [rcx-64+38H],rax
              jge @Initloop64        //are we done with all loops?
              rep ret
              db $0F,$1F,$40,$00
@InitFillHuge:
@FillHuge:    add rcx,r8
              dec rdx
              db $48,$0F,$C3,$41,$C0 // movnti  [rcx-64+00H],rax
              db $48,$0F,$C3,$41,$C8 // movnti  [rcx-64+08H],rax
              db $48,$0F,$C3,$41,$D0 // movnti  [rcx-64+10H],rax
              db $48,$0F,$C3,$41,$D8 // movnti  [rcx-64+18H],rax
              db $48,$0F,$C3,$41,$E0 // movnti  [rcx-64+20H],rax
              db $48,$0F,$C3,$41,$E8 // movnti  [rcx-64+28H],rax
              db $48,$0F,$C3,$41,$F0 // movnti  [rcx-64+30H],rax
              db $48,$0F,$C3,$41,$F8 // movnti  [rcx-64+38H],rax
              jnz @FillHuge
@donefillhuge:mfence
              rep ret
              db $0F,$1F,$44,$00,$00  //db $0F,$1F,$40,$00
@Below32:     and  r9d,not(3)
              jz @SizeIs3
@FillTail:    sub   edx,4
              lea   r10,[rip + @SmallFill + (15*4)]
              sub   r10,r9
              jmp   r10
@SmallFill:   rep mov [rcx+56], eax
              rep mov [rcx+52], eax
              rep mov [rcx+48], eax
              rep mov [rcx+44], eax
              rep mov [rcx+40], eax
              rep mov [rcx+36], eax
              rep mov [rcx+32], eax
              rep mov [rcx+28], eax
              rep mov [rcx+24], eax
              rep mov [rcx+20], eax
              rep mov [rcx+16], eax
              rep mov [rcx+12], eax
              rep mov [rcx+08], eax
              rep mov [rcx+04], eax
              mov [rcx],eax
@Fallthough:  mov [rcx+rdx],eax  //unaligned write to fix up tail
              rep ret

@SizeIs3:     shl edx,2           //r9 <= 3  r9*4
              lea r10,[rip + @do3 + (4*3)]
              sub r10,rdx
              jmp r10
@do3:         rep mov [rcx+2],al
@do2:         mov [rcx],ax
              ret
@do1:         mov [rcx],al
              rep ret
@do0:         rep ret
end;
{$endif}

This is not that much, and it is definitely cheaper than branching to check the alignment

I think the checks are quite cheap (see above). Note that you can have pathological cases where you incur the penalty all the time, because the blocks happen to straddle cache lines (or pages) a lot.

About mixing AVX and SSE code
On Intel there is a 300+ cycle penalty for mixing AVX and (legacy, i.e. non-VEX encoded) SSE instructions.
If you use AVX2 instructions to write to memory, you'll incur a penalty whenever SSE code is used in the rest of your application; and Delphi 64 uses SSE exclusively for floating point.
Using AVX2 code in this context would incur crippling delays. For this reason alone I suggest you don't consider AVX2.
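
(For reference, a minimal sketch, with a made-up function name, of the usual mitigation when you write the AVX code yourself: executing vzeroupper on the way out returns the processor to the clean state described in the quote further down.)

#include <immintrin.h>

/* Copy 32 bytes with AVX, then leave the "dirty upper" state
   before any legacy (non-VEX) SSE code runs again. */
static void copy32_avx(char *dst, const char *src)
{
    __m256i v = _mm256_loadu_si256((const __m256i *)src);
    _mm256_storeu_si256((__m256i *)dst, v);
    _mm256_zeroupper();   /* emits the vzeroupper instruction */
}

In hand-written assembly the equivalent is a vzeroupper instruction at the end of the YMM routine; compilers that generate VEX code normally insert it automatically.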

There is no need for AVX2
You can saturate the memory bus with just writes using 64-bit general-purpose registers.
When doing combined reads and writes, 128-bit reads and writes will also easily saturate the bus.
This is true on older processors, and it obviously also holds once you move beyond the L1 cache, but it is no longer true on the latest processors.
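
If you want to check that claim on your own machine, a rough sketch (the buffer size and the POSIX clock_gettime timing are arbitrary choices of mine) is to time a plain 64-bit store loop over a buffer much larger than the caches and compare the result with the machine's known memory bandwidth:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    size_t n = 256u * 1024u * 1024u / sizeof(uint64_t);  /* 256 MiB, well past L3 */
    uint64_t *buf = malloc(n * sizeof *buf);
    if (!buf) return 1;

    volatile uint64_t *p = buf;                          /* keeps the compiler from vectorizing the loop */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < n; i++)
        p[i] = 0x0101010101010101u;                      /* plain 64-bit GPR stores */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.2f GB/s\n", n * sizeof *buf / secs / 1e9);
    free(buf);
    return 0;
}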

Why is there a penalty for mixing AVX and SSE (legacy) code?
Intel writes the following:

Initially the processor is in clean state (1), where Intel SSE and Intel AVX instructions are executed with no penalty. When a 256-bit Intel AVX instruction is executed, the processor marks that it is in the Dirty Upper state (2). While in this state, executing an Intel SSE instruction saves the upper 128 bits of all YMM registers and the state changes to Saved Dirty Upper state (3). Next time an Intel AVX instruction is executed the upper 128 bits of all YMM registers are restored and the processor is back at state (2). These save and restore operations have a high penalty. Frequent execution of these transitions causes significant performance loss.

There is also the issue of dark silicon. AVX2 code uses a lot of hardware, and having all that silicon lit up uses a lot of power, which affects the thermal headroom. When executing AVX2 code the CPU throttles down, sometimes even below the normal non-turbo threshold. By powering down the 256-bit AVX circuitry the CPU can achieve higher turbo clocks thanks to the better thermal headroom. The off switch for the AVX2 circuitry is not seeing 256-bit code for a longish duration (about 675 µs), and the on switch is seeing AVX2 code again. Mixing the two causes the circuitry to be switched on and off repeatedly, which takes many cycles.

Answered Oct 06 '22 by Johan