What is vmovdqu doing here?

I have a Java loop that looks like this:

public void testMethod() {
    int[] nums = new int[10];
    for (int i = 0; i < nums.length; i++) {
        nums[i] = 0x42;
    }
} 

The assembly I get is this:

0x00000001296ac845: cmp    %r10d,%ebp
0x00000001296ac848: jae    0x00000001296ac8b4
0x00000001296ac84a: movl   $0x42,0x10(%rbx,%rbp,4)
0x00000001296ac852: inc    %ebp               
0x00000001296ac854: cmp    %r11d,%ebp
0x00000001296ac857: jl     0x00000001296ac845  

0x00000001296ac859: mov    %r10d,%r8d
0x00000001296ac85c: add    $0xfffffffd,%r8d
0x00000001296ac860: mov    $0x80000000,%r9d
0x00000001296ac866: cmp    %r8d,%r10d
0x00000001296ac869: cmovl  %r9d,%r8d
0x00000001296ac86d: cmp    %r8d,%ebp
0x00000001296ac870: jge    0x00000001296ac88e
0x00000001296ac872: vmovq  -0xda(%rip),%xmm0                                                    
0x00000001296ac87a: vpunpcklqdq %xmm0,%xmm0,%xmm0
0x00000001296ac87e: xchg   %ax,%ax

0x00000001296ac880: vmovdqu %xmm0,0x10(%rbx,%rbp,4)  
0x00000001296ac886: add    $0x4,%ebp          
0x00000001296ac889: cmp    %r8d,%ebp
0x00000001296ac88c: jl     0x00000001296ac880  

If my understanding is correct, the first block of assembly is the one that does nums[i] = 0x42;. The third block contains vmovdqu, which the documentation describes like this:

The vmovdqu instruction moves values from an integer vector to an unaligned memory location.

However, I still don't fully understand what vmovdqu is doing in the context of my loop.

What exactly is the third block of assembly code doing?

The complete code is available here: https://pastebin.com/cT5cJcMS

asked May 16 '18 by An SO User

3 Answers

Your JIT compiler auto-vectorized your loop, storing 4 ints per asm iteration.

But it made asm that's overcomplicated and missed a lot of optimizations. I wonder if this is maybe only first-stage code-gen before the JIT compiler decided to fully optimize?

Your code doesn't return nums, so it's destroyed right after being created. After inlining, your function should optimize down to no instructions at all. Or as a stand-alone function, should be just a ret. Allocating memory and then letting it get garbage collected is not an observable side-effect that an optimizer needs to preserve.
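
For comparison, ahead-of-time compilers do the same kind of dead-allocation elimination. Here's a C sketch of the same situation (my own example, not from the original post); clang at -O2 typically compiles it to just a ret, deleting the malloc/free pair and the loop:

#include <stdlib.h>

/* C analog of testMethod(): allocate, fill, and let the buffer die
 * without escaping.  Since nothing escapes, clang -O2 typically
 * optimizes the whole body away, including the malloc/free pair. */
void test_method(void) {
    int *nums = malloc(10 * sizeof(int));
    if (!nums)
        return;
    for (int i = 0; i < 10; i++)
        nums[i] = 0x42;
    free(nums);
}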

Then, if new succeeds, nums.length will be 10. So the code could be as simple as:

# %rbx holds a pointer to the actual data storage for nums[]
vbroadcastss  -0x????(%rip),%xmm0      # broadcast-load the 0x42 constant into xmm0
vmovdqu       %xmm0,   (%rbx)          # nums[0..3] : 16 bytes
vmovdqu       %xmm0, 16(%rbx)          # nums[4..7] : 16 bytes
vmovq         %xmm0, 32(%rbx)          # nums[8..9] : 8 bytes

Fully unrolling the loop makes the most sense here; setting up loop counters and so on takes more instructions and code-size than a couple stores. Especially when the size isn't a multiple of the vector width, the last partial vector has to be handled specially anyway.

BTW, if your size had been 11 instead of 10, you could either do 8 + 4 byte stores for the last 3 elements, or a 16-byte store that partially overlapped: e.g. 16-byte vmovdqu stores to (%rbx), 16(%rbx), and 28(%rbx), the last covering nums[7..10]. A final unaligned vector that ends at the end of the array is a common strategy when manually vectorizing (or in glibc's small-buffer handling for memcpy), but even ahead-of-time compilers don't seem to use it.
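
In intrinsics form, that overlapped-store version for 11 elements could look like this (a sketch; the function name is mine, not from the JIT output):

#include <immintrin.h>

// Sketch of the len=11 case: two full vectors plus one final vector
// that overlaps nums[7] and ends exactly at the end of the array.
void set42_11(int *nums) {
    __m128i v = _mm_set1_epi32(0x42);           // splat 0x42 into 4 lanes
    _mm_storeu_si128((__m128i*)(nums + 0), v);  // nums[0..3]
    _mm_storeu_si128((__m128i*)(nums + 4), v);  // nums[4..7]
    _mm_storeu_si128((__m128i*)(nums + 7), v);  // nums[7..10], overlapping nums[7]
}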


Other obvious missed optimizations:

  • vmovq load + vpunpcklqdq to broadcast. With AVX available, vbroadcastss is by far the best way to broadcast a 32-bit constant from memory. One instruction with no ALU uop required. Maybe the JIT compiler doesn't actually know about new AVX instructions?

  • mov %r10d,%r8d + add $-3,%r8d: this should obviously be an lea -3(%r10), %r8d.

It's not clear what the starting value of %ebp is supposed to be; if the JVM is slicing off chunks of a buffer somewhere so RBX isn't the base of the array, then maybe EBP isn't 0 before the scalar loop. IDK why the loop bound of the scalar loop is in a register instead of an immediate, though.

It's odd to put static data in the same page as code (-0xda(%rip) is still in the same page). There's no big penalty, but it means the same page will need to be in the iTLB and dTLB, so you're covering less total code + data than if you used separate pages. Not a huge deal with 2M hugepages, though. The shared 2nd-level TLB is a victim cache (IIRC), so the iTLB miss that populates it probably won't help the vmovq load get a TLB hit. It will probably do a 2nd page walk.


I don't know why even good ahead-of-time C compilers like gcc and clang overcomplicate this so much, for a loop over an array with unknown alignment and length.

void set42(int *nums, unsigned long int len) {
    for (unsigned long int i=0 ; i<len ; i++ ) {
        *nums++ = 0x42;
    }
}

This is what I'd do by hand, for 128-bit vectors without loop unrolling (and optimistically assuming that it's not worth reaching an alignment boundary, like your JIT, and like clang and gcc8 and higher):

# x86-64 System V calling convention: int*nums in RDI,  len in RSI
set42:
    cmp           $4, %rsi
    jb          .Lsmall_count

    lea          -16(%rdi, %rsi,4), %rdx    # pointer to end-16, the start of the final vector store
    vbroadcastss constant(%rip), %xmm0
 .p2align 4
 .Lvector:                          # do {
    vmovdqu      %xmm0, (%rdi)
    add          $16, %rdi          # nums += 4 elements
    cmp          %rdx, %rdi
    jb           .Lvector           # while(nums < end-16);

    # only reached for sizes >= 16 bytes so we can always store a full possibly-overlapping final vector
    # for len = 16, this results in 2 stores to the same address, but that's cheaper than extra branches even if len=16 is common
    vmovdqu      %xmm0, (%rdx)   # final potentially-overlapping vector
    ret

.Lsmall_count:
    test         %rsi,%rsi
    jz          .Ldone
   # some compilers will fully unroll this with a chain of branches
   # maybe worth doing if small inputs are common
  .Lscalar:                       # do {
    movl        $0x42, (%rdi)
    add         $4, %rdi          # *nums++ = 0x42;
    dec         %rsi
    jnz         .Lscalar          # }while(--len);
  # a more sophisticated cleanup strategy using SIMD is possible, e.g. 8-byte stores,
  # but I haven't bothered.

.Ldone:
    ret

Notice that for len>=4, there's one fall-through branch at the top, then only the loop branch. Total overhead is 1 macro-fused cmp/jcc, 1 broadcast load, and 1 lea. The loop is 3 uops with a non-indexed addressing mode.

AFAIK, compilers don't know how to use a possibly-overlapping last vector effectively. It's much better than scalar cleanup most of the time. Notice that for len=4 (16 bytes), we do the same vector store twice. But for len=8 (32 bytes), the loop exits after the first iteration, so we still only do 2 total stores. i.e. with any exact multiple of the vector width other than 1 vector, we don't do an overlapping store. Branching the same way for len=4 and len=8 is actually nice for branch prediction.
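
For reference, here's that asm strategy expressed with intrinsics (my translation, same logic including the possibly-overlapping final store):

#include <immintrin.h>

// Intrinsics version of the asm above (my translation; same strategy).
void set42_vec(int *nums, unsigned long len) {
    if (len < 4) {                               // .Lsmall_count
        while (len--)
            *nums++ = 0x42;                      // scalar fallback for tiny arrays
        return;
    }
    __m128i v = _mm_set1_epi32(0x42);            // vbroadcastss
    int *last = nums + len - 4;                  // start of the final vector store
    do {                                         // .Lvector:
        _mm_storeu_si128((__m128i*)nums, v);
        nums += 4;
    } while (nums < last);                       // while(nums < end-16)
    _mm_storeu_si128((__m128i*)last, v);         // final potentially-overlapping vector
}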


Even good ahead-of-time C compilers make this super-complicated, as you can see on the Godbolt compiler explorer. Some of clang's complexity comes from unrolling more; clang6.0 unrolls a huge number of times. (I chose the compiler versions and options that led to the least complicated code. gcc7.3 and clang6.0 emit much larger functions for this.)

gcc7 and earlier go scalar until an alignment boundary, then use aligned vector stores. This can be good if you expect the pointer to be misaligned often, but keeping the code simpler and the aligned case even cheaper is better when pointers usually are aligned and the penalty for misaligned stores is low.

answered Nov 15 '22 by Peter Cordes


The optimizer has chosen to vectorize your loop, setting 4 values per "iteration". (The instructions preceding the vmovdqu are fairly opaque, but presumably it's splatting 0x42 into all lanes of XMM0.) The "unaligned" variant is necessary because the array is not guaranteed to be SIMD-aligned in memory (after all, it's storing int32s, not int32x4s).
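
In intrinsics terms (a sketch of mine, not the JIT's literal output), the splat-plus-unaligned-store pair is:

#include <immintrin.h>

// Splat + unaligned store, in intrinsics terms (illustrative sketch).
void store4(int *p) {
    __m128i v = _mm_set1_epi32(0x42);  // 0x42 in all four 32-bit lanes
    _mm_storeu_si128((__m128i*)p, v);  // vmovdqu: legal for any p
    // the aligned form, _mm_store_si128 (vmovdqa), would fault unless
    // p were 16-byte aligned
}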

answered Nov 15 '22 by Sneftel


The compiler has unrolled the loop to enable vectorization.

// r10d holds the length of the array and ebp holds the loop index.
0x00000001296ac845: cmp    %r10d,%ebp
// This branch is only taken when the loop index `i` is greater than or equal to `nums.length`.
0x00000001296ac848: jae    0x00000001296ac8b4
// Performs a single iteration.
0x00000001296ac84a: movl   $0x42,0x10(%rbx,%rbp,4)
// Increment the loop index.
0x00000001296ac852: inc    %ebp
// r11d contains some constant. This is just to ensure that the number of any remaining iterations is a multiple of 4.
0x00000001296ac854: cmp    %r11d,%ebp
// This branch is NOT taken (falls through) only when either zero iterations are left or when the number of remaining iterations is a multiple of 4.
0x00000001296ac857: jl     0x00000001296ac845
// These instructions make sure that the loop index does not overflow.
0x00000001296ac859: mov    %r10d,%r8d
0x00000001296ac85c: add    $0xfffffffd,%r8d
0x00000001296ac860: mov    $0x80000000,%r9d
0x00000001296ac866: cmp    %r8d,%r10d
0x00000001296ac869: cmovl  %r9d,%r8d
// The next two instructions check whether there are any remaining iterations.
0x00000001296ac86d: cmp    %r8d,%ebp
0x00000001296ac870: jge    0x00000001296ac88e
// If we reach here, the number of remaining iterations must be a multiple of 4.
// Initialize xmm0 with 4 copies of 0x42.
0x00000001296ac872: vmovq  -0xda(%rip),%xmm0                                                    
0x00000001296ac87a: vpunpcklqdq %xmm0,%xmm0,%xmm0
// This is a NOP just to align the loop on a 64-byte cache line boundary for performance.
0x00000001296ac87e: xchg   %ax,%ax
// Vectorized 4 iterations of the loop.
0x00000001296ac880: vmovdqu %xmm0,0x10(%rbx,%rbp,4)  
0x00000001296ac886: add    $0x4,%ebp          
0x00000001296ac889: cmp    %r8d,%ebp
0x00000001296ac88c: jl     0x00000001296ac880
// All iterations have been executed at this point.
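
Putting the pieces together, the generated code has roughly this C-level shape (my reconstruction, not a literal decompilation; peel_bound stands in for the value the JIT keeps in r11d):

#include <immintrin.h>

// Rough C-level shape of the generated code (my reconstruction).
// The scalar pre-loop guarantees that (len - i) is a multiple of 4
// before the vectorized loop starts.
void set42_shape(int *nums, int len, int peel_bound) {
    int i = 0;
    for (; i < peel_bound; i++)                   // scalar pre-loop (first block)
        nums[i] = 0x42;
    __m128i v = _mm_set1_epi32(0x42);             // what vmovq + vpunpcklqdq build in xmm0
    for (; i < len; i += 4)                       // vectorized loop (third block)
        _mm_storeu_si128((__m128i*)&nums[i], v);  // vmovdqu: 4 ints per iteration
}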
answered Nov 15 '22 by Hadi Brais