Does the loop variable type affect efficiency when used to index arrays?

I'm trying to optimize my code down to the last possible cycle, and am wondering whether the loop counter's type affects performance when it is used for array indexing.

I've done some experiments with the following program that just fills an array with 0:

#include <cstdint>
#include <cstdlib>
#include <emmintrin.h>

int main(int argc, char **argv)
{
  //typedef int CounterType;      // switch between these two
  typedef int64_t CounterType;

  CounterType N = atoi(argv[1]);
  uint8_t volatile dummy[N + 16];
  __m128i v = _mm_set1_epi8(0);
  for (int j = 0; j < 1000000; ++j)
  {
    #pragma nounroll
    for (CounterType i = 0; i <= N; i += CounterType(16))
    {
        _mm_storeu_si128((__m128i *)&dummy[i], v);
    }
  }
  return 0;
}

By using different loop counter types (CounterType) and different compilers, I've recorded the assembly code of the inner loop and measured performance with hardware performance counters ("perf stat ./a.out 32768"). I'm running on a Xeon 5670.

GCC4.9, int

.L3:
movups  %xmm0, (%rax)
addq    $16, %rax
movl    %eax, %edx
subl    %esi, %edx
cmpl    %ecx, %edx
jle     .L3

 4,127,525,521      cycles                    #    2.934 GHz
12,304,723,292      instructions              #    2.98  insns per cycle

GCC4.9, int64

.L7:
movups  %xmm0, (%rcx,%rax)
addq    $16, %rax
cmpq    %rax, %rdx
jge     .L7
4,123,315,191      cycles                    #    2.934 GHz
8,206,745,195      instructions              #    1.99  insns per cycle

ICC11, int64

..B1.6:
movdqu    %xmm0, (%rdx,%rdi)
addq      $16, %rdx
incq      %rcx
cmpq      %rbx, %rcx
jb        ..B1.6        # Prob 82%                      #24.5
2,069,719,166      cycles                    #    2.934 GHz
5,130,061,268      instructions

(faster because of micro-op fusion?)

ICC11, int

..B1.6:                         # Preds ..B1.4 ..B1.6
 movdqu    %xmm0, (%rdx,%rbx)                            #29.38
 addq      $16, %rdx                                     #24.37
 cmpq      %rsi, %rdx                                    #24.34
 jle       ..B1.6        # Prob 82%                      #24.34
4,136,109,529      cycles                    #    2.934 GHz                
8,206,897,268      instructions    

ICC13, int & int64

movdqu    %xmm0, (%rdi,%rax)                            #29.38
addq      $16, %rdi                                     #24.37
cmpq      %rsi, %rdi                                    #24.34
jle       ..B1.7       
4,123,963,321      cycles                    #    2.934 GHz
8,206,083,789      instructions              #    1.99  insns per cycle

The data seems to suggest that int64 is faster, perhaps because it matches the pointer size and therefore avoids any conversions. But I'm not convinced of that conclusion. Another possibility is that in some cases the compiler decided to do the loop comparison before the store to gain more parallelism, at the cost of one extra instruction (since x86's two-operand instructions are destructive). But that would be incidental, not fundamentally caused by the loop variable type.

Can someone explain this mystery (preferably someone knowledgeable about compiler transformations)?

There was also a claim in the CUDA C Best Practices Guide that signed loop counters are easier to generate code for than unsigned ones. That doesn't seem relevant here, because there are no multiplies in the inner loop's address calculation: the indexing expression gets turned into an induction variable. But apparently CUDA prefers computing addresses with a multiply-add, since MADD is one instruction just like an add, and it can cut register use by one.

asked Oct 31 '22 23:10 by Yale Zhang

1 Answer

Yes, the loop variable type can affect efficiency.

Let me suggest an even better solution with GCC.

#include <stdint.h>
#include <xmmintrin.h>

void distance(uint8_t* dummy, size_t n, const __m128 v0)
{
    intptr_t i;
    for (i = -(intptr_t)n; i < 0; i += 4) {
        _mm_store_ps(&((float*)dummy)[i + n], v0);
    }
}

With GCC 4.9.2 and GCC 5.3, this produces the following main loop:

.L5:
        vmovaps %xmm0, (%rdi,%rax)
        addq    $16, %rax
        js      .L5

Clang 3.6, however, still generates the cmp:

.LBB0_2:                                # =>This Inner Loop Header: 
        vmovaps %xmm0, 8(%rdi,%rax)
        addq    $4, %rax
        cmpq    $-4, %rax
        jl      .LBB0_2

Clang 3.7 unrolls four times and uses a cmp.

ICC 13 unrolls twice and also uses a cmp, so only GCC manages to do this without the unnecessary cmp instruction.

answered Nov 15 '22 05:11 by Z boson