sum of overlapping arrays, auto-vectorization, and restrict

Tags:

Arstechnia recently had an article Why are some programming languages faster than others. It compares Fortran and C and mentions summing arrays. In Fortran it's assumed that arrays don't overlap so that allows further optimization. In C/C++ pointers to the same type may overlap so this optimization can't be used in general. However, in C/C++ one can use the restrict or __restrict keyword to tell the compiler not to assume the pointers overlap. So I started looking into this in regards to auto-vectorization.

The following code vectorizes in GCC and MSVC

Click to copy

void dot_int(int *a, int *b, int *c, int n) {
    for(int i=0; i<n; i++) {
        c[i] = a[i] + b[i];
    }
}

I tested this with and without overlapping arrays and it gets the correct result. However, the way I would vectorize this loop manually with SSE does not handle overlapping arrays.

Click to copy

int i=0;    
for(; i<n-3; i+=4) {
    __m128i a4 = _mm_loadu_si128((__m128i*)&a[i]);
    __m128i b4 = _mm_loadu_si128((__m128i*)&b[i]);
    __m128i c4 = _mm_add_epi32(a4,b4);
    _mm_storeu_si128((__m128i*)c, c4);
}
for(; i<n; i++) {
    c[i] = a[i] + b[i];
}

Next I tried using __restrict. I assumed that since the compiler could assume the arrays don't overlap it would not handle overlapping arrays but both GCC and MSVC still get the correct result for overlapping arrays even with __restrict.

Click to copy

void dot_int_restrict(int * __restrict a, int * __restrict b, int * __restrict c, int n) {
    for(int i=0; i<n; i++) {
        c[i] = a[i] + b[i];
    }
}

Why does the auto-vectorized code with and without __restrict get the correct result for overlapping arrays?.

Here is the full code I used to test this:

Click to copy

#include <stdio.h>
#include <immintrin.h>
void dot_int(int *a, int *b, int *c, int n) {
    for(int i=0; i<n; i++) {
        c[i] = a[i] + b[i];
    }
    for(int i=0; i<8; i++) printf("%d ", c[i]); printf("\n"); 
}

void dot_int_restrict(int * __restrict a, int * __restrict b, int * __restrict c, int n) {
    for(int i=0; i<n; i++) {
        c[i] = a[i] + b[i];
    }
    for(int i=0; i<8; i++) printf("%d ", c[i]); printf("\n"); 
}

void dot_int_SSE(int *a, int *b, int *c, int n) {
    int i=0;    
    for(; i<n-3; i+=4) {
        __m128i a4 = _mm_loadu_si128((__m128i*)&a[i]);
        __m128i b4 = _mm_loadu_si128((__m128i*)&b[i]);
        __m128i c4 = _mm_add_epi32(a4,b4);
        _mm_storeu_si128((__m128i*)c, c4);
    }
    for(; i<n; i++) {
        c[i] = a[i] + b[i];
    }
    for(int i=0; i<8; i++) printf("%d ", c[i]); printf("\n"); 
}

int main() {
    const int n = 100;
    int a[] = {1,1,1,1,1,1,1,1};
    int b1[] = {1,1,1,1,1,1,1,1,1};
    int b2[] = {1,1,1,1,1,1,1,1,1};
    int b3[] = {1,1,1,1,1,1,1,1,1};

    int c[8];
    int *c1 = &b1[1];
    int *c2 = &b2[1];
    int *c3 = &b3[1];

    dot_int(a,b1,c, 8);
    dot_int_SSE(a,b1,c,8);

    dot_int(a,b1,c1, 8);
    dot_int_restrict(a,b2,c2,8);
    dot_int_SSE(a,b3,c3,8);

}

The output (from MSVC)

Click to copy

2 2 2 2 2 2 2 2 //no overlap default
2 2 2 2 2 2 2 2 //no overlap with manual SSE vector code
2 3 4 5 6 7 8 9 //overlap default
2 3 4 5 6 7 8 9 //overlap with restrict
3 2 2 2 1 1 1 1 //manual SSE vector code

Edit:

Here is another inserting version which produces much simpler code

Click to copy

void dot_int(int * __restrict a, int * __restrict b, int * __restrict c, int n) {
    a = (int*)__builtin_assume_aligned (a, 16);
    b = (int*)__builtin_assume_aligned (b, 16);
    c = (int*)__builtin_assume_aligned (c, 16);
    for(int i=0; i<n; i++) {
        c[i] = a[i] + b[i];
    }
}

710

asked May 14 '14 09:05

Z boson

1 Answers

I don't get what the problem is. Testing on Linux/64 bit, GCC 4.6, -O3, -mtune=native, -msse4.1 (i.e. a very old compiler/system), this code

Click to copy

void dot_int(int *a, int *b, int *c, int n) {
    for(int i=0; i<n; ++i) {
        c[i] = a[i] + b[i];
    }
}

compiles to this inner loop:

Click to copy

.L4:
    movdqu  (%rdi,%rax), %xmm1
    addl    $1, %r8d
    movdqu  (%rsi,%rax), %xmm0
    paddd   %xmm1, %xmm0
    movdqu  %xmm0, (%rdx,%rax)
    addq    $16, %rax
    cmpl    %r8d, %r10d
    ja      .L4
    cmpl    %r9d, %ecx
    je      .L1

While this code

Click to copy

void dot_int_restrict(int * __restrict a, int * __restrict b, int * __restrict c, int n) {
    for(int i=0; i<n; ++i) {
        c[i] = a[i] + b[i];
    }
}

compiles to this:

Click to copy

.L15:
    movdqu  (%rbx,%rax), %xmm0
    addl    $1, %r8d
    paddd   0(%rbp,%rax), %xmm0
    movdqu  %xmm0, (%r11,%rax)
    addq    $16, %rax
    cmpl    %r10d, %r8d
    jb      .L15
    addl    %r12d, %r9d
    cmpl    %r12d, %r13d
    je      .L10

As you can clearly see there's one less load. I guess it correclty estimated that there's no need to explicitely load memory before performing the sum, as the result won't overwrite anythng.

There's also room for way more optimizations -- GCC doesn't know that the parameters are f.i. 128 bit aligned, hence it must generate a huge preamble to check that there are no alignment issues (YMMV), and a postable to deal with extra unaligned parts (or less wide than 128 bits). This actually happens with both versions above. This is the complete code generated for dot_int:

Click to copy

dot_int:
.LFB626:
        .cfi_startproc
        testl   %ecx, %ecx
        pushq   %rbx
        .cfi_def_cfa_offset 16
        .cfi_offset 3, -16
        jle     .L1
        leaq    16(%rdx), %r11
        movl    %ecx, %r10d
        shrl    $2, %r10d
        leal    0(,%r10,4), %r9d
        testl   %r9d, %r9d
        je      .L6
        leaq    16(%rdi), %rax
        cmpl    $6, %ecx
        seta    %r8b
        cmpq    %rax, %rdx
        seta    %al
        cmpq    %r11, %rdi
        seta    %bl
        orl     %ebx, %eax
        andl    %eax, %r8d
        leaq    16(%rsi), %rax
        cmpq    %rax, %rdx
        seta    %al
        cmpq    %r11, %rsi
        seta    %r11b
        orl     %r11d, %eax
        testb   %al, %r8b
        je      .L6
        xorl    %eax, %eax
        xorl    %r8d, %r8d
        .p2align 4,,10
        .p2align 3
.L4:
        movdqu  (%rdi,%rax), %xmm1
        addl    $1, %r8d
        movdqu  (%rsi,%rax), %xmm0
        paddd   %xmm1, %xmm0
        movdqu  %xmm0, (%rdx,%rax)
        addq    $16, %rax
        cmpl    %r8d, %r10d
        ja      .L4
        cmpl    %r9d, %ecx
        je      .L1
.L3:
        movslq  %r9d, %r8
        xorl    %eax, %eax
        salq    $2, %r8
        addq    %r8, %rdx
        addq    %r8, %rdi
        addq    %r8, %rsi
        .p2align 4,,10
        .p2align 3
.L5:
        movl    (%rdi,%rax,4), %r8d
        addl    (%rsi,%rax,4), %r8d
        movl    %r8d, (%rdx,%rax,4)
        addq    $1, %rax
        leal    (%r9,%rax), %r8d
        cmpl    %r8d, %ecx
        jg      .L5
.L1:
        popq    %rbx
        .cfi_remember_state
        .cfi_def_cfa_offset 8
        ret
.L6:
        .cfi_restore_state
        xorl    %r9d, %r9d
        jmp     .L3
        .cfi_endproc

Now in your case the ints effectively not aligned (as they're on the stack), but if you can make them aligned and tell GCC so, then you can improve code generation:

Click to copy

typedef int intvec __attribute__((vector_size(16)));

void dot_int_restrict_alig(intvec * restrict a, 
                           intvec * restrict b, 
                           intvec * restrict c, 
                           unsigned int n) {
    for(unsigned int i=0; i<n; ++i) {
        c[i] = a[i] + b[i];
    }
}

This generates this code, with no preamble:

Click to copy

dot_int_restrict_alig:
.LFB628:
        .cfi_startproc
        testl   %ecx, %ecx
        je      .L23
        subl    $1, %ecx
        xorl    %eax, %eax
        addq    $1, %rcx
        salq    $4, %rcx
        .p2align 4,,10
        .p2align 3
.L25:
        movdqa  (%rdi,%rax), %xmm0
        paddd   (%rsi,%rax), %xmm0
        movdqa  %xmm0, (%rdx,%rax)
        addq    $16, %rax
        cmpq    %rcx, %rax
        jne     .L25
.L23:
        rep
        ret
        .cfi_endproc

Note the usage of the aligned 128 bit load instructions (movdqa, a as aligned, vs movdqu, unaligned).

200

answered Nov 15 '22 12:11

peppe

Related questions
                            
                                Is there a variant of min_element which takes a projection function?
                            
                                C++: Default initialize an integral type in a template function
                            
                                Filter the types of a parameter pack
                            
                                Windows Event Viewer holds a lock on my EXE file
                            
                                Template functions in namespace cause errors
                            
                                Passing struct from unmanaged C++ to C#
                            
                                template method and default template argument
                            
                                How to declare wchar_t and set its string value later on?
                            
                                Structure or class which is better for linked list?
                            
                                How to encode large complex, constant data structures in C++
                            
                                How to impersonate a user from a service correctly?
                            
                                A "generalized" finite state machine implementation
                            
                                Is it possible for QTestLib to display the GUI it is testing as it runs?
                            
                                Uniform initialization fails to copy when object has no data members
                            
                                Can I link a static library built with the v120_xp toolset into an EXE/DLL built with the v120 toolset in VS2013?
                            
                                C++11/14: How to remove a pointer-to-member from a type?
                            
                                Can I explicitly specify template arguments for an operator?
                            
                                Number of swaps in a permutation [duplicate]
                            
                                g++ : can't link with a main executable file
                            
                                C++ aggregates have no virtual functions?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

sum of overlapping arrays, auto-vectorization, and restrict

Tags:

c++

c

optimization

auto-vectorization

sse

Z boson

People also ask

1 Answers

peppe

Recent Activity

Donate For Us