Why does gcc, with -O3, unnecessarily clear a local ARM NEON array?

Consider the following code (Compiler Explorer link), compiled under gcc and clang with -O3 optimization:

#include <arm_neon.h>

void bug(int8_t *out, const int8_t *in) {
    for (int i = 0; i < 2; i++) {
        int8x16x4_t x;

        x.val[0] = vld1q_s8(&in[16 * i]);
        x.val[1] = x.val[2] = x.val[3] = vshrq_n_s8(x.val[0], 7);

        vst4q_s8(&out[64 * i], x);
    }
}

NOTE: this is a minimal reproducible example of an issue that shows up in many different functions of my actual, much more complex code, which is full of arithmetic, logical and permutation instructions performing a totally different operation from the one above. Please refrain from criticizing the code above or suggesting different ways of writing it, unless doing so has an effect on the code generation issue discussed below.
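For readers less familiar with the intrinsics, here is a rough scalar model of what bug() stores, assuming the usual semantics of vshrq_n_s8 (arithmetic shift right per lane) and vst4q_s8 (4-way interleaved store). The name bug_scalar is hypothetical and is not part of the original code; it is only a sketch for checking that a hand-edited assembly version still produces the same output.

```c
#include <stdint.h>

/* Hypothetical scalar reference for bug(); written for illustration only. */
void bug_scalar(int8_t *out, const int8_t *in) {
    for (int i = 0; i < 2; i++) {
        for (int j = 0; j < 16; j++) {
            int8_t v = in[16 * i + j];
            /* Models vshrq_n_s8(v, 7): an arithmetic shift by 7 yields
               0 for non-negative lanes and -1 for negative lanes. */
            int8_t sign = (int8_t)(v < 0 ? -1 : 0);
            /* Models vst4q_s8: lane j of val[k] lands at offset 4*j + k. */
            out[64 * i + 4 * j + 0] = v;    /* x.val[0] */
            out[64 * i + 4 * j + 1] = sign; /* x.val[1] */
            out[64 * i + 4 * j + 2] = sign; /* x.val[2] */
            out[64 * i + 4 * j + 3] = sign; /* x.val[3] */
        }
    }
}
```

Every byte of the 128-byte output is written from the 32 input bytes alone, so nothing in the result depends on any prior contents of x.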

clang generates sane code:

bug(signed char*, signed char const*):                            // @bug(signed char*, signed char const*)
        ldr     q0, [x1]
        sshr    v1.16b, v0.16b, #7
        mov     v2.16b, v1.16b
        mov     v3.16b, v1.16b
        st4     { v0.16b, v1.16b, v2.16b, v3.16b }, [x0], #64
        ldr     q0, [x1, #16]
        sshr    v1.16b, v0.16b, #7
        mov     v2.16b, v1.16b
        mov     v3.16b, v1.16b
        st4     { v0.16b, v1.16b, v2.16b, v3.16b }, [x0]
        ret

As for gcc, it inserts a lot of unnecessary operations, apparently zeroing out the registers that will be eventually input to the st4 instruction:

bug(signed char*, signed char const*):
        sub     sp, sp, #128
        # mov     x9, 0
        # mov     x8, 0
        # mov     x7, 0
        # mov     x6, 0
        # mov     x5, 0
        # mov     x4, 0
        # mov     x3, 0
        # stp     x9, x8, [sp]
        # mov     x2, 0
        # stp     x7, x6, [sp, 16]
        # stp     x5, x4, [sp, 32]
        # str     x3, [sp, 48]
        ldr     q0, [x1]
        # stp     x2, x9, [sp, 56]
        # stp     x8, x7, [sp, 72]
        sshr    v4.16b, v0.16b, 7
        # str     q0, [sp]
        # ld1     {v0.16b - v3.16b}, [sp]
        # stp     x6, x5, [sp, 88]
        mov     v1.16b, v4.16b
        # stp     x4, x3, [sp, 104]
        mov     v2.16b, v4.16b
        # str     x2, [sp, 120]
        mov     v3.16b, v4.16b
        st4     {v0.16b - v3.16b}, [x0], 64
        ### ldr     q4, [x1, 16]
        ### add     x1, sp, 64
        ### str     q4, [sp, 64]
        sshr    v4.16b, v4.16b, 7
        ### ld1     {v0.16b - v3.16b}, [x1]
        mov     v1.16b, v4.16b
        mov     v2.16b, v4.16b
        mov     v3.16b, v4.16b
        st4     {v0.16b - v3.16b}, [x0]
        add     sp, sp, 128
        ret

I manually prefixed with # all instructions that could be safely taken out, without affecting the result of the function.

In addition, the instructions prefixed with ### make an unnecessary round trip through memory (and in any case, the mov instructions following ### ld1 ... overwrite 3 of the 4 registers loaded by that ld1 instruction). They could be replaced by a single load straight into v0, with the sshr instruction in the middle of the block then using v0.16b as its source register.

As far as I know, x, being a local variable, can be left uninitialized; and even if it couldn't, all registers are properly initialized before they are used, so there's no point in zeroing them out just to immediately overwrite them with values.

I'm inclined to think this is a gcc bug, but before reporting it, I'm curious whether I missed something. Maybe there's a compilation flag, an __attribute__ or something else that I could use to make gcc generate sane code.

Thus, my question: is there anything I can do to generate sane code, or is this a bug I need to report to gcc?

asked Oct 07 '21 22:10

swineone

1 Answer

Code generation on a fairly current development version of gcc appears to have improved immensely, at least for this case.

After installing the gcc-snapshot package (dated 20210918), gcc generates the following code:

bug:
        ldr     q5, [x1]
        sshr    v4.16b, v5.16b, 7
        mov     v0.16b, v5.16b
        mov     v1.16b, v4.16b
        mov     v2.16b, v4.16b
        mov     v3.16b, v4.16b
        st4     {v0.16b - v3.16b}, [x0], 64
        ldr     q4, [x1, 16]
        mov     v0.16b, v4.16b
        sshr    v4.16b, v4.16b, 7
        mov     v1.16b, v4.16b
        mov     v2.16b, v4.16b
        mov     v3.16b, v4.16b
        st4     {v0.16b - v3.16b}, [x0]
        ret

Not ideal yet -- at least two mov instructions could be removed per iteration by changing the destination registers of ldr and sshr -- but considerably better than before.

answered Oct 23 '22 04:10

swineone

Tags: c, gcc, arm64, compiler-bug, neon
