
Why does gcc, with -O3, unnecessarily clear a local ARM NEON array?

Consider the following code (Compiler Explorer link), compiled with gcc and clang at -O3:

#include <arm_neon.h>

void bug(int8_t *out, const int8_t *in) {
    for (int i = 0; i < 2; i++) {
        int8x16x4_t x;

        x.val[0] = vld1q_s8(&in[16 * i]);
        x.val[1] = x.val[2] = x.val[3] = vshrq_n_s8(x.val[0], 7);

        vst4q_s8(&out[64 * i], x);
    }
}

NOTE: this is a minimal reproducible example of an issue that shows up in many different functions of my actual, much more complex code, which is filled with arithmetic/logical/permutation instructions performing a totally different operation from the one above. Please refrain from criticizing the code above or suggesting different ways of doing what it does, unless the change affects the code generation issue discussed below.
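When comparing compiler versions or flags, it helps to confirm that the generated code still computes the same result. The following is a minimal scalar sketch of what bug() computes, not part of the original code: bug_ref is a hypothetical name, and the model assumes the usual semantics of the intrinsics (vshrq_n_s8 by 7 as an arithmetic shift that yields -1 for negative lanes and 0 otherwise, and vst4q_s8 as a 4-way interleaved store).

```c
#include <stdint.h>

/* Hypothetical scalar reference for bug(): reads 32 input bytes and
   writes 128 output bytes. Lane j of input block i produces the four
   consecutive bytes starting at out[64 * i + 4 * j]. */
void bug_ref(int8_t *out, const int8_t *in) {
    for (int i = 0; i < 2; i++) {
        for (int j = 0; j < 16; j++) {
            int8_t v = in[16 * i + j];
            /* Arithmetic shift right by 7 of an 8-bit value:
               -1 for negative inputs, 0 otherwise. Written as a
               comparison to stay portable in plain C. */
            int8_t sign = (v < 0) ? -1 : 0;
            int8_t *o = &out[64 * i + 4 * j];
            o[0] = v;    /* x.val[0] */
            o[1] = sign; /* x.val[1] */
            o[2] = sign; /* x.val[2] */
            o[3] = sign; /* x.val[3] */
        }
    }
}
```

Running bug() and bug_ref() on the same 32-byte input and comparing the 128 output bytes (for example with memcmp) should show no difference, whichever compiler produced bug().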

clang generates sane code:

bug(signed char*, signed char const*):                            // @bug(signed char*, signed char const*)
        ldr     q0, [x1]
        sshr    v1.16b, v0.16b, #7
        mov     v2.16b, v1.16b
        mov     v3.16b, v1.16b
        st4     { v0.16b, v1.16b, v2.16b, v3.16b }, [x0], #64
        ldr     q0, [x1, #16]
        sshr    v1.16b, v0.16b, #7
        mov     v2.16b, v1.16b
        mov     v3.16b, v1.16b
        st4     { v0.16b, v1.16b, v2.16b, v3.16b }, [x0]
        ret

As for gcc, it inserts a lot of unnecessary operations, apparently zeroing out a stack copy of x and then loading from it the registers that eventually feed the st4 instruction:

bug(signed char*, signed char const*):
        sub     sp, sp, #128
        # mov     x9, 0
        # mov     x8, 0
        # mov     x7, 0
        # mov     x6, 0
        # mov     x5, 0
        # mov     x4, 0
        # mov     x3, 0
        # stp     x9, x8, [sp]
        # mov     x2, 0
        # stp     x7, x6, [sp, 16]
        # stp     x5, x4, [sp, 32]
        # str     x3, [sp, 48]
        ldr     q0, [x1]
        # stp     x2, x9, [sp, 56]
        # stp     x8, x7, [sp, 72]
        sshr    v4.16b, v0.16b, 7
        # str     q0, [sp]
        # ld1     {v0.16b - v3.16b}, [sp]
        # stp     x6, x5, [sp, 88]
        mov     v1.16b, v4.16b
        # stp     x4, x3, [sp, 104]
        mov     v2.16b, v4.16b
        # str     x2, [sp, 120]
        mov     v3.16b, v4.16b
        st4     {v0.16b - v3.16b}, [x0], 64
        ### ldr     q4, [x1, 16]
        ### add     x1, sp, 64
        ### str     q4, [sp, 64]
        sshr    v4.16b, v4.16b, 7
        ### ld1     {v0.16b - v3.16b}, [x1]
        mov     v1.16b, v4.16b
        mov     v2.16b, v4.16b
        mov     v3.16b, v4.16b
        st4     {v0.16b - v3.16b}, [x0]
        add     sp, sp, 128
        ret

I manually prefixed with # all instructions that could be safely taken out, without affecting the result of the function.

In addition, the instructions prefixed with ### make an unnecessary round trip through memory. The mov instructions following ### ld1 ... overwrite 3 of the 4 registers loaded by that ld1 anyway, so the whole group could be replaced by a single load straight into v0.16b, with the sshr instruction in the middle of the block then using v0.16b as its source register.

As far as I know, x, being a local variable, may be left uninitialized; and even if it could not, all four registers are assigned before the store, so there's no point in zeroing them out only to overwrite them immediately.

I'm inclined to think this is a gcc bug, but before reporting it, I'm curious whether I missed something. Maybe there's a compilation flag, an __attribute__ or something else that I could use to make gcc generate sane code.

Thus, my question: is there anything I can do to generate sane code, or is this a bug I need to report to gcc?

asked Oct 07 '21 22:10 by swineone




1 Answer

Code generation on a fairly current development version of gcc appears to have improved immensely, at least for this case.

After installing the gcc-snapshot package (dated 20210918), gcc generates the following code:

bug:
        ldr     q5, [x1]
        sshr    v4.16b, v5.16b, 7
        mov     v0.16b, v5.16b
        mov     v1.16b, v4.16b
        mov     v2.16b, v4.16b
        mov     v3.16b, v4.16b
        st4     {v0.16b - v3.16b}, [x0], 64
        ldr     q4, [x1, 16]
        mov     v0.16b, v4.16b
        sshr    v4.16b, v4.16b, 7
        mov     v1.16b, v4.16b
        mov     v2.16b, v4.16b
        mov     v3.16b, v4.16b
        st4     {v0.16b - v3.16b}, [x0]
        ret

Not ideal yet: at least two mov instructions per iteration could be removed by changing the destination registers of ldr and sshr. Still, it is considerably better than before.

answered Oct 23 '22 04:10 by swineone
