Why does GCC aggregate initialization of an array fill the whole thing with zeros first, including non-zero elements?

Why does gcc fill the whole array with zeros instead of only the remaining 96 integers? The non-zero initializers are all at the start of the array.

void *sink;
void bar() {
    int a[100]{1,2,3,4};
    sink = a;             // a escapes the function
    asm("":::"memory");   // and compiler memory barrier
    // forces the compiler to materialize a[] in memory instead of optimizing away
}

MinGW 8.1 and GCC 9.2 both produce asm like this (from the Godbolt compiler explorer).

# gcc9.2 -O3 -m32 -mno-sse
bar():
    push    edi                       # save call-preserved EDI which rep stos uses
    xor     eax, eax                  # eax=0
    mov     ecx, 100                  # repeat-count = 100
    sub     esp, 400                  # reserve 400 bytes on the stack
    mov     edi, esp                  # dst for rep stos
        mov     DWORD PTR sink, esp       # sink = a
    rep stosd                         # memset(a, 0, 400) 

    mov     DWORD PTR [esp], 1        # then store the non-zero initializers
    mov     DWORD PTR [esp+4], 2      # over the zeroed part of the array
    mov     DWORD PTR [esp+8], 3
    mov     DWORD PTR [esp+12], 4
    # the empty asm statement (the compiler memory barrier) is here

    add     esp, 400                  # cleanup the stack
    pop     edi                       # and restore caller's EDI
    ret

(With SSE enabled, it would copy all 4 initializers with a movdqa load/store.)

Why doesn't GCC do lea edi, [esp+16] and memset (with rep stosd) only the last 96 elements, like Clang does? Is this a missed optimization, or is it somehow more efficient to do it this way? (Clang actually calls memset instead of inlining rep stos)
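To make the two strategies concrete, here is a minimal C++ sketch that writes out by hand what each approach amounts to. The function names are made up for illustration, and the sketch says nothing about which version is faster; it only shows that both leave the array in the same state.

```cpp
#include <cassert>
#include <cstring>

// Strategy seen in the GCC 9.2 output: zero all 100 elements,
// then overwrite the first four with the non-zero initializers.
void init_zero_all_then_store(int (&a)[100]) {
    std::memset(a, 0, sizeof a);
    a[0] = 1; a[1] = 2; a[2] = 3; a[3] = 4;
}

// Strategy the question asks about: store the four initializers,
// then zero only the remaining 96 elements (the tail starting at a + 4).
void init_store_then_zero_tail(int (&a)[100]) {
    a[0] = 1; a[1] = 2; a[2] = 3; a[3] = 4;
    std::memset(a + 4, 0, 96 * sizeof(int));
}

// Returns true when both strategies leave identical bytes in the array.
bool strategies_agree() {
    int x[100], y[100];
    init_zero_all_then_store(x);
    init_store_then_zero_tail(y);
    return std::memcmp(x, y, sizeof x) == 0;
}
```

Calling strategies_agree() from main should return true; the only difference between the two functions is that the first one writes the first 16 bytes twice.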

Editor's note: the question originally had un-optimized compiler output which worked the same way, but inefficient code at -O0 doesn't prove anything. But it turns out that this optimization is missed by GCC even at -O3.

Passing a pointer to a to a non-inline function would be another way to force the compiler to materialize a[], but in 32-bit code that leads to significant clutter of the asm. (Stack args result in pushes, which get mixed in with stores to the stack to init the array.)
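For reference, the function-call variant might look roughly like the sketch below. consume and bar_with_call are hypothetical names, not from the original test case. In a real experiment the callee would normally be defined in another translation unit so the optimizer cannot see its body; the GCC/Clang noinline attribute is used here only so the example fits in one file, and it is an assumption that this is enough to keep the array in memory on a given compiler version.

```cpp
// Hypothetical callee. The noinline attribute (GCC/Clang extension) asks the
// compiler not to inline it, so the caller has to pass a real pointer.
__attribute__((noinline)) int consume(const int *p, int n) {
    int sum = 0;
    for (int i = 0; i < n; ++i)
        sum += p[i];          // reads every element of the caller's array
    return sum;
}

// Same initializer as bar(), but the array escapes through a call
// instead of through a global pointer plus an asm memory barrier.
int bar_with_call() {
    int a[100]{1, 2, 3, 4};
    return consume(a, 100);   // 1 + 2 + 3 + 4 + 96 zeros
}
```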

Using volatile int a[100]{1,2,3,4} gets GCC to create and then copy the array, which is insane. Normally volatile is good for looking at how compilers init local variables or lay them out on the stack.

asked Nov 24 '19 20:11

Lassie

1 Answer

In theory your initialization could look like this:

int a[100] = {
  [3] = 1,
  [5] = 42,
  [88] = 1,
};

so it may be more efficient, in terms of cache behavior and optimizability, to first zero out the whole memory block and then set the individual values.

The behavior may change depending on:

target architecture
target OS
array length
initialization ratio (explicitly initialized values/length)
positions of the initialized values

Of course, in your case the initializers are all packed at the start of the array, and the optimization would be trivial.

So it seems that GCC is taking the most generic approach here. It looks like a missed optimization.
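As a rough illustration of why the generic approach is attractive for scattered initializers, here is a hand-written C++ sketch of the sparse example above (indices 3, 5 and 88). The function names are invented for this sketch, and it is not a claim about what GCC does internally; it only compares the shape of the two approaches.

```cpp
#include <cassert>
#include <cstring>

// Generic strategy: one bulk zeroing pass over the whole array,
// then one store per explicit initializer.
void init_sparse_generic(int (&a)[100]) {
    std::memset(a, 0, sizeof a);
    a[3] = 1;
    a[5] = 42;
    a[88] = 1;
}

// "Zero only the gaps" strategy: every run of zeros between the explicit
// initializers has to be handled separately, which takes four pieces here.
void init_sparse_gaps(int (&a)[100]) {
    std::memset(a, 0, 3 * sizeof(int));        // elements 0..2
    a[3] = 1;
    a[4] = 0;                                  // single-element gap
    a[5] = 42;
    std::memset(a + 6, 0, 82 * sizeof(int));   // elements 6..87
    a[88] = 1;
    std::memset(a + 89, 0, 11 * sizeof(int));  // elements 89..99
}

// Returns true when both strategies produce the same array contents.
bool sparse_strategies_agree() {
    int x[100], y[100];
    init_sparse_generic(x);
    init_sparse_gaps(y);
    return std::memcmp(x, y, sizeof x) == 0;
}
```

With the initializers packed at the front, as in the question, the gap version collapses to a single memset of the tail, which is why skipping the first 16 bytes looks like the easy case.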

answered Sep 17 '22 17:09

vlad_tepesch
