Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why aren't those function calls optimized?

I've tried compiling this code both with Clang and GCC:

struct s { int _[50]; };

void (*pF)(const struct s), (*pF1)(struct s), (*pF2)(struct s *);

main()
{
    struct s a;

    pF2(&a);

    pF(a), pF1(a);
}

The result is the same. Although the call to pF isn't allowed to modify its only argument, the object a is copied for the second call to pF1. Why is this?

Here is the assembly output (from GCC):

; main

        push    rbx
        sub rsp, 0D0h
        mov rbx, rsp
        mov rdi, rsp
        call    cs:pF2

    ;create argument for pF1 call (as there the argument is modified)
    ;and copy the local a into it
    ;although it seems not needed because the local isn't futher read anyway

        sub rsp, 0D0h
        mov rsi, rbx
        mov ecx, 19h
        mov rdi, rsp
    ;

        rep movsq
        call    cs:pF

    ;copy the local a into the argument created once again
    ;though the argument cannot be modified by the function pointed by pF

        mov rdi, rsp
        mov rsi, rbx
        mov ecx, 19h
        rep movsq
    ;

        call    cs:pF1
        add rsp, 1A0h
        xor eax, eax
        pop rbx
        retn

Can't the optimizer see that as the function pointed by pF can't modify its parameter (as it's declared const) and so omit the last copying operation? Also, recently I saw that, as the variable a isn't further read in the code, it could use its storage for the function arguments.

The same code can be written as:

; main

            push    rbx
            sub rsp, 0D0h

            mov rdi, rsp
            call    cs:pF2

            call    cs:pF

            call    cs:pF1
            add rsp, 0D0h
            xor eax, eax
            pop rbx
            retn

I'm compiling with -O3 flag. Am I missing something?

It's the same even if I don't invoke UB (as the functions pointers are by default NULL) and I instead initialize them like:

#include <stdio.h>

struct s { int _[50]; };

extern void f2(struct s *a);

void (*pF)(const struct s), (*pF1)(struct s), (*pF2)(struct s *) = f2;

extern void f1(struct s a)
{
    a._[2] = 90;
}

extern void f(const struct s a)
{
    for(size_t i = 0; i < sizeof(a._)/sizeof(a._[0]); ++i)
        printf("%d\n", a._[i]);
}

extern void f2(struct s *a)
{
    a->_[6] = 90;

    pF1 = f1, pF = f;
}
like image 699
AnArrayOfFunctions Avatar asked Mar 08 '16 12:03

AnArrayOfFunctions


2 Answers

I don't believe this optimisation is legal. What you are overlooking is that a function type with a const argument is compatible with a function type with a non-const argument, so a function that mutates its argument may be assigned to the pointer pF.

Here is an example program:

struct s {
    int x;
};

/* Black hole so that DCE doesn't eat everything */
void observe(void *);

void (*pF)(const struct s);

void test(struct s arg) {
    arg.x = 0;
    observe(&arg);
}

void assignment(void) {
    pF = test;
}

The bottom line is that a const annotation for the argument gives the compiler no reliable information about whether argument storage is mutated by the callee. Performing this optimisation would seem to require that the ABI be such that argument storage is required to not be mutated (or some kind of whole program analysis, but never mind that).

like image 109
gsg Avatar answered Nov 17 '22 06:11

gsg


I think the function still needs to make one copy (see the end for what I think is the most optimal allowable version). The rest are (more or less understandable) optimization failures.


The SysV x86-64 ABI doesn't guarantee that a function won't modify its stack-args. It doesn't say anything about const. Anything it doesn't guarantee can't be assumed. It just says that large objects passed by value go on the stack; nothing about the state when the called function returns. The callee "owns" its args, even if they're declared const. See also the x86 wiki, but the ABI doc itself is the only link in the wiki that's really relevant.

Similarly, narrow integer types can be in registers with garbage in the high bits as args or return values. The ABI doesn't explicitly say anything either way, so there is no guarantee that high bits are zeroed. This is in fact what gcc does: it assumes there is high garbage when receiving values, and will leave high garbage when passing values. Same goes for float/double in xmm regs. I confirmed this with one of the ABI maintainers recently, while investigating some unsafe code generated by clang. So I'm sure that the correct interpretation is that you shouldn't assume anything not explicitly guaranteed by the ABI.


gcc doesn't do this, but I believe it would be legal for a called function like this to not actually make a copy:

void modifyconstarg(const struct s x) {
  // x.arr[10] = 10;  // This is a compile-time error
  struct s xtmp = x;  // gcc/clang: make a full copy before this
  xtmp.arr[11]=11;
  pFconstval(xtmp);   // gcc/clang: make a full copy here
}

Instead, just store into its arg and jmp pFconstval.

My guess is that it's a missed optimization, rather than gcc and clang being conservative in their interpretation of the standard.


It appears that gcc and clang don't do a great job at optimizing away copies for objects too large to fit in a register. Source code that didn't copy them in the first place would be even better than the best job the compiler could do with this (e.g. pass by const * or C++ const-reference), since I don't think your suggested optimization is legal.

gcc and clang do significantly worse than the best legal optimization, though: see the output on godbolt.

Weird thing: with -march=haswell (or any other Intel CPU), gcc emits a function call to memcpy instead of rep movsq inline code. I don't get it. It does this even with -ffreestanding / -nostdlib

IDK if anyone else kept thinking that rdi was a pointer to the memory, i.e. that it was passed by invisible reference. It took me ages to fully grok that the call-by-value functions don't take any parameters in registers at all. I kept thinking it was weird that rep movsq left rdi pointing to the high copy.


You don't need function pointers to reproduce this; normal functions with prototypes (and more descriptive names) still demonstrates it.

struct s { int _[50]; };

//void (*pFconstval)(const struct s), (*pFval)(struct s), (*pFref)(struct s *);
void pFref(struct s *);  void pFconstval(const struct s), pFval(struct s);

void func(void) {
    struct s a;
    pFref(&a);
    pFconstval(a);  pFval(a);
}

void modifyconstarg(const struct s x) {
  // x.arr[10] = 10;  // This is a compile-time error
  struct s xtmp = x;  // full copy here
  xtmp.arr[11]=11;
  pFconstval(xtmp);   // full copy here
}

void modifyarg(struct s x) {
  x.arr[10] = 10;
  pFconstval(x);
}

gcc's output for modifyarg is hilarious:

    lea     rdi, [rsp+8]
    mov     DWORD PTR [rsp+48], 10
    mov     ecx, 25
    mov     rsi, rdi                ; src=dest
    rep movsq                       ; in-place "copy"
    jmp     pFconstval

It does the copy even if you don't modify x. Clang makes an actual copy to a different location before the tailcall jmp.


The best legal version of your function

as I understand the ABI:

    sub     rsp, 416
    mov     rdi, rsp
    call    pFref               ; or call [pF2] if using function pointers.  Is your disassembly in MASM syntax?
    lea     rdi, [rsp+208]      ; aligned by 16 for hopefully better rep movsq perf
                                ; and so the stack is aligned by 16 at each location
    mov     rsi, rsp
    mov     ecx, 25
    rep movsq
    call    pFconstval          ; clobbering the low copy
    add     rsp, 208
    call    pFval               ; clobbering the remaining high copy
    add     rsp, 208
    ret

BTW, gcc's use of rbx is silly. It saves four code bytes:
push/pop: 2 bytes. mov rbx, rsp: 3B. 2x mov rsi, rbx: 2x3B. Total = 12B

replacing all that with 2x lea rsi, [rsp+208]: 2x 8B. Total = 16B.

It doesn't avoid the extra stack-engine sync uop since mov rdi, rsp is also used. 4B of code isn't worth spending 3 uops. In my version, which only copies once (and only needs one LEA), it's also a loss in code bytes.

like image 41
Peter Cordes Avatar answered Nov 17 '22 08:11

Peter Cordes