
Would passing a variable of an integral type to a function by reference ever be more efficient than passing it by value?

I know the usual advice is that a variable of an integral type like int, double, long double, etc. should be passed to a function by value. But I'm curious: from an assembly point of view (performance-wise or space-wise), wouldn't there be a situation where passing by reference is more efficient, namely when the variable is bigger than a pointer? On my platform, for example, long double is 8 bytes while pointers are only 4 bytes.
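For concreteness, a minimal sketch of the two signatures being compared, assuming a platform where long double is 8 bytes and pointers are 4 bytes; the function names are hypothetical, and a pointer stands in for the reference since that is how a reference is passed at the machine level:

/* Hypothetical pair of functions, assuming sizeof(long double) == 8
   and sizeof(void *) == 4, as described in the question. */
long double twice_by_value(long double x)             /* copies 8 bytes into the call */
{
    return x * 2.0L;
}

long double twice_by_reference(const long double *x)  /* passes a 4-byte pointer, dereferenced inside */
{
    return *x * 2.0L;
}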

asked Sep 27 '10 by Pooria

2 Answers

Passing a pointer/reference to an integer value larger than the native pointer size might well be locally optimal, but it's difficult to say whether it would be globally optimal. This largely comes down to the callee's use of the value. If it's truly an integer and treated as such by the callee, it's likely that at some point the value is going to be loaded into one or more registers anyway (in order for the program to perform arithmetic on it, for example), incurring additional overhead in the callee to dereference the pointer. If the callee is inlined by an optimizing compiler, it is possible that the compiler will simply pass the integer value split across two registers. If, however, the callee cannot be inlined (because it's third-party API code, for example), then the compiler cannot perform this kind of optimization, and passing a pointer might indeed be more efficient. That said, it's unlikely you'll find library functions that take an integer by reference unless it's so that the callee can modify the caller's value, which introduces a whole different set of issues.
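A rough sketch of the two shapes being compared (hypothetical functions, assuming a 32-bit target where a uint64_t is wider than a pointer):

#include <stdint.h>

/* Illustration only: the by-reference version still has to load the value
   before it can do arithmetic on it. */
uint64_t next_by_value(uint64_t x)             /* value arrives split across two registers (or the stack) */
{
    return x + 1;
}

uint64_t next_by_reference(const uint64_t *x)  /* pointer fits in one register, but the callee must dereference */
{
    return *x + 1;
}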

More often than not, a modern optimizing compiler will make a close-to-optimal decision taking all of these considerations into account, and it is usually best for the programmer not to try to preempt the compiler with premature optimization. In fact, doing so may lead to less efficient code.

The most sensible thing to do in the vast majority of cases is to write your code in the way that best communicates your intent (pass by value for "value" types unless, adopting C# terminology, the argument is semantically an "out" or "reference" parameter), and to worry about efficiency only if there is a clear performance bottleneck.
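A small illustration of that guideline (the names are hypothetical): plain inputs are passed by value, and a pointer is reserved for an argument that is semantically an "out" parameter.

#include <stdbool.h>
#include <stdlib.h>

double scaled(double x, double factor)                 /* plain input: pass by value */
{
    return x * factor;
}

bool try_parse_u32(const char *s, unsigned int *out)   /* *out is semantically an "out" parameter */
{
    char *end;
    unsigned long v = strtoul(s, &end, 10);
    if (end == s || v > 0xFFFFFFFFUL)
        return false;
    *out = (unsigned int)v;
    return true;
}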

answered Oct 05 '22 by Richard Cook

Test, test, test; disassemble, disassemble, disassemble.

Simple, native-sized integer:

unsigned int fun_one ( unsigned int a )
{
    return((a&7)+1);
}

unsigned int fun_two ( unsigned int *a )
{
    return((*a&7)+1);
}

With no optimization, you have one extra instruction when passing by reference, needed to load the value at that address before anything can be done with it.

00000000 <fun_one>:
   0:   e52db004    push    {fp}        ; (str fp, [sp, #-4]!)
   4:   e28db000    add fp, sp, #0
   8:   e24dd00c    sub sp, sp, #12
   c:   e50b0008    str r0, [fp, #-8]
  10:   e51b3008    ldr r3, [fp, #-8]
  14:   e2033007    and r3, r3, #7
  18:   e2833001    add r3, r3, #1
  1c:   e1a00003    mov r0, r3
  20:   e28bd000    add sp, fp, #0
  24:   e49db004    pop {fp}        ; (ldr fp, [sp], #4)
  28:   e12fff1e    bx  lr

0000002c <fun_two>:
  2c:   e52db004    push    {fp}        ; (str fp, [sp, #-4]!)
  30:   e28db000    add fp, sp, #0
  34:   e24dd00c    sub sp, sp, #12
  38:   e50b0008    str r0, [fp, #-8]
  3c:   e51b3008    ldr r3, [fp, #-8]
  40:   e5933000    ldr r3, [r3]
  44:   e2033007    and r3, r3, #7
  48:   e2833001    add r3, r3, #1
  4c:   e1a00003    mov r0, r3
  50:   e28bd000    add sp, fp, #0
  54:   e49db004    pop {fp}        ; (ldr fp, [sp], #4)
  58:   e12fff1e    bx  lr

With optimization (-O1 through -O3 gave the same result), you still lose an instruction loading the value.

  
00000000 <fun_one>:
   0:   e2000007    and r0, r0, #7
   4:   e2800001    add r0, r0, #1
   8:   e12fff1e    bx  lr

0000000c <fun_two>:
   c:   e5900000    ldr r0, [r0]
  10:   e2000007    and r0, r0, #7
  14:   e2800001    add r0, r0, #1
  18:   e12fff1e    bx  lr

And it is going to continue like that for pretty much any single-sized thing you can pass in. With 64-bit integers you still burn the extra instruction and memory cycles loading from the reference into registers in order to operate. An array of somethings you cannot really pass by value anyway, can you? But a struct you can, and reaching into a struct, reference or not, is probably going to require some addressing.

typedef struct
{
    unsigned int a;
    unsigned int b;
    char c[4];
} ruct;

unsigned int fun_one ( ruct a )
{
    return((a.c[3]&7)+1);
}

unsigned int fun_two ( ruct *a )
{
    return((a->c[3]&7)+1);
}

With no optimization we start with a tie: 12 instructions each. I would have to stare at it more to decide whether one burns more clock cycles than the other.

00000000 <fun_one>:
   0:   e52db004    push    {fp}        ; (str fp, [sp, #-4]!)
   4:   e28db000    add fp, sp, #0
   8:   e24dd014    sub sp, sp, #20
   c:   e24b3010    sub r3, fp, #16
  10:   e8830007    stm r3, {r0, r1, r2}
  14:   e55b3005    ldrb    r3, [fp, #-5]
  18:   e2033007    and r3, r3, #7
  1c:   e2833001    add r3, r3, #1
  20:   e1a00003    mov r0, r3
  24:   e28bd000    add sp, fp, #0
  28:   e49db004    pop {fp}        ; (ldr fp, [sp], #4)
  2c:   e12fff1e    bx  lr

00000030 <fun_two>:
  30:   e52db004    push    {fp}        ; (str fp, [sp, #-4]!)
  34:   e28db000    add fp, sp, #0
  38:   e24dd00c    sub sp, sp, #12
  3c:   e50b0008    str r0, [fp, #-8]
  40:   e51b3008    ldr r3, [fp, #-8]
  44:   e5d3300b    ldrb    r3, [r3, #11]
  48:   e2033007    and r3, r3, #7
  4c:   e2833001    add r3, r3, #1
  50:   e1a00003    mov r0, r3
  54:   e28bd000    add sp, fp, #0
  58:   e49db004    pop {fp}        ; (ldr fp, [sp], #4)
  5c:   e12fff1e    bx  lr

But look what happens with optimization. The structure was sized such that it fit in registers when passed in.

  
00000000 <fun_one>:
   0:   e24dd010    sub sp, sp, #16
   4:   e28d3004    add r3, sp, #4
   8:   e8830007    stm r3, {r0, r1, r2}
   c:   e5dd100f    ldrb    r1, [sp, #15]
  10:   e2010007    and r0, r1, #7
  14:   e2800001    add r0, r0, #1
  18:   e28dd010    add sp, sp, #16
  1c:   e12fff1e    bx  lr

00000020 <fun_two>:
  20:   e5d0100b    ldrb    r1, [r0, #11]
  24:   e2010007    and r0, r1, #7
  28:   e2800001    add r0, r0, #1
  2c:   e12fff1e    bx  lr

Sadly, gcc did not do a very good job optimizing this one: it could have done the shift and the AND in one instruction on r3, an add, and the bx lr, three instructions in total, beating out the pass-by-reference version.

You need to know the compiler and the calling convention: does it pass arguments in registers or always on the stack? If registers are used, what does it do when your arguments need more space than the reserved registers can hold: does it fill them up and then use the stack, or does it just use the stack and no registers? Does it pass a pointer to memory holding the argument, pass-by-reference style, but such that the value passed in is protected?
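For example, under the 32-bit ARM convention seen in the listings above, the first four word-sized arguments travel in r0-r3 and anything beyond that spills to the stack. A hypothetical signature that forces the spill:

/* Sketch, assuming the 32-bit ARM calling convention used in the listings:
   a, b, c, d arrive in r0-r3; e no longer fits and is passed on the stack. */
unsigned int spill_example ( unsigned int a, unsigned int b,
                             unsigned int c, unsigned int d,
                             unsigned int e )
{
    return(a+b+c+d+e);
}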

You also have to look beyond the individual functions at how much memory and register work has to happen to prepare the call to the function. Pass by reference for the structure example would be a single load or immediate to fill one register with the address of the structure. Pass by value of the structure, in the case of ARM, would be a single instruction to load the three registers with the structure, but it takes potentially three clock cycles (or 6, or 2, depending on the AMBA/AXI bus). On other processors it might cost you three instructions plus a data clock cycle for each register. So even if gcc had done a better job optimizing the pass-by-value structure example, pass by reference could have just edged it out by a clock cycle or two, but that depends heavily on what the code in the calling function looks like. To really know, you have to test it by timing the code accurately, and disassemble to understand why it gets faster or slower as you tune it.
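A crude timing harness along those lines, as a minimal sketch; it assumes fun_one and fun_two live in a separate file so the compiler cannot inline them away:

#include <stdio.h>
#include <time.h>

extern unsigned int fun_one ( unsigned int a );
extern unsigned int fun_two ( unsigned int *a );

int main ( void )
{
    volatile unsigned int sink = 0;   /* volatile so the loops are not optimized out */
    unsigned int v = 12345;
    unsigned long i;
    const unsigned long iters = 100000000UL;
    clock_t t0, t1, t2;

    t0 = clock();
    for(i = 0; i < iters; i++) sink += fun_one(v);
    t1 = clock();
    for(i = 0; i < iters; i++) sink += fun_two(&v);
    t2 = clock();

    printf("by value:     %f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("by reference: %f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    return(0);
}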

answered Oct 05 '22 by old_timer