I know it's usually said that when passing a variable of a built-in arithmetic type like int, double, long double, etc. to a function, it should be passed by value. But I'm curious, from an assembly point of view (performance-wise or space-wise): wouldn't there be situations where passing by reference would be more efficient for a type that is bigger than a pointer, such as long double on my platform, where it is 8 bytes while pointers are only 4 bytes?
Passing a pointer/reference to an integer value larger than the native pointer size might well be locally optimal, but it's difficult to say whether it would be globally optimal. This largely comes down to the callee's use of the value. If it's truly an integer and treated as such by the callee, then at some point the value is going to be loaded into one or more registers anyway (in order for the program to perform arithmetic on it, for example), incurring additional overhead in the callee to dereference the pointer. If the callee is inlined by an optimizing compiler, the compiler may simply keep the integer value split across two registers. If, however, the callee cannot be inlined (if it's third-party API code, for example), then the compiler cannot perform this kind of inlining, and indeed passing a pointer might be more efficient, though it's unlikely you'll find library functions that take an integer by reference unless it's so that the callee can modify the caller's value, which introduces a whole different set of issues.
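For instance, here is a minimal sketch of the case the question asks about (the function names are mine, not from this answer, and it assumes a 32-bit target where long double is 8 bytes and pointers are 4 bytes, as in the question): by value the whole 8-byte value travels per the ABI, by reference only a 4-byte pointer does, but the callee must then dereference it before it can do any arithmetic.

/* Hedged sketch, my own example: an 8-byte value passed both ways
 * on an assumed 32-bit target. */
long double half_by_value ( long double x )        { return x / 2.0L; }
long double half_by_ref   ( const long double *x ) { return *x / 2.0L; }  /* extra load through the pointer */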
More often than not, a modern optimizing compiler will make a close-to-optimal decision taking all of these things into consideration, and it is usually best for the programmer not to try to preempt the compiler with premature optimization; in fact, that may lead to less efficient code.
The most sensible thing to do in the vast majority of cases is to write your code in the way that best communicates your intent (pass-by-value for "value" types unless the argument is - adopting C# terminology - semantically an "out" or "reference" parameter) and worry about efficiency only if there is a clear performance bottleneck.
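As a rough C illustration of that guideline (the names here are my own, not from the answer): let the signature say what the parameter means, passing plain values by value and using a pointer only where the callee is genuinely meant to write a result back, the closest C analogue of a C# "out" parameter.

#include <stdlib.h>

/* Pass by value: the argument is just a value the function reads. */
double celsius_to_fahrenheit ( double c )
{
    return c * 9.0 / 5.0 + 32.0;
}

/* "Out" parameter via pointer: the function writes a result back to the
 * caller and returns a status.  Hypothetical example, my own code. */
int parse_port ( const char *text, int *out_port )
{
    char *end;
    long v = strtol(text, &end, 10);
    if (end == text || v < 1 || v > 65535)
        return -1;              /* failure: *out_port left untouched */
    *out_port = (int)v;
    return 0;                   /* success */
}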
Test, test, test, disassemble, disassemble, disassemble.
A simple, native-sized integer:
unsigned int fun_one ( unsigned int a )
{
    return((a&7)+1);
}

unsigned int fun_two ( unsigned int *a )
{
    return((*a&7)+1);
}
With no optimization, you have one extra instruction when passing by reference, in order to load the value at that address before doing something with it.
00000000 <fun_one>:
   0:   e52db004    push    {fp}        ; (str fp, [sp, #-4]!)
   4:   e28db000    add     fp, sp, #0
   8:   e24dd00c    sub     sp, sp, #12
   c:   e50b0008    str     r0, [fp, #-8]
  10:   e51b3008    ldr     r3, [fp, #-8]
  14:   e2033007    and     r3, r3, #7
  18:   e2833001    add     r3, r3, #1
  1c:   e1a00003    mov     r0, r3
  20:   e28bd000    add     sp, fp, #0
  24:   e49db004    pop     {fp}        ; (ldr fp, [sp], #4)
  28:   e12fff1e    bx      lr

0000002c <fun_two>:
  2c:   e52db004    push    {fp}        ; (str fp, [sp, #-4]!)
  30:   e28db000    add     fp, sp, #0
  34:   e24dd00c    sub     sp, sp, #12
  38:   e50b0008    str     r0, [fp, #-8]
  3c:   e51b3008    ldr     r3, [fp, #-8]
  40:   e5933000    ldr     r3, [r3]
  44:   e2033007    and     r3, r3, #7
  48:   e2833001    add     r3, r3, #1
  4c:   e1a00003    mov     r0, r3
  50:   e28bd000    add     sp, fp, #0
  54:   e49db004    pop     {fp}        ; (ldr fp, [sp], #4)
  58:   e12fff1e    bx      lr
With optimization (-O1 through -O3 gave the same result), you still lose an instruction loading the value.
00000000 <fun_one>:
   0:   e2000007    and     r0, r0, #7
   4:   e2800001    add     r0, r0, #1
   8:   e12fff1e    bx      lr

0000000c <fun_two>:
   c:   e5900000    ldr     r0, [r0]
  10:   e2000007    and     r0, r0, #7
  14:   e2800001    add     r0, r0, #1
  18:   e12fff1e    bx      lr
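The same pattern holds for wider integers. As a rough sketch (fun_one64 and fun_two64 are my names, not from the original answer), the 64-bit analogues on a 32-bit ARM look like this: the by-value argument arrives split across a register pair (r0/r1), while the by-reference version still has to issue the extra load(s) through the pointer before it can mask and add.

/* Hedged sketch, my own code, same shape as fun_one/fun_two above. */
unsigned long long fun_one64 ( unsigned long long a )
{
    return((a&7)+1);
}

unsigned long long fun_two64 ( unsigned long long *a )
{
    return((*a&7)+1);
}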
And it is going to continue like that for pretty much any single-sized thing you can pass in. With 64-bit integers, you still burn the extra instruction and memory cycles loading from the reference into registers in order to operate. For an array of somethings you cannot really do a pass by value, can you? But for a struct you can, and reaching into a struct, reference or not, is probably going to require some addressing.
typedef struct
{
    unsigned int a;
    unsigned int b;
    char c[4];
} ruct;

unsigned int fun_one ( ruct a )
{
    return((a.c[3]&7)+1);
}

unsigned int fun_two ( ruct *a )
{
    return((a->c[3]&7)+1);
}
With no optimization we start with a tie, 12 instructions each. I would have to stare at it more to decide whether one burns more clock cycles than the other.
00000000 <fun_one>:
   0:   e52db004    push    {fp}        ; (str fp, [sp, #-4]!)
   4:   e28db000    add     fp, sp, #0
   8:   e24dd014    sub     sp, sp, #20
   c:   e24b3010    sub     r3, fp, #16
  10:   e8830007    stm     r3, {r0, r1, r2}
  14:   e55b3005    ldrb    r3, [fp, #-5]
  18:   e2033007    and     r3, r3, #7
  1c:   e2833001    add     r3, r3, #1
  20:   e1a00003    mov     r0, r3
  24:   e28bd000    add     sp, fp, #0
  28:   e49db004    pop     {fp}        ; (ldr fp, [sp], #4)
  2c:   e12fff1e    bx      lr

00000030 <fun_two>:
  30:   e52db004    push    {fp}        ; (str fp, [sp, #-4]!)
  34:   e28db000    add     fp, sp, #0
  38:   e24dd00c    sub     sp, sp, #12
  3c:   e50b0008    str     r0, [fp, #-8]
  40:   e51b3008    ldr     r3, [fp, #-8]
  44:   e5d3300b    ldrb    r3, [r3, #11]
  48:   e2033007    and     r3, r3, #7
  4c:   e2833001    add     r3, r3, #1
  50:   e1a00003    mov     r0, r3
  54:   e28bd000    add     sp, fp, #0
  58:   e49db004    pop     {fp}        ; (ldr fp, [sp], #4)
  5c:   e12fff1e    bx      lr
But look what happens with optimization. The structure was sized such that it fit in registers when passed in.
00000000 <fun_one>:
   0:   e24dd010    sub     sp, sp, #16
   4:   e28d3004    add     r3, sp, #4
   8:   e8830007    stm     r3, {r0, r1, r2}
   c:   e5dd100f    ldrb    r1, [sp, #15]
  10:   e2010007    and     r0, r1, #7
  14:   e2800001    add     r0, r0, #1
  18:   e28dd010    add     sp, sp, #16
  1c:   e12fff1e    bx      lr

00000020 <fun_two>:
  20:   e5d0100b    ldrb    r1, [r0, #11]
  24:   e2010007    and     r0, r1, #7
  28:   e2800001    add     r0, r0, #1
  2c:   e12fff1e    bx      lr
Sadly, gcc did not do a very good job optimizing this one: it could have done the shift and the AND in one instruction on r3, then the add and the bx lr, three instructions, beating out the pass by reference.
You need to know the compiler and the calling convention: does it pass arguments in registers or always on the stack? If registers are used, what does it do when your arguments need more space than the reserved registers can hold: does it fill them up and then use the stack, or does it just use the stack and no registers at all? Does it pass a pointer to memory holding the argument, pass-by-reference style, but such that the value passed in is protected?
You also have to look beyond the individual functions at how much memory and register work has to happen to prepare the call to the function. Pass by reference for the structure example would be a single load or immediate to fill one register with the address of the structure. Pass by value of the structure, in the case of ARM, would be a single instruction to load the three registers with the structure, but it potentially takes three clock cycles (or 6, or 2, depending on the AMBA/AXI bus). On other processors it might cost you three instructions plus a data clock cycle for each register. So even if gcc had done a better job optimizing the pass-by-value structure example, the pass by reference could have just edged it out by a clock cycle or two, but that depends heavily on what the code in the calling function looks like. To really know, you have to test it by timing the code accurately, and disassemble to understand why it gets faster or slower as you tune it.
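As a sketch of that kind of measurement (the clock source, iteration count, and harness structure are my own choices, not part of the original answer): something like the following works on a hosted system. Build fun_one and fun_two in a separate translation unit so the compiler cannot inline them into the loops; on bare metal you would read a cycle counter instead of CLOCK_MONOTONIC.

#include <stdio.h>
#include <time.h>

unsigned int fun_one ( unsigned int a );
unsigned int fun_two ( unsigned int *a );

static long long elapsed_ns ( struct timespec t0, struct timespec t1 )
{
    return (long long)(t1.tv_sec - t0.tv_sec) * 1000000000LL
         + (t1.tv_nsec - t0.tv_nsec);
}

int main ( void )
{
    struct timespec t0, t1;
    volatile unsigned int sink = 0;   /* keeps the calls from being optimized away */
    unsigned int x;
    unsigned int i;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < 100000000u; i++) sink += fun_one(i);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("pass by value:     %lld ns\n", elapsed_ns(t0, t1));

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < 100000000u; i++) { x = i; sink += fun_two(&x); }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("pass by reference: %lld ns\n", elapsed_ns(t0, t1));

    return (int)(sink & 1u);
}

Then disassemble the result (objdump -d) to see what the compiler actually generated for each call.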