I have read some questions about returning more than one value such as What is the reason behind having only one return value in C++ and Java?, Returning multiple values from a C++ function and Why do most programming languages only support returning a single value from a function?.
I agree with most of the arguments that more than one return value is not strictly necessary, and I understand why such a feature hasn't been implemented, but I still can't understand why we can't use multiple caller-saved registers such as ECX and EDX to return such values.
Wouldn't it be faster to use the registers instead of creating a class/struct to store those values or passing arguments by reference/pointer, both of which use memory to store them? If it is possible to do such a thing, does any C/C++ compiler use this feature to speed up the code?
Ideal code would look like this:
(int, int) getTwoValues(void) { return 1, 2; }
int main(int argc, char** argv)
{
// a and b are actually returned in registers
// so future operations with a and b are faster
(int a, int b) = getTwoValues();
// do something with a and b
return 0;
}
Yes, this is sometimes done. The Wikipedia page on x86 calling conventions says, under cdecl:
There are some variations in the interpretation of cdecl, particularly in how to return values. As a result, x86 programs compiled for different operating system platforms and/or by different compilers can be incompatible, even if they both use the "cdecl" convention and do not call out to the underlying environment. Some compilers return simple data structures with a length of 2 registers or less in the register pair EAX:EDX, and larger structures and class objects requiring special treatment by the exception handler (e.g., a defined constructor, destructor, or assignment) are returned in memory. To pass "in memory", the caller allocates memory and passes a pointer to it as a hidden first parameter; the callee populates the memory and returns the pointer, popping the hidden pointer when returning.
(emphasis mine)
Ultimately, it comes down to calling convention. Your compiler is free to optimize your code to use whatever registers it wants, but when your code interacts with other code (like the operating system), it needs to follow the standard calling conventions, which typically use one register for return values.
Returning on the stack isn't necessarily slower, because once the values are in L1 cache (where the top of the stack usually is), accessing them is very fast.
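To make the "in memory" return from the quoted passage more concrete, here is a rough hand-written equivalent. The names `Big`, `makeBig` and `makeBigLowered` are made up for illustration, and the second function only approximates what a compiler does with the hidden pointer; the exact details depend on the ABI.

```cpp
// Hypothetical struct, too large for a register pair on common ABIs.
struct Big {
    long v[8];
};

// What the programmer writes: return by value.
Big makeBig() {
    Big b{};
    for (int i = 0; i < 8; ++i) b.v[i] = i * 10;
    return b;
}

// Roughly what the calling convention turns it into (an approximation):
// the caller owns the storage and passes its address as a hidden first
// argument; the callee fills it in and hands the pointer back.
Big* makeBigLowered(Big* out) {
    for (int i = 0; i < 8; ++i) out->v[i] = i * 10;
    return out;
}

// Both spellings produce the same values in caller-owned storage.
bool sameResult() {
    Big a = makeBig();      // the compiler supplies the hidden pointer
    Big b;                  // caller-allocated storage, made explicit
    makeBigLowered(&b);
    for (int i = 0; i < 8; ++i) {
        if (a.v[i] != b.v[i]) return false;
    }
    return true;
}
```

The point is only that a large return value costs one extra pointer argument and some stores into memory the caller already owns, not a separate allocation.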
However, in most computer architectures there are at least 2 registers for returning values that are twice (or more) as wide as the word size (edx:eax in x86, rdx:rax in x86_64, $v0 and $v1 in MIPS (see "Why MIPS assembler has more that one register for return value?"), R0:R3 in ARM [1], X0:X7 in ARM64, ...). The ones that don't are mostly microcontrollers with only one accumulator or a very limited number of registers.
[1] "If the type of value returned is too large to fit in r0 to r3, or whose size cannot be determined statically at compile time, then the caller must allocate space for that value at run time, and pass a pointer to that space in r0."
These registers can also be used to directly return small structs that fit in 2 registers or fewer (or more, depending on architecture and ABI).
For example, with the following code:
struct Point
{
    int x, y;
};
struct shortPoint
{
    short x, y;
};
struct Point3D
{
    int x, y, z;
};
Point P1()
{
    Point p;
    p.x = 1;
    p.y = 2;
    return p;
}
Point P2()
{
    Point p;
    p.x = 1;
    p.y = 0;
    return p;
}
shortPoint P3()
{
    shortPoint p;
    p.x = 1;
    p.y = 0;
    return p;
}
Point3D P4()
{
    Point3D p;
    p.x = 1;
    p.y = 2;
    p.z = 3;
    return p;
}
Clang emits the following instructions for x86_64, as you can see here:
P1(): # @P1()
    movabs rax, 8589934593
    ret
P2(): # @P2()
    mov eax, 1
    ret
P3(): # @P3()
    mov eax, 1
    ret
P4(): # @P4()
    movabs rax, 8589934593
    mov edx, 3
    ret
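The constant 8589934593 in P1 and P4 is just the two 32-bit fields packed into one 64-bit register: x in the low half and y in the high half, since x86_64 is little-endian. A small compile-time check of that arithmetic (`packPoint` is a helper name invented here, not something the compiler generates):

```cpp
#include <cstdint>

// Pack two 32-bit fields the way a little-endian {int x, y} struct
// lands in a single 64-bit register: x in bits 0..31, y in bits 32..63.
constexpr std::uint64_t packPoint(std::uint32_t x, std::uint32_t y) {
    return (static_cast<std::uint64_t>(y) << 32) | x;
}

// P1 and the first register of P4: x = 1, y = 2 gives 0x0000000200000001.
static_assert(packPoint(1, 2) == 8589934593ULL, "x = 1, y = 2");

// P2: x = 1, y = 0 leaves only the low half set, hence "mov eax, 1".
static_assert(packPoint(1, 0) == 1ULL, "x = 1, y = 0");
```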
For ARM64:
P1():
    mov x0, 1
    orr x0, x0, 8589934592
    ret
P2():
    mov x0, 1
    ret
P3():
    mov w0, 1
    ret
P4():
    mov x1, 1
    mov x0, 0
    sub sp, sp, #16
    bfi x0, x1, 0, 32
    mov x1, 2
    bfi x0, x1, 32, 32
    add sp, sp, 16
    mov x1, 3
    ret
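As you can see, no values are stored to or loaded from the stack; the sub/add pair in the ARM64 P4 only adjusts the stack pointer and never touches memory. You can switch to other compilers to see that the values are mainly returned in registers.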
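Tying this back to the syntax the question wished for, a small struct plus C++17 structured bindings gets fairly close. This is only a sketch: `TwoInts` and `sumOfBoth` are invented names, and whether the two values really travel in registers depends on the target ABI and optimization level, as discussed above.

```cpp
// Plain aggregate of two ints, the same shape as Point above. It is
// 8 bytes, so on x86_64 with Clang it would be expected to come back
// in rax just like P1 (an assumption based on the listings above).
struct TwoInts {
    int a, b;
};

TwoInts getTwoValues() {
    return {1, 2};
}

int sumOfBoth() {
    // C++17 structured bindings unpack the struct at the call site,
    // which reads much like "(int a, int b) = getTwoValues();".
    auto [a, b] = getTwoValues();
    return a + b;
}
```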