There are three major calling conventions that are used with the C language on 32-bit x86 processors: STDCALL, CDECL, and FASTCALL. In addition, there is another calling convention typically used with C++: THISCALL. There are other calling conventions as well, including PASCAL and FORTRAN conventions, among others.
The default convention — shown above — is known as __cdecl. The other most popular convention is __stdcall. In it the parameters are again pushed by the caller, but the stack is cleaned up by the callee. It is the standard convention for Win32 API functions (as defined by the WINAPI macro in <windows.
In Linux, GCC sets the de facto standard for calling conventions. Since GCC version 4.5, the stack must be aligned to a 16-byte boundary when calling a function (previous versions only required a 4-byte alignment). A version of cdecl is described in System V ABI for i386 systems.
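The difference between cdecl and stdcall can only be observed on 32-bit x86, but it is easy to mark in source. A minimal sketch, assuming hypothetical macro names `CALL_CDECL` and `CALL_STDCALL` (they are not from any header) that expand to nothing on targets where the conventions do not apply:

```c
/* Hypothetical portability macros: the convention keywords only matter
   on 32-bit x86, so they expand to nothing everywhere else. */
#if defined(_MSC_VER) && defined(_M_IX86)
#  define CALL_CDECL   __cdecl
#  define CALL_STDCALL __stdcall
#elif defined(__GNUC__) && defined(__i386__)
#  define CALL_CDECL   __attribute__((cdecl))
#  define CALL_STDCALL __attribute__((stdcall))
#else
#  define CALL_CDECL
#  define CALL_STDCALL
#endif

/* cdecl: the caller removes the arguments from the stack after the call. */
int CALL_CDECL add_cdecl(int a, int b)
{
    return a + b;
}

/* stdcall: the callee removes its own arguments before returning. */
int CALL_STDCALL add_stdcall(int a, int b)
{
    return a + b;
}
```

Both functions compute the same result; only the generated prologue/epilogue and the call sites differ, and only when built for 32-bit x86.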
A calling convention is a scheme for how functions receive parameters from their caller and how they return a result. Calling conventions can differ in where parameters and return values are placed (in registers, on the call stack, or a mix of both) and in the order in which they are placed.
One of the things to keep in mind about x86 is that the register name to "reg number" encoding is not obvious; in terms of instruction encoding (the MOD R/M byte, see http://www.c-jump.com/CIS77/CPU/x86/X77_0060_mod_reg_r_m_byte.htm), register numbers 0...7 are - in that order - ?AX, ?CX, ?DX, ?BX, ?SP, ?BP, ?SI, ?DI.
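The numbering can be made concrete with a small table and a helper that packs a ModR/M byte. This is only an illustrative sketch; the names `reg32_name` and `modrm` are made up for the example.

```c
#include <stdint.h>

/* 32-bit register names indexed by their 3-bit encoding number. */
static const char *const reg32_name[8] = {
    "eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"
};

/* Pack a ModR/M byte: mod (2 bits), reg (3 bits), r/m (3 bits). */
static uint8_t modrm(unsigned mod, unsigned reg, unsigned rm)
{
    return (uint8_t)(((mod & 3u) << 6) | ((reg & 7u) << 3) | (rm & 7u));
}
```

For instance, `mov eax, ecx` uses opcode 0x89 (MOV r/m32, r32) with mod = 3 (register direct), reg = 1 (ecx) and r/m = 0 (eax), so `modrm(3, 1, 0)` yields 0xC8 and the instruction bytes are `89 C8`.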
Hence choosing A/C/D (regs 0..2) for return value and the first two arguments (which is the "classical" 32bit __fastcall convention) is a logical choice. As far as going to 64bit is concerned, the "higher" regs are ordered, and both Microsoft and UN*X/Linux went for R8 / R9 as the first ones.
Keeping that in mind, Microsoft's choice of RAX (return value) and RCX, RDX, R8, R9 (arg[0..3]) is an understandable selection if you choose four registers for arguments.
I don't know why the AMD64 UN*X ABI chose RDX before RCX.
UN*X, on RISC architectures, has traditionally done argument passing in registers - specifically, for the first six arguments (that's so on PPC, SPARC, MIPS at least). Which might be one of the major reasons why the AMD64 (UN*X) ABI designers chose to use six registers on that architecture as well.
So if you want six registers to pass arguments in, and it's logical to choose RCX, RDX, R8 and R9 for four of them, which other two should you pick?
The "higher" regs require an additional instruction prefix byte to select them and therefore have a bigger instruction size footprint, so you wouldn't want to choose any of those if you have options. Of the classical registers, due to the implicit meaning of RBP and RSP these aren't available, and RBX traditionally has a special use on UN*X (global offset table) which seemingly the AMD64 ABI designers didn't want to needlessly become incompatible with.
Ergo, the only choices were RSI / RDI.
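The elimination argument can be written out as a tiny filter. This is a sketch of the reasoning only, not of anything the ABI designers actually ran; `is_excluded` and `nth_candidate` are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

/* The sixteen 64-bit GPRs in encoding order; indices 8..15 need a REX prefix. */
static const char *const gpr64[16] = {
    "rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
    "r8",  "r9",  "r10", "r11", "r12", "r13", "r14", "r15"
};

/* Registers ruled out by the argument in the text:
   rax (return value), rcx/rdx (already chosen), rbx (GOT pointer),
   rsp/rbp (stack and frame pointers). */
static int is_excluded(const char *reg)
{
    static const char *const excluded[] = {
        "rax", "rcx", "rdx", "rbx", "rsp", "rbp"
    };
    for (size_t i = 0; i < sizeof excluded / sizeof excluded[0]; i++)
        if (strcmp(reg, excluded[i]) == 0)
            return 1;
    return 0;
}

/* Return the k-th remaining low register, or NULL if there is none.
   Only indices 0..7 are scanned, because the text wants to avoid
   adding more prefix-requiring registers. */
const char *nth_candidate(size_t k)
{
    for (size_t i = 0; i < 8; i++) {
        if (is_excluded(gpr64[i]))
            continue;
        if (k == 0)
            return gpr64[i];
        k--;
    }
    return NULL;
}
```

Under those assumptions exactly two registers survive: `nth_candidate(0)` is rsi and `nth_candidate(1)` is rdi.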
So if you have to take RSI / RDI as argument registers, which arguments should they be?
Making them arg[0] and arg[1] has some advantages. See cHao's comment.
?SI and ?DI are string instruction source / destination operands, and as cHao mentioned, their use as argument registers means that with the AMD64 UN*X calling conventions, the simplest possible strcpy() function, for example, only consists of the two CPU instructions repz movsb; ret because the source/target addresses have been put into the correct registers by the caller. There is, particularly in low-level and compiler-generated "glue" code (think, for example, some C++ heap allocators zero-filling objects on construction, or the kernel zero-filling heap pages on sbrk(), or copy-on-write pagefaults) an enormous amount of block copy/fill, hence it'll be useful for code so frequently used to save the two or three CPU instructions that'd otherwise load such source/target address arguments into the "correct" registers.
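To make the register coupling visible from C, here is a hedged sketch of a memcpy-style copy built on `rep movsb` with GCC/Clang inline assembly. The function name `copy_bytes` is made up; note that the instruction also needs the byte count in RCX, which this version passes explicitly, and that other targets fall back to `memcpy`.

```c
#include <stddef.h>
#include <string.h>

/* Copy n bytes from src to dst and return dst (memcpy-style contract). */
void *copy_bytes(void *dst, const void *src, size_t n)
{
    void *ret = dst;
#if defined(__x86_64__) && defined(__GNUC__)
    /* "D" = rdi, "S" = rsi, "c" = rcx: the registers rep movsb uses
       implicitly. They are read-write operands because the instruction
       advances the pointers and counts rcx down to zero. */
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    memcpy(dst, src, n);   /* portable fallback for other targets */
#endif
    return ret;
}
```

With the SysV convention the first two constraints are already satisfied on entry (dst in rdi, src in rsi), so a compiler only has to move the count into rcx before the string instruction.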
So in a way, UN*X and Win64 are only different in that UN*X "prepends" two additional arguments, in purposefully chosen RSI/RDI registers, to the natural choice of four arguments in RCX, RDX, R8 and R9.
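The two integer-argument sequences are short enough to tabulate. A minimal sketch with made-up helper names (`sysv_arg_reg`, `win64_arg_reg`, `same_register`):

```c
#include <stddef.h>
#include <string.h>

/* Integer/pointer argument registers, in argument order. */
static const char *const sysv_int_args[6]  = { "rdi", "rsi", "rdx", "rcx", "r8", "r9" };
static const char *const win64_int_args[4] = { "rcx", "rdx", "r8", "r9" };

/* Register for integer argument i, or NULL if it goes on the stack. */
const char *sysv_arg_reg(size_t i)
{
    return i < 6 ? sysv_int_args[i] : NULL;
}

const char *win64_arg_reg(size_t i)
{
    return i < 4 ? win64_int_args[i] : NULL;
}

/* Does SysV argument i travel in the same register as Win64 argument j? */
int same_register(size_t i, size_t j)
{
    const char *a = sysv_arg_reg(i);
    const char *b = win64_arg_reg(j);
    return a != NULL && b != NULL && strcmp(a, b) == 0;
}
```

The table shows the "prepend two" view with one wrinkle: SysV arguments 4 and 5 line up with Win64 arguments 2 and 3 (r8, r9), while rdx and rcx appear in swapped order, so SysV argument 2 matches Win64 argument 1 and SysV argument 3 matches Win64 argument 0.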
There are more differences between the UN*X and Windows x64 ABIs than just the mapping of arguments to specific registers. For the overview on Win64, check:
http://msdn.microsoft.com/en-us/library/7kcdt6fy.aspx
Win64 and AMD64 UN*X also strikingly differ in the way stackspace is used; on Win64, for example, the caller must allocate stackspace for function arguments even though args 0...3 are passed in registers. On UN*X on the other hand, a leaf function (i.e. one that doesn't call other functions) is not even required to allocate stackspace at all if it needs no more than 128 Bytes of it (yes, you own and can use a certain amount of stack without allocating it ... well, unless you're kernel code, a source of nifty bugs). All these are particular optimization choices, most of the rationale for those is explained in the full ABI references that the original poster's wikipedia reference points to.
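The stack rules above can be turned into two small helpers. This is a simplified model, assuming 8-byte argument slots and rounding the outgoing area up to 16 bytes on its own (real compilers align the whole frame); the function and constant names are invented for the example.

```c
#include <stddef.h>

enum {
    WIN64_SHADOW_BYTES  = 32,   /* home space for the four register args */
    SYSV_RED_ZONE_BYTES = 128   /* usable below rsp without adjusting it */
};

static size_t round_up16(size_t n)
{
    return (n + 15u) & ~(size_t)15u;
}

/* Bytes a Win64 caller reserves for the outgoing arguments of one call
   taking nargs integer arguments: never less than the shadow space. */
size_t win64_outgoing_area(size_t nargs)
{
    size_t bytes = (nargs < 4 ? 4 : nargs) * 8;
    if (bytes < WIN64_SHADOW_BYTES)
        bytes = WIN64_SHADOW_BYTES;
    return round_up16(bytes);
}

/* Can a SysV leaf function keep `locals` bytes in the red zone,
   i.e. without touching rsp at all? (User-space code only.) */
int sysv_leaf_fits_red_zone(size_t locals)
{
    return locals <= SYSV_RED_ZONE_BYTES;
}
```

So a Win64 call with zero arguments still reserves 32 bytes, while a SysV leaf function with up to 128 bytes of locals reserves nothing.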
IDK why Windows did what they did. See the end of this answer for a guess. I was curious about how the SysV calling convention was decided on, so I dug into the mailing list archive and found some neat stuff.
It's interesting reading some of those old threads on the AMD64 mailing list, since AMD architects were active on it. e.g. Choosing register names was one of the hard parts: AMD considered renaming the original 8 registers r0-r7, or calling the new registers UAX etc.
Also, feedback from kernel devs identified things that made the original design of syscall and swapgs unusable. That's how AMD updated the instruction to get this sorted out before releasing any actual chips. It's also interesting that in late 2000, the assumption was that Intel probably wouldn't adopt AMD64.
The SysV (Linux) calling convention, and the decision on how many registers should be callee-preserved vs. caller-save, was made initially in Nov 2000, by Jan Hubicka (a gcc developer). He compiled SPEC2000 and looked at code size and number of instructions. That discussion thread bounces around some of the same ideas as answers and comments on this SO question. In a 2nd thread, he proposed the current sequence as optimal and hopefully final, generating smaller code than some alternatives.
He's using the term "global" to mean call-preserved registers, that have to be push/popped if used.
The choice of rdi, rsi, rdx as the first three args was motivated by:
- functions that call memset or other C string functions on their args (where gcc inlines a rep string operation?)
- rbx is call-preserved because having two call-preserved regs accessible without REX prefixes (rbx and rbp) is a win. Presumably chosen because they're the only "legacy" registers that aren't implicitly used by any common instruction. (rep string, shift count, and mul/div outputs/inputs touch everything else).
- cmpxchg16b and cpuid need RBX, but are rarely used so not a big factor. (cmpxchg16b wasn't part of original AMD64, but RBX would still have been the obvious choice. cmpxchg8b exists but was obsoleted by qword cmpxchg.)
- "We are trying to avoid RCX early in the sequence, since it is register used commonly for special purposes, like EAX, so it has same purpose to be missing in the sequence. Also it can't be used for syscalls and we would like to make syscall sequence to match function call sequence as much as possible."
(background: syscall / sysret unavoidably destroy rcx (with rip) and r11 (with RFLAGS), so the kernel can't see what was originally in rcx when syscall ran.)
The kernel system-call ABI was chosen to match the function call ABI, except for r10 instead of rcx, so a libc wrapper function like mmap(2) can just mov %rcx, %r10 / mov $0x9, %eax / syscall.
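The same register mapping can be sketched from C with GCC-style inline assembly. This is an illustration, not how any particular libc implements its wrappers; `raw_syscall6` and `demo_getpid` are hypothetical names, and on anything other than x86-64 Linux the sketch simply falls back to the library `getpid()`.

```c
#include <unistd.h>

#if defined(__linux__) && defined(__x86_64__) && defined(__GNUC__)
#include <sys/syscall.h>   /* SYS_getpid */

/* Issue a system call with up to six arguments.
   Arguments 1-3 and 5-6 use the function-call registers (rdi, rsi, rdx,
   r8, r9); argument 4 goes in r10 because the syscall instruction
   overwrites rcx (and r11). */
static long raw_syscall6(long nr, long a1, long a2, long a3,
                         long a4, long a5, long a6)
{
    register long r10 __asm__("r10") = a4;
    register long r8  __asm__("r8")  = a5;
    register long r9  __asm__("r9")  = a6;
    long ret;

    __asm__ volatile("syscall"
                     : "=a"(ret)
                     : "a"(nr), "D"(a1), "S"(a2), "d"(a3),
                       "r"(r10), "r"(r8), "r"(r9)
                     : "rcx", "r11", "memory");
    return ret;
}
#endif

/* Fetch the process ID through the raw path where it is available. */
long demo_getpid(void)
{
#if defined(__linux__) && defined(__x86_64__) && defined(__GNUC__)
    return raw_syscall6(SYS_getpid, 0, 0, 0, 0, 0, 0);
#else
    return (long)getpid();   /* fallback on other platforms */
#endif
}
```

getpid takes no arguments, which keeps the example safe to run; for a six-argument call such as mmap, the fourth argument (flags) is the one that arrives in rcx and has to be moved to r10.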
Note that the SysV calling convention used by i386 Linux sucks compared to Windows' 32-bit __vectorcall. It passes everything on the stack, and only returns in edx:eax for int64, not for small structs. It's no surprise little effort was made to maintain compatibility with it. When there's no reason not to, they did things like keeping rbx call-preserved, since they decided that having another call-preserved register in the original 8 (that don't need a REX prefix) was good.
Making the ABI optimal is much more important long-term than any other consideration. I think they did a pretty good job. I'm not totally sure about returning structs packed into registers, instead of different fields in different regs. I guess code that passes them around by value without actually operating on the fields wins this way, but the extra work of unpacking seems silly. They could have had more integer return registers, more than just rdx:rax, so returning a struct with 4 members could return them in rdi, rsi, rdx, rax or something.
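For reference, these are the two struct shapes that discussion is about. The type and function names are made up; under the x86-64 SysV ABI a 16-byte struct of two integers comes back in rdx:rax, while a 32-byte struct is returned through a hidden pointer supplied by the caller.

```c
#include <stdint.h>

/* 16 bytes: fits the two integer return registers (rax = lo, rdx = hi). */
struct pair64 {
    int64_t lo;
    int64_t hi;
};

/* 32 bytes: too large for rdx:rax, so it is returned in memory. */
struct quad64 {
    int64_t a, b, c, d;
};

struct pair64 make_pair64(int64_t lo, int64_t hi)
{
    struct pair64 p = { lo, hi };
    return p;
}

struct quad64 make_quad64(int64_t a, int64_t b, int64_t c, int64_t d)
{
    struct quad64 q = { a, b, c, d };
    return q;
}
```

Compiling both functions with `-O2 -S` and comparing the output is a quick way to see the difference on a given compiler.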
They considered passing integers in vector regs, because SSE2 can operate on integers. Fortunately they didn't do that. Integers are used as pointer offsets very often, and a round-trip to stack memory is pretty cheap. Also SSE2 instructions take more code bytes than integer instructions.
I suspect Windows ABI designers might have been aiming to minimize differences between 32 and 64bit for the benefit of people that have to port asm from one to the other, or that can use a couple of #ifdefs in some ASM so the same source can more easily build a 32 or 64bit version of a function.
Minimizing changes in the toolchain seems unlikely. An x86-64 compiler needs a separate table of which register is used for what, and what the calling convention is. Having a small overlap with 32bit is unlikely to produce significant savings in toolchain code size / complexity.
Remember that Microsoft was initially "officially noncommittal toward the early AMD64 effort" (from "A History of Modern 64-bit Computing" by Matthew Kerner and Neil Padgett) because they were strong partners with Intel on the IA64 architecture. I think that this meant that even if they would have otherwise been open to working with GCC engineers on an ABI to use both on Unix and Windows, they wouldn't have done so as it would mean publicly supporting the AMD64 effort when they hadn't yet officially done so (and would have probably upset Intel).
On top of that, back in those days Microsoft had absolutely no leanings toward being friendly with open source projects. Certainly not Linux or GCC.
So why would they have cooperated on an ABI? I'd guess that the ABIs are different simply because they were designed at more or less the same time and in isolation.
Another quote from "A History of Modern 64-bit Computing":
In parallel with the Microsoft collaboration, AMD also engaged the open source community to prepare for the chip. AMD contracted with both Code Sorcery and SuSE for tool chain work (Red Hat was already engaged by Intel on the IA64 tool chain port). Russell explained that SuSE produced C and FORTRAN compilers, and Code Sorcery produced a Pascal compiler. Weber explained that the company also engaged with the Linux community to prepare a Linux port. This effort was very important: it acted as an incentive for Microsoft to continue to invest in the AMD64 Windows effort, and also ensured that Linux, which was becoming an important OS at the time, would be available once the chips were released.
Weber goes so far as to say that the Linux work was absolutely crucial to AMD64’s success, because it enabled AMD to produce an end-to-end system without the help of any other companies if necessary. This possibility ensured that AMD had a worst-case survival strategy even if other partners backed out, which in turn kept the other partners engaged for fear of being left behind themselves.
This indicates that even AMD didn't feel that cooperation was necessarily the most important thing between MS and Unix, but that having Unix/Linux support was very important. Maybe even trying to convince one or both sides to compromise or cooperate wasn't worth the effort or risk(?) of irritating either of them? Perhaps AMD thought that even suggesting a common ABI might delay or derail the more important objective of simply having software support ready when the chip was ready.
Speculation on my part, but I think the major reason the ABIs are different was the political reason that MS and the Unix/Linux sides just didn't work together on it, and AMD didn't see that as a problem.
Win32 has its own uses for ESI and EDI, and requires that they not be modified (or at least that they be restored before calling into the API). I'd imagine 64-bit code does the same with RSI and RDI, which would explain why they're not used to pass function arguments around.
I couldn't tell you why RCX and RDX are switched, though.