Why does al contain the number of vector parameters in assembly?
Why are vector parameters any different from normal parameters for the callee?
The value is used for optimization, as stated in the ABI document:
"The prologue should use %al to avoid unnecessarily saving XMM registers. This is especially important for integer only programs to prevent the initialization of the XMM unit."
3.5.7 Variable Argument Lists - The Register Save Area, System V Application Binary Interface version 1.0
A function that uses va_start saves, in its prologue, all the parameters passed in registers to the register save area:
"To start, any function that is known to use va_start is required to, at the start of the function, save all registers that may have been used to pass arguments onto the stack, into the “register save area”, for future access by va_start and va_arg. This is an obvious step, and I believe pretty standard on any platform with a register calling convention. The registers are saved as integer registers followed by floating point registers..."
https://blog.nelhage.com/2010/10/amd64-and-va_arg/
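To make the layout concrete, here is a minimal C sketch of the register save area offsets, assuming the System V x86-64 layout of six 8-byte general purpose slots followed by eight 16-byte vector slots. The constant and function names are hypothetical and are not part of any ABI header.

```c
#include <assert.h>
#include <stddef.h>

/* Sizes assumed from the System V x86-64 ABI: six integer argument
   registers (rdi, rsi, rdx, rcx, r8, r9) and eight vector argument
   registers (xmm0..xmm7). The names below are made up for illustration. */
enum {
    GP_REG_COUNT  = 6,
    GP_REG_SIZE   = 8,
    VEC_REG_COUNT = 8,
    VEC_REG_SIZE  = 16
};

/* Byte offset of the index-th integer register slot in the save area. */
size_t gp_slot_offset(unsigned index)
{
    assert(index < GP_REG_COUNT);
    return (size_t)index * GP_REG_SIZE;
}

/* Byte offset of the index-th vector register slot; the vector slots
   start right after the integer slots. */
size_t vec_slot_offset(unsigned index)
{
    assert(index < VEC_REG_COUNT);
    return (size_t)GP_REG_COUNT * GP_REG_SIZE + (size_t)index * VEC_REG_SIZE;
}

/* Total size of the register save area in bytes. */
size_t reg_save_area_size(void)
{
    return (size_t)GP_REG_COUNT * GP_REG_SIZE
         + (size_t)VEC_REG_COUNT * VEC_REG_SIZE;
}
```

These offsets agree with the ICC prologue quoted later in this answer: rsi goes to [8+rsp], xmm0 to [175-127+rsp] = [48+rsp], and xmm7 to [175-15+rsp] = [160+rsp].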
But saving all 8 vector registers could be slow, so the compiler may choose to skip some of those stores based on the value passed in al:
"... As an optimization, during a function call, %rax is required to hold the number of SSE registers used to hold arguments, to allow a varargs caller to avoid touching the FPU at all if there are no floating point arguments."
https://blog.nelhage.com/2010/10/amd64-and-va_arg/
Since you want to save at least the registers used, the value can be larger than the real number of used registers. That's why there's this line in the ABI:
"The contents of %al do not need to match exactly the number of registers, but must be an upper bound on the number of vector registers used and is in the range 0–8 inclusive."
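As an illustration of the caller's side of this contract, here is a small C sketch. The function name sum_doubles is hypothetical. The comments describe what compilers such as GCC and Clang typically emit on x86-64 System V targets; the exact instructions depend on the compiler and its options.

```c
#include <assert.h>
#include <stdarg.h>

/* Hypothetical variadic function: sums `count` double arguments. */
double sum_doubles(int count, ...)
{
    va_list args;
    double total = 0.0;

    assert(count >= 0);
    va_start(args, count);   /* makes the saved register arguments reachable */
    for (int i = 0; i < count; i++)
        total += va_arg(args, double);
    va_end(args);
    return total;
}

/*
 * Call sites, with what a typical x86-64 System V compiler does:
 *
 *   sum_doubles(2, 1.5, 2.5);
 *       1.5 -> xmm0, 2.5 -> xmm1, then something like `mov eax, 2`
 *       before the call, because two vector registers carry arguments.
 *
 *   sum_doubles(0);
 *       no vector registers are used, so al is zeroed,
 *       usually with `xor eax, eax`.
 */
```

Compiling a call like these with `-S` and looking for the write to eax just before the `call` instruction is an easy way to see the convention in practice.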
You can see the effect in the prologue generated by ICC:
sub rsp, 216 #5.1
mov QWORD PTR [8+rsp], rsi #5.1
mov QWORD PTR [16+rsp], rdx #5.1
mov QWORD PTR [24+rsp], rcx #5.1
mov QWORD PTR [32+rsp], r8 #5.1
mov QWORD PTR [40+rsp], r9 #5.1
movzx r11d, al #5.1
lea rax, QWORD PTR [r11*4] #5.1
lea r11, QWORD PTR ..___tag_value_varstrings(int, ...).6[rip] #5.1
sub r11, rax #5.1
lea rax, QWORD PTR [175+rsp] #5.1
jmp r11 #5.1
movaps XMMWORD PTR [-15+rax], xmm7 #5.1
movaps XMMWORD PTR [-31+rax], xmm6 #5.1
movaps XMMWORD PTR [-47+rax], xmm5 #5.1
movaps XMMWORD PTR [-63+rax], xmm4 #5.1
movaps XMMWORD PTR [-79+rax], xmm3 #5.1
movaps XMMWORD PTR [-95+rax], xmm2 #5.1
movaps XMMWORD PTR [-111+rax], xmm1 #5.1
movaps XMMWORD PTR [-127+rax], xmm0 #5.1
..___tag_value_varstrings(int, ...).6:
It's essentially a Duff's device. The r11 register is loaded with the address just after the XMM-saving instructions, and then al*4 is subtracted from it (since each movaps XMMWORD PTR [rax-X], xmmX is 4 bytes long) to get the address of the first movaps instruction that should run. The jump therefore executes only the last al stores, which save xmm(al-1) down to xmm0.
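The same idea can be modelled in C with a switch whose cases fall through, which is how Duff's device is usually written. This is only a sketch of the control flow, not what ICC generates: the function name save_vector_regs is hypothetical, and plain double arrays stand in for the XMM registers and the save area.

```c
#include <assert.h>

/* Model of the computed jump: `al` selects the entry point, and execution
   falls through the remaining stores, so exactly regs[al-1] .. regs[0]
   are copied. Returns the number of slots written. */
int save_vector_regs(unsigned al, const double regs[8], double save_area[8])
{
    int saved = 0;

    assert(al <= 8);   /* the ABI limits al to the range 0..8 */
    switch (al) {
    case 8: save_area[7] = regs[7]; saved++; /* fall through */
    case 7: save_area[6] = regs[6]; saved++; /* fall through */
    case 6: save_area[5] = regs[5]; saved++; /* fall through */
    case 5: save_area[4] = regs[4]; saved++; /* fall through */
    case 4: save_area[3] = regs[3]; saved++; /* fall through */
    case 3: save_area[2] = regs[2]; saved++; /* fall through */
    case 2: save_area[1] = regs[1]; saved++; /* fall through */
    case 1: save_area[0] = regs[0]; saved++; /* fall through */
    case 0: break;     /* al == 0: no vector register is touched */
    }
    return saved;
}
```

With al = 3, only the slots for the first three registers are written and the rest of the save area is left alone, which mirrors jumping to the third movaps from the end.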
As far as I can see, other compilers either save all the vector registers or don't save them at all, so they don't care about the exact value in al and just check whether it's zero.
The general purpose registers are always saved, probably because it's cheaper to just move the 6 registers to memory than to spend time on a condition check, address calculation and jump. As a result, you don't need a parameter for how many integers were passed in registers.
Here is a similar question to yours. You can find more information in the below links