int read_val();
long read_and_process(int n) {
long vals[n];
for (int i = 0; i < n; i++)
vals[i] = read_val();
return vals[n-1];
}
The assembly language code compiled by x86-64 GCC 5.4 is:
read_and_process(int):
pushq %rbp
movslq %edi, %rax
>>> leaq 22(,%rax,8), %rax
movq %rsp, %rbp
pushq %r14
pushq %r13
pushq %r12
pushq %rbx
andq $-16, %rax
leal -1(%rdi), %r13d
subq %rax, %rsp
testl %edi, %edi
movq %rsp, %r14
jle .L3
leal -1(%rdi), %eax
movq %rsp, %rbx
leaq 8(%rsp,%rax,8), %r12
movq %rax, %r13
.L4:
call read_val()
cltq
addq $8, %rbx
movq %rax, -8(%rbx)
cmpq %r12, %rbx
jne .L4
.L3:
movslq %r13d, %r13
movq (%r14,%r13,8), %rax
leaq -32(%rbp), %rsp
popq %rbx
popq %r12
popq %r13
popq %r14
popq %rbp
ret
Why is there a need to calculate 8*%rax+22 and then AND with -16, since there could be 8*%rax+16, which gives the same result and looks more natural?
Other assembly language code compiled by x86-64 GCC 11.2 looks almost the same, with the number 22 being replaced by 15. So is the number determined just by random, or because of some reasons?
VLA works by placing the array in the stack - stackoverflow.com/questions/2034712/variable-length-arrays. That is also what I see when checking the assembly output generated by gcc when using a VLA, no call to malloc . But it might depend on the actual implementation. This is an open source project.
How does it work? The VLA is an interferometer; this means that it operates by multiplying the data from each pair of telescopes together to form interference patterns.
In computer programming, a variable-length array (VLA), also called variable-sized or runtime-sized, is an array data structure whose length is determined at run time (instead of at compile time). In C, the VLA is said to have a variably modified type that depends on a value (see Dependent type).
The biggest problem is that one can not even check for failure as they could with the slightly more verbose malloc'd memory. Assumptions in the size of an array could be broken two years after writing perfectly legal C using VLAs, leading to possibly very difficult to find issues in the code.
Summary: The number is not random, it's part of a calculation that ensures proper stack alignment. The number should be 15, and the 22 is the result of a minor bug in old versions of GCC.
Recall that the x86-64 SysV ABI mandates 16-byte stack alignment; the stack pointer must be a multiple of 16 prior to any call
instruction. Hence when we enter read_and_process
, the stack pointer is 8 less than a multiple of 16, because the call
that got us here pushed 8 bytes. So prior to calling read_val()
, the stack pointer must be decremented by 8 more than a multiple of 16, i.e. an odd multiple of 8. The prologue pushes an odd number of registers (five, namely rbp, r14, r13, r12, rbx
) at 8 bytes each. So the remaining stack adjustment must be a multiple of 16.
So whatever amount of memory is to be allocated for the array vals
, it must be rounded up to a multiple of 16. A standard way to do that is to add 15, then AND with -16: adjusted = (orig + 15) & -16
.
Why does that work? -16
, thanks to two's complement arithmetic, has the low 4 bits clear and the others set, so AND with -16
results in a multiple of 16 - but since the AND clears low-order bits, the result of x & -16
is less than x
; this is rounding down. If we add 15 first (which is, of course, 1 less than 16), the net effect is to round up instead. Adding 15 to orig
will cause it to pass a multiple of 16, and then & -16
will round down to that multiple of 16. Unless orig
was already a multiple of 16, in which case orig+15
rounds down back to orig
itself. So this does the right thing in all cases.
That's what GCC does from 8.1.0 onward. Adding 15 is baked into the same lea
that multiplies n
by 8, and the AND with -16
comes several lines later.
In this case, since orig = 8*n
is already a multiple of 8, there are other values besides 15 that would work just as well; 8, for instance (though not 16, see below). But using 15 is completely equivalent mathematically and in terms of code size and speed, and since 15 works regardless of previous alignment, the compiler authors can just use 15 unconditionally without writing extra code to keep track of what alignment orig
may already have.
But adding 22 instead, as older GCC does, is clearly wrong. If orig
was already a multiple of 16, say orig = 32
, then orig+22
is 54, which rounds down to 48. But 32 bytes was already a perfectly good size so we just wasted 16 bytes for no reason. (Here orig
is 8*n
so this would occur if the input n
is even.) For similar reasons, your suggestion of using 16 instead of 22 would also be wrong.
So the 22 is a bug. It's a pretty minor bug; the resulting code still works just fine and complies with the ABI, and the only ill effect is that sometimes a little bit of stack space is wasted. But it was fixed for GCC 8.1.0 by a commit entitled "Improve alloca alignment". (alloca
is an old non-standard function that does dynamic stack allocation, and compiler writers often use the term to refer to any stack allocation.)
Apparently the issue was that some previous pass of the compiler had determined that the size needed to be aligned to (at least) 8 bytes, which would have been accomplished by adding 7 and ANDing with -8 (which might later be optimized away when the compiler realized later that n*8
is already aligned to 8 bytes). Now this constraint should be made redundant when the compiler realizes that 16-byte alignment is actually required, as every multiple of 16 is already a multiple of 8. But the compiler erroneously adds the offsets 7 and 15, when the right thing to do was to take their maximum (which is what the commit implemented). And 7 + 15 is... 22.
If you compile the code using GCC 5.4 with optimizations off, you can see these two operations happening separately:
lea rdx, [rax+7] ; add 7 to rax and write to rdx
mov eax, 16
sub rax, 1 ; now rax = 15
add rax, rdx ; add 15 to rdx
and with optimizations on, the optimizer combines these into a single add of 22 - without noticing that the add of 7 should not have been there to begin with. In newer versions of GCC with -O0
, the lea rdx, [rax+7]
is gone.
why there need to calculate 8*%rax+22 and then AND with -16, since there could be 8*%rax+16, which gives the same result and looks more naturely.
It does not give the same result. The expression ( ( rax*8 + 22 ) % -16 )
aligns output by 16 bytes.
On 64-bit CPUs, -16 is equivalent to 0xFFFFFFFFFFFFFFF0
When written that way, it’s obvious what the AND instruction is doing: it strips the four least significant bits from the value, and this makes the result aligned by 16 bytes, rounding down. The ( ( rax*8 + 15 ) % -16 )
expression results in the alignment by 16 bytes, rounding up. But the compiler wants eight more bytes of the alignment, because it pushed five values to the stack with five push
instructions, and each one is eight bytes.
Your next question is probably going to be “why align by 16 bytes when alignof(long)=8?”
The answer is the preferred-stack-boundary
compiler option. The option defaults to 4 in GCC, which means the compiler aligns stack frames by 2^4 = 16 bytes.
Try to compile the same code with -mpreferred-stack-boundary=3
(which, BTW, is the minimum allowed value for AMD64. It requires the alignment to be at least one pointer in size) and see what happens to the assembly.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With