I am aware that solving the following problem with inline assembly is a bad idea. I'm currently learning inline assembly as part of a class on the linux kernel, and this was part of an assignment for that class.
To begin with, below is a snippet of code that is almost correct, but instead segfaults. It is a function that copies the substring of src starting at index s_idx and ending (exclusively) at index e_idx into the pre-allocated dest, using only inline assembly.
static inline char *asm_sub_str(char *dest, char *src, int s_idx, int e_idx) {
    asm("addq %q2, %%rsi;"  /* Add start index to src (ptrs are 64-bit) */
        "subl %k2, %%ecx;"  /* Get length of substr as e - s (int is 32-bit) */
        "cld;"              /* Clear direction bit (force increment) */
        "rep movsb;"        /* Move %ecx bytes of str at %esi into str at %edi */
        : /* No outputs */
        : "S" (src), "D" (dest), "g" (s_idx), "c" (e_idx)
        : "cc", "memory"
    );
    return dest;
}
The issue with this code is the constraint for the second input parameter. When compiled with gcc's default optimization and -ggdb, the following assembly is generated:
Dump of assembler code for function asm_sub_str:
0x00000000004008e6 <+0>: push %rbp
0x00000000004008e7 <+1>: mov %rsp,%rbp
0x00000000004008ea <+4>: mov %rdi,-0x8(%rbp)
0x00000000004008ee <+8>: mov %rsi,-0x10(%rbp)
0x00000000004008f2 <+12>: mov %edx,-0x14(%rbp)
0x00000000004008f5 <+15>: mov %ecx,-0x18(%rbp)
0x00000000004008f8 <+18>: mov -0x10(%rbp),%rax
0x00000000004008fc <+22>: mov -0x8(%rbp),%rdx
0x0000000000400900 <+26>: mov -0x18(%rbp),%ecx
0x0000000000400903 <+29>: mov %rax,%rsi
0x0000000000400906 <+32>: mov %rdx,%rdi
0x0000000000400909 <+35>: add -0x14(%rbp),%rsi
0x000000000040090d <+39>: sub -0x14(%rbp),%ecx
0x0000000000400910 <+42>: cld
0x0000000000400911 <+43>: rep movsb %ds:(%rsi),%es:(%rdi)
0x0000000000400913 <+45>: mov -0x8(%rbp),%rax
0x0000000000400917 <+49>: pop %rbp
0x0000000000400918 <+50>: retq
This is identical to the assembly that is generated when the second input parameter's constraint is set to "m" instead of "g", leading me to believe the compiler is effectively choosing the "m" constraint. In stepping through these instructions with gdb, I found that the offending instruction is +35, which adds the starting offset index s_idx to the src pointer in %rsi. The problem of course is that s_idx is only 32 bits, and the upper 4 bytes of a 64-bit integer at that location on the stack are not necessarily 0. On my machine, they are in fact nonzero and cause the addition to muddle the upper 4 bytes of %rsi, which leads to a segfault at instruction +43.
Of course the solution to the above is to change the constraint of parameter 2 to "r" so it's placed in its own 64-bit register where the top 4 bytes are correctly zeroed, and call it a day. Instead, my question is: why does gcc resolve the "g" constraint as "m" instead of "r" in this case, when the expression "%q2" indicates the value of parameter 2 will be used as a 64-bit value?
I don't know much about how gcc parses inline assembly, and I know there's not really a sense of typing in assembly, but I would think that gcc could recognize the effectively implicit cast of s_idx to a long when it's used as a 64-bit value in the first inline instruction. FWIW, if I explicitly change "g" (s_idx) to "g" ((long) s_idx), gcc then resolves the "g" constraint to "r" since (long) s_idx is a temporary value. I would think gcc could do that implicitly as well?
but I would think that gcc could recognize the effectively implicit cast of s_idx to a long when it's used as a 64-bit value in the first inline instruction.
No, gcc only looks at the constraints, not the asm template string at all, when compiling the surrounding code. The part of gcc that fills in the % template operands is totally separate from register allocation and code-gen for the surrounding code.
Nothing checks for sanity or understands the context that template operands are being used in. Maybe you have a 16-bit input and want to copy it to a vector register with vmovd %k[input], %%xmm0 / vpbroadcastw %%xmm0, %%ymm0. The upper 16 bits are ignored, so you don't want gcc to waste time zero- or sign-extending it for you. But you definitely want to use vmovd instead of vpinsrw $0, %[input], %%xmm0, because that would be more uops and have a false dependency. For all gcc knows or cares, you could have used the operand in an asm comment line like "# low word of input = %h2 \n".
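A concrete sketch of that broadcast example (my own illustration, assuming AVX2 and -mavx2; the helper name broadcast_word is hypothetical):

#include <immintrin.h>

static inline __m256i broadcast_word(short x) {
    __m256i v;
    asm("vmovd %k[x], %%xmm0\n\t"    /* bits above 16 in %k[x] are garbage, and ignored */
        "vpbroadcastw %%xmm0, %[v]"  /* replicate the low word into every lane */
        : [v] "=x" (v)
        : [x] "r" (x)                /* gcc never extends x; the template just uses %k */
        : "xmm0");
    return v;
}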
GNU C inline asm is designed so that the constraints tell the compiler everything it needs to know. Thus, you need to manually cast s_idx to long.
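A minimal sketch of that fix applied to the original function (this cures the segfault, but the other bugs discussed below remain):

static inline char *asm_sub_str(char *dest, char *src, int s_idx, int e_idx) {
    asm("addq %q2, %%rsi;"  /* operand 2 is now a full 64-bit value: no high garbage */
        "subl %k2, %%ecx;"
        "cld;"
        "rep movsb;"
        : /* No outputs */
        : "S" (src), "D" (dest), "g" ((long) s_idx), "c" (e_idx)
        : "cc", "memory"
    );
    return dest;
}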
You don't need to cast the input for ECX, because the sub instruction will zero-extend the result implicitly (into RCX). Your inputs are signed types, but presumably you are expecting the difference to always be positive.
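That is, the 32-bit operation itself does the extension:

    subl %edx, %ecx    # any write to ECX zeroes the upper 32 bits of RCX
    rep movsb          # so RCX is a valid 64-bit count (assuming e_idx - s_idx >= 0)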
Register inputs must always be assumed to have high garbage beyond the width of the input type. This is similar to how function args in the x86-64 System V calling convention can have garbage in the upper 32 bits, but (I assume) with no unwritten rule about extending out to 32 bits. (And note that after function inlining, your asm statement's inputs might not be function args. You don't want to use __attribute__((noinline)), and as I said it wouldn't help anyway.)
leading me to believe the compiler is effectively choosing the "m" constraint.
Yes, gcc -O0 spills everything to memory between every C statement (so you can change it with a debugger if stopped at a breakpoint). Thus, a memory operand is the most efficient choice for the compiler: it would need a load instruction to get the value back into a register. i.e. the value is in memory before the asm statement, at -O0.
(clang is bad at multiple-option constraints and picks memory even at -O3, even when that means spilling first, but gcc doesn't have that problem.)
gcc -O0 (and clang) will use an immediate for a g constraint when the input is a numeric literal constant, e.g. "g" (1234). In your case, you get:
...
addq $1234, %rsi;
subl $1234, %ecx;
rep movsb
...
An input like "g" ((long)s_idx) will use a register even at -O0, just like x+y or any other temporary result (as long as s_idx isn't already a long). Interestingly, even a cast to (unsigned) resulted in a register operand, even though int and unsigned are the same size and the cast takes no instructions. At this point you're seeing exactly how little gcc -O0 optimizes: what you get is more dependent on how gcc internals are designed than on what makes sense or is efficient.
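To summarize the -O0 behavior described above (illustrative, covering just the cases discussed):

    "g" (s_idx)             /* memory: the value is already spilled to the stack */
    "g" ((long) s_idx)      /* register: the cast produces a temporary */
    "g" ((unsigned) s_idx)  /* register, even though the cast takes no instructions */
    "g" (1234)              /* immediate */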
Compile with optimization enabled if you want to see interesting asm. See How to remove "noise" from GCC/clang assembly output?, especially the link to Matt Godbolt's CppCon2017 talk about looking at compiler output.
Although checking the asm with optimization disabled is good, too, for inline asm; you might not have realized the problem with using a q override if it was just registers, although it is still a problem. Checking how it inlines into a few different callers at -O3 can be useful, too (especially if you test with some compile-time-constant inputs).
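For example, a couple of hypothetical test callers to inspect at -O3 (the names and buffer are mine):

char buf[64];

char *call_with_constants(char *src) {
    return asm_sub_str(buf, src, 1, 33);   /* compile-time-constant indices */
}

char *call_with_vars(char *dest, char *src, int s, int e) {
    return asm_sub_str(dest, src, s, e);
}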
Besides the high-garbage problems discussed above, you modify input-operand registers without telling the compiler about it.
Fixing this by making some of them "+" read/write outputs means your asm statement is no longer volatile by default, so the compiler will optimize it away if the outputs are unused. (This includes after function inlining, so the return dest is sufficient for the standalone version, but not after inlining if the caller ignores the return value.)
You did use a "memory" clobber, so the compiler will assume that you read/write memory. You could instead tell it which memory you read and write, so it can optimize around your copy more efficiently. See "get string length in inline GNU Assembler": you can use dummy memory input/output constraints like "m" (*(const char (*)[]) src).
char *asm_sub_str_fancyconstraints(char *dest, char *src, int s_idx, int e_idx) {
    asm (
        "addq %[s_idx], %%rsi; \n\t"  /* Add start index to src (ptrs are 64-bit) */
        "subl %k[s_idx], %%ecx; \n\t" /* Get length of substr as e - s (int is 32-bit) */
        // the calling convention requires DF=0, and inline-asm can safely assume it, too
        // (it's widely done, including in the Linux kernel)
        //"cld;" /* Clear direction bit (force increment) */
        "rep movsb; \n\t"             /* Move %ecx bytes of str at %esi into str at %edi */
        : [src] "+&S" (src), [dest] "+D" (dest), [e_idx] "+c" (e_idx)
          , "=m" (*(char (*)[]) dest)        // dummy output: all of dest
        : [s_idx] "g" ((long long) s_idx)
          , "m" (*(const char (*)[]) src)    // dummy input: tell the compiler we read all of src[0..infinity]
        : "cc"
    );
    return 0;  // asm statement not optimized away, even without volatile,
               // because of the memory output.
               // Just like dest++; could optimize away, but *dest = 0; couldn't.
}
Formatting: note the use of \n\t at the end of each line for readability; otherwise the asm instructions are all on one line, separated only by ;. (It will assemble fine, but not be very human-readable if you're checking how your asm template worked out.)
This compiles (with gcc -O3) to:
asm_sub_str_fancyconstraints:
movslq %edx, %rdx # from the (long long)s_idx
xorl %eax, %eax # from the return 0, which I changed to test that it doesn't optimize away
addq %rdx, %rsi;
subl %edx, %ecx; # your code zero-extends (e_idx - s_idx)
rep movsb;
ret
I put this + a couple other versions on the Godbolt compiler explorer with gcc + clang. A simpler version fixes the bugs but still uses a "memory" clobber + asm volatile to get correctness, with more compile-time optimization cost than this version that tells the compiler which memory is read and written.
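Such a simpler version might look like this (a sketch in the spirit of the Godbolt link, not a verbatim copy; the name asm_sub_str_simple is mine):

char *asm_sub_str_simple(char *dest, char *src, int s_idx, int e_idx) {
    char *d = dest;  /* rep movsb advances RDI, so keep the original pointer */
    asm volatile (
        "addq %[s_idx], %%rsi; \n\t"
        "subl %k[s_idx], %%ecx; \n\t"
        "rep movsb; \n\t"
        : "+&S" (src), "+D" (d), "+c" (e_idx)
        : [s_idx] "r" ((long) s_idx)
        : "cc", "memory");  /* blunt: tells the compiler we may touch any memory */
    return dest;
}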
Early clobber: note the "+&S" constraint.
If for some weird reason the compiler knew that the src address and s_idx were equal, it could use the same register (esi/rsi) for both inputs. This would lead to modifying s_idx before it was used in the sub. Declaring that the register holding src is clobbered early (before all input registers are read for the last time) will force the compiler to choose different registers.
See the Godbolt link above for a caller that causes breakage without the & for early-clobber (but only with the nonsensical src = (char*)s_idx;). Early-clobber declarations are often necessary for multi-instruction asm statements to prevent more realistic breakage possibilities, so definitely keep this in mind, and only leave it out when you're sure it's OK for any read-only input to share a register with an output or input/output operand. (Of course, using specific-register constraints limits that possibility.)
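A hypothetical caller along those lines (my own illustration of the src = (char*)s_idx; case mentioned above):

char *breakage_demo(char *dest, int s_idx) {
    char *src = (char *) (long) s_idx;  /* nonsensical, but lets gcc prove src == s_idx */
    /* without the &, gcc could pick RSI for the s_idx input too, so the addq
       would read a register the asm statement has already modified */
    return asm_sub_str_fancyconstraints(dest, src, s_idx, 100);
}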
I omitted the early-clobber declaration from e_idx in ecx, because the only "free" parameter is s_idx, and putting them both in the same register would result in sub same,same, with rep movsb running 0 iterations, as desired.
It would of course be more efficient to let the compiler do the math, and simply ask for the inputs to rep movsb in the right registers. Especially if both e_idx and s_idx are compile-time constants, it's silly to force the compiler to mov an immediate to a register and then subtract another immediate.
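A sketch of that approach, reusing the dummy-operand style from above (assuming e_idx >= s_idx; the name is mine):

char *asm_sub_str_compiler_math(char *dest, char *src, int s_idx, int e_idx) {
    const char *s = src + s_idx;            /* compiler does the pointer math */
    char *d = dest;
    size_t len = (size_t) (e_idx - s_idx);  /* ...and the length calculation */
    asm ("rep movsb"
        : "+D" (d), "+S" (s), "+c" (len),
          "=m" (*(char (*)[]) dest)         /* dummy output keeps the asm from being optimized away */
        : "m" (*(const char (*)[]) s));     /* dummy input: the bytes we read */
    return dest;
}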
Or even better, don't use inline asm at all. (But if you really want rep movsb to test its performance, inline asm is one way to do it. gcc also has tuning options that control how memcpy inlines, if at all.)
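The no-asm version is just a memcpy, and makes the obvious baseline to compare against:

#include <string.h>

static inline char *sub_str(char *dest, const char *src, int s_idx, int e_idx) {
    memcpy(dest, src + s_idx, (size_t) (e_idx - s_idx));
    return dest;
}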
No inline asm answer is complete without recommending that you don't use inline asm (https://gcc.gnu.org/wiki/DontUseInlineAsm) if you can avoid it.