I am aware that solving the following problem with inline assembly is a bad idea. I'm currently learning inline assembly as part of a class on the linux kernel, and this was part of an assignment for that class.
To begin with, below is a snippet of code that is almost correct, but instead segfaults. It is a function that copies the substring of src starting at index s_idx and ending (exclusively) at index e_idx into the pre-allocated dest, using only inline assembly.
static inline char *asm_sub_str(char *dest, char *src, int s_idx, int e_idx) {
    asm("addq %q2, %%rsi;"  /* Add start index to src (ptrs are 64-bit) */
        "subl %k2, %%ecx;"  /* Get length of substr as e - s (int is 32-bit) */
        "cld;"              /* Clear direction bit (force increment) */
        "rep movsb;"        /* Move %ecx bytes of str at %esi into str at %edi */
        : /* No outputs */
        : "S" (src), "D" (dest), "g" (s_idx), "c" (e_idx)
        : "cc", "memory"
    );
    return dest;
}
The issue with this code is the constraint for the second input parameter. When compiled with gcc's default optimization and -ggdb, the following assembly is generated:
Dump of assembler code for function asm_sub_str:
0x00000000004008e6 <+0>: push %rbp
0x00000000004008e7 <+1>: mov %rsp,%rbp
0x00000000004008ea <+4>: mov %rdi,-0x8(%rbp)
0x00000000004008ee <+8>: mov %rsi,-0x10(%rbp)
0x00000000004008f2 <+12>: mov %edx,-0x14(%rbp)
0x00000000004008f5 <+15>: mov %ecx,-0x18(%rbp)
0x00000000004008f8 <+18>: mov -0x10(%rbp),%rax
0x00000000004008fc <+22>: mov -0x8(%rbp),%rdx
0x0000000000400900 <+26>: mov -0x18(%rbp),%ecx
0x0000000000400903 <+29>: mov %rax,%rsi
0x0000000000400906 <+32>: mov %rdx,%rdi
0x0000000000400909 <+35>: add -0x14(%rbp),%rsi
0x000000000040090d <+39>: sub -0x14(%rbp),%ecx
0x0000000000400910 <+42>: cld
0x0000000000400911 <+43>: rep movsb %ds:(%rsi),%es:(%rdi)
0x0000000000400913 <+45>: mov -0x8(%rbp),%rax
0x0000000000400917 <+49>: pop %rbp
0x0000000000400918 <+50>: retq
This is identical to the assembly that is generated when the second input parameter's constraint is set to "m" instead of "g", leading me to believe the compiler is effectively choosing the "m" constraint. In stepping through these instructions with gdb, I found that the offending instruction is +35, which adds the starting offset index s_idx to the src pointer in %rsi. The problem of course is that s_idx is only 32 bits, and the upper 4 bytes of a 64-bit integer at that location on the stack are not necessarily 0. On my machine, they are in fact nonzero and cause the addition to muddle the upper 4 bytes of %rsi, which leads to a segfault at instruction +43.
Of course the solution to the above is to change the constraint of parameter 2 to "r" so it's placed in its own 64-bit register where the top 4 bytes are correctly zeroed, and call it a day. Instead, my question is: why does gcc resolve the "g" constraint as "m" instead of "r" in this case, when the expression "%q2" indicates the value of parameter 2 will be used as a 64-bit value?
I don't know much about how gcc parses inline assembly, and I know there's not really a sense of typing in assembly, but I would think that gcc could recognize the effectively implicit cast of s_idx to a long when it's used as a 64-bit value in the first inline instruction. FWIW, if I explicitly change "g" (s_idx) to "g" ((long) s_idx), gcc then resolves the "g" constraint to "r" since (long) s_idx is a temporary value. I would think gcc could do that implicitly as well?
but I would think that gcc could recognize the effectively implicit cast of s_idx to a long when it's used as a 64-bit value in the first inline instruction.
No, gcc only looks at the constraints, not the asm template string at all, when compiling the surrounding code. The part of gcc that fills in the % template operands is totally separate from register allocation and code-gen for the surrounding code.
Nothing checks for sanity or understands the context that template operands are being used in. Maybe you have a 16-bit input and want to copy it to a vector register with vmovd %k[input], %%xmm0 / vpbroadcastw %%xmm0, %%ymm0. The upper 16 bits are ignored, so you don't want gcc to waste time zero- or sign-extending it for you. But you definitely want to use vmovd instead of vpinsrw $0, %[input], %%xmm0, because that would be more uops and have a false dependency. For all gcc knows or cares, you could have used the operand in an asm comment line like "# low word of input = %h2 \n".
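A concrete sketch of that broadcast example (my own illustration, assuming AVX2 and -mavx2; the helper name broadcast_word is hypothetical):

#include <immintrin.h>

static inline __m256i broadcast_word(short x) {
    __m256i v;
    asm("vmovd %k[x], %%xmm0\n\t"    /* bits above 16 in %k[x] are garbage, and ignored */
        "vpbroadcastw %%xmm0, %[v]"  /* replicate the low word into every lane */
        : [v] "=x" (v)
        : [x] "r" (x)                /* gcc never extends x; the template just uses %k */
        : "xmm0");
    return v;
}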
GNU C inline asm is designed so that the constraints tell the compiler everything it needs to know. Thus, you need to manually cast s_idx to long.
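A minimal sketch of that fix applied to the original function (this cures the segfault, but the other bugs discussed below remain):

static inline char *asm_sub_str(char *dest, char *src, int s_idx, int e_idx) {
    asm("addq %q2, %%rsi;"  /* operand 2 is now a full 64-bit value: no high garbage */
        "subl %k2, %%ecx;"
        "cld;"
        "rep movsb;"
        : /* No outputs */
        : "S" (src), "D" (dest), "g" ((long) s_idx), "c" (e_idx)
        : "cc", "memory"
    );
    return dest;
}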
You don't need to cast the input for ECX, because the sub instruction will zero-extend the result implicitly (into RCX). Your inputs are signed types, but presumably you are expecting the difference to always be positive.
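That is, the 32-bit operation itself does the extension:

    subl %edx, %ecx    # any write to ECX zeroes the upper 32 bits of RCX
    rep movsb          # so RCX is a valid 64-bit count (assuming e_idx - s_idx >= 0)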
Register inputs must always be assumed to have high garbage beyond the width of the input type. This is similar to how function args in the x86-64 System V calling convention can have garbage in the upper 32 bits, but (I assume) with no unwritten rule about extending out to 32 bits. (And note that after function inlining, your asm statement's inputs might not be function args. You don't want to use __attribute__((noinline)), and as I said it wouldn't help anyway.)
leading me to believe the compiler is effectively choosing the "m" constraint.
Yes, gcc -O0 spills everything to memory between every C statement (so you can change it with a debugger if stopped at a breakpoint). Thus, a memory operand is the most efficient choice for the compiler: it would need a load instruction to get the value back into a register. i.e. the value is in memory before the asm statement, at -O0.
(clang is bad at multiple-option constraints and picks memory even at -O3, even when that means spilling first, but gcc doesn't have that problem.)
gcc -O0 (and clang) will use an immediate for a g constraint when the input is a numeric literal constant, e.g. "g" (1234). In your case, you get:
...
addq $1234, %rsi;
subl $1234, %ecx;
rep movsb
...
An input like "g" ((long)s_idx) will use a register even at -O0, just like x+y or any other temporary result (as long as s_idx isn't already a long). Interestingly, even a cast to (unsigned) resulted in a register operand, even though int and unsigned are the same size and the cast takes no instructions. At this point you're seeing exactly how little gcc -O0 optimizes: what you get is more dependent on how gcc internals are designed than on what makes sense or is efficient.
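To summarize the -O0 behavior described above (illustrative, covering just the cases discussed):

    "g" (s_idx)             /* memory: the value is already spilled to the stack */
    "g" ((long) s_idx)      /* register: the cast produces a temporary */
    "g" ((unsigned) s_idx)  /* register, even though the cast takes no instructions */
    "g" (1234)              /* immediate */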
Compile with optimization enabled if you want to see interesting asm. See How to remove "noise" from GCC/clang assembly output?, especially the link to Matt Godbolt's CppCon2017 talk about looking at compiler output.
Although checking the asm with optimization disabled is good, too, for inline asm; you might not have realized the problem with using a q override if it was just registers, although it is still a problem. Checking how it inlines into a few different callers at -O3 can be useful, too (especially if you test with some compile-time-constant inputs).
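For example, a couple of hypothetical test callers to inspect at -O3 (the names and buffer are mine):

char buf[64];

char *call_with_constants(char *src) {
    return asm_sub_str(buf, src, 1, 33);   /* compile-time-constant indices */
}

char *call_with_vars(char *dest, char *src, int s, int e) {
    return asm_sub_str(dest, src, s, e);
}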
Besides the high-garbage problems discussed above, you modify input-operand registers without telling the compiler about it.
Fixing this by making some of them "+" read/write outputs means your asm statement is no longer volatile by default, so the compiler will optimize it away if the outputs are unused. (This includes after function inlining, so the return dest is sufficient for the standalone version, but not after inlining if the caller ignores the return value.)
You did use a "memory" clobber, so the compiler will assume that you read/write memory. You could instead tell it which memory you read and write, so it can optimize around your copy more efficiently. See "get string length in inline GNU Assembler": you can use dummy memory input/output constraints like "m" (*(const char (*)[]) src).
char *asm_sub_str_fancyconstraints(char *dest, char *src, int s_idx, int e_idx) {
    asm (
        "addq %[s_idx], %%rsi; \n\t"  /* Add start index to src (ptrs are 64-bit) */
        "subl %k[s_idx], %%ecx; \n\t" /* Get length of substr as e - s (int is 32-bit) */
        // the calling convention requires DF=0, and inline-asm can safely assume it, too
        // (it's widely done, including in the Linux kernel)
        //"cld;" /* Clear direction bit (force increment) */
        "rep movsb; \n\t"             /* Move %ecx bytes of str at %esi into str at %edi */
        : [src] "+&S" (src), [dest] "+D" (dest), [e_idx] "+c" (e_idx)
          , "=m" (*(char (*)[]) dest)        // dummy output: all of dest
        : [s_idx] "g" ((long long) s_idx)
          , "m" (*(const char (*)[]) src)    // dummy input: tell the compiler we read all of src[0..infinity]
        : "cc"
    );
    return 0;  // asm statement not optimized away, even without volatile,
               // because of the memory output.
               // Just like dest++; could optimize away, but *dest = 0; couldn't.
}
Formatting: note the use of \n\t at the end of each line for readability; otherwise the asm instructions are all on one line, separated only by ;. (It will assemble fine, but not be very human-readable if you're checking how your asm template worked out.)
This compiles (with gcc -O3) to:
asm_sub_str_fancyconstraints:
movslq %edx, %rdx # from the (long long)s_idx
xorl %eax, %eax # from the return 0, which I changed to test that it doesn't optimize away
addq %rdx, %rsi;
subl %edx, %ecx; # your code zero-extends (e_idx - s_idx)
rep movsb;
ret
I put this + a couple other versions on the Godbolt compiler explorer with gcc + clang. A simpler version fixes the bugs but still uses a "memory" clobber + asm volatile to get correctness, with more compile-time optimization cost than this version that tells the compiler which memory is read and written.
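Such a simpler version might look like this (a sketch in the spirit of the Godbolt link, not a verbatim copy; the name asm_sub_str_simple is mine):

char *asm_sub_str_simple(char *dest, char *src, int s_idx, int e_idx) {
    char *d = dest;  /* rep movsb advances RDI, so keep the original pointer */
    asm volatile (
        "addq %[s_idx], %%rsi; \n\t"
        "subl %k[s_idx], %%ecx; \n\t"
        "rep movsb; \n\t"
        : "+&S" (src), "+D" (d), "+c" (e_idx)
        : [s_idx] "r" ((long) s_idx)
        : "cc", "memory");  /* blunt: tells the compiler we may touch any memory */
    return dest;
}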
Early clobber: note the "+&S" constraint.
If for some weird reason the compiler knew that the src address and s_idx were equal, it could use the same register (esi/rsi) for both inputs. This would lead to modifying s_idx before it was used in the sub. Declaring that the register holding src is clobbered early (before all input registers are read for the last time) will force the compiler to choose different registers.
See the Godbolt link above for a caller that causes breakage without the & for early-clobber (but only with the nonsensical src = (char*)s_idx;). Early-clobber declarations are often necessary for multi-instruction asm statements to prevent more realistic breakage possibilities, so definitely keep this in mind, and only leave it out when you're sure it's OK for any read-only input to share a register with an output or input/output operand. (Of course, using specific-register constraints limits that possibility.)
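A hypothetical caller along those lines (my own illustration of the src = (char*)s_idx; case mentioned above):

char *breakage_demo(char *dest, int s_idx) {
    char *src = (char *) (long) s_idx;  /* nonsensical, but lets gcc prove src == s_idx */
    /* without the &, gcc could pick RSI for the s_idx input too, so the addq
       would read a register the asm statement has already modified */
    return asm_sub_str_fancyconstraints(dest, src, s_idx, 100);
}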
I omitted the early-clobber declaration from e_idx in ecx, because the only "free" parameter is s_idx, and putting them both in the same register would result in sub same,same, with rep movsb running 0 iterations, as desired.
It would of course be more efficient to let the compiler do the math, and simply ask for the inputs to rep movsb in the right registers. Especially if both e_idx and s_idx are compile-time constants, it's silly to force the compiler to mov an immediate to a register and then subtract another immediate.
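A sketch of that approach, reusing the dummy-operand style from above (assuming e_idx >= s_idx; the name is mine):

char *asm_sub_str_compiler_math(char *dest, char *src, int s_idx, int e_idx) {
    const char *s = src + s_idx;            /* compiler does the pointer math */
    char *d = dest;
    size_t len = (size_t) (e_idx - s_idx);  /* ...and the length calculation */
    asm ("rep movsb"
        : "+D" (d), "+S" (s), "+c" (len),
          "=m" (*(char (*)[]) dest)         /* dummy output keeps the asm from being optimized away */
        : "m" (*(const char (*)[]) s));     /* dummy input: the bytes we read */
    return dest;
}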
Or even better, don't use inline asm at all. (But if you really want rep movsb to test its performance, inline asm is one way to do it. gcc also has tuning options that control how memcpy inlines, if at all.)
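The no-asm version is just a memcpy, and makes the obvious baseline to compare against:

#include <string.h>

static inline char *sub_str(char *dest, const char *src, int s_idx, int e_idx) {
    memcpy(dest, src + s_idx, (size_t) (e_idx - s_idx));
    return dest;
}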
No inline asm answer is complete without recommending that you don't use inline asm (https://gcc.gnu.org/wiki/DontUseInlineAsm) if you can avoid it.