
How to multiply a register by 37 using only 2 consecutive leal instructions in x86?

Tags:

x86

assembly

x86-64

multiplication

strength-reduction

Say %edi contains x and I want to end up with 37*x using only 2 consecutive leal instructions. How would I go about this?

For example, to get 45x you would do:

    leal (%edi, %edi, 8), %edi
    leal (%edi, %edi, 4), %eax      # to be returned

I cannot for the life of me figure out what numbers to put in place of the 8 and 4 so that the result (%eax) will be 37x.


asked Sep 29 '17 01:09

Newbie18

1 Answer

At -O3, gcc will emit (Godbolt compiler explorer):

int mul37(int a)  { return a*37; }

    leal    (%rdi,%rdi,8), %eax      # eax = a * 9
    leal    (%rdi,%rax,4), %eax      # eax = a + 4*(a*9)
    ret

That's using 37 = 9*4 + 1, not destroying the original a value with the first lea so it can use both in the 2nd.
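To make the arithmetic concrete, here is a minimal C sketch that models the two instructions step by step. The helper `lea32` and the function name `mul37_two_lea` are made up for illustration and are not part of any compiler or library; unsigned arithmetic is used so that wraparound matches what the 32-bit register would hold.

```c
#include <stdint.h>

/* Hypothetical helper modelling `lea (base,index,scale), dst`:
   dst = base + index*scale, where scale must be 1, 2, 4 or 8.
   uint32_t wraps modulo 2^32, like the 32-bit destination register. */
static uint32_t lea32(uint32_t base, uint32_t index, uint32_t scale)
{
    return base + index * scale;
}

uint32_t mul37_two_lea(uint32_t a)
{
    uint32_t t = lea32(a, a, 8);   /* t = a + 8a = 9a; a itself stays intact */
    return lea32(a, t, 4);         /* a + 4*(9a) = 37a */
}
```

The key difference from the 45x sequence in the question is that the first result goes into a separate register (`t` here, %eax in the gcc output), so the second step can still read the untouched `a`.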

You're in good company in not spotting this one, though: recent clang (3.8 and newer) will normally use 2 lea instructions instead of an imul (e.g. for *15), but it misses this one and uses:

    imull   $37, %edi, %eax
    ret

Clang does handle *21 with the same pattern gcc uses, as 5*4 + 1. (clang 3.6 and earlier always used imul unless there was a single-instruction alternative: shl or lea.)
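The two multipliers mentioned here illustrate the two basic shapes a 2-LEA sequence can take. The sketch below models both in C; `lea_model` and the function names are hypothetical, and the 15 = 5*3 split is simply one decomposition that works, not necessarily the exact sequence either compiler emits.

```c
#include <stdint.h>

/* Hypothetical model of `lea (base,index,scale), dst`; scale must be 1, 2, 4 or 8. */
static uint32_t lea_model(uint32_t base, uint32_t index, uint32_t scale)
{
    return base + index * scale;
}

/* Shape 1: keep the original value live and add it back in. 21 = 1 + 4*5 */
uint32_t mul21_keep_original(uint32_t a)
{
    uint32_t t = lea_model(a, a, 4);   /* t = 5a */
    return lea_model(a, t, 4);         /* a + 4*(5a) = 21a */
}

/* Shape 2: feed the intermediate into both operands. 15 = 5 * 3 */
uint32_t mul15_chained(uint32_t a)
{
    uint32_t t = lea_model(a, a, 4);   /* t = 5a */
    return lea_model(t, t, 2);         /* 5a + 2*(5a) = 15a */
}
```

Shape 2 only ever reaches products of two factors from {2, 3, 5, 9} (or a plain scale), which is why a number like 37 needs shape 1.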

ICC and MSVC also use imul, but they don't seem to like using 2 lea instructions, so the imul is "on purpose" there.

See the godbolt link for a variety of multipliers with gcc7.2 vs. clang5.0. It's interesting to try gcc -m32 -mtune=pentium or even pentium3 to see how many more instructions gcc was willing to use back then. Although P2/P3 have 4-cycle latency for imul r, r, i, so that's kinda crazy. Pentium has a 9-cycle imul and no out-of-order execution to hide the latency, so it makes sense to try hard to avoid it.

-mtune=silvermont should probably only be willing to replace a 32-bit imul with a single instruction, because Silvermont has a 3-cycle latency / 1-cycle throughput multiply, but decode is often the bottleneck (according to Agner Fog, http://agner.org/optimize/). You could even consider imul $64, %edi, %eax (or other powers of 2) instead of mov/shl, because imul-immediate is a copy-and-multiply in a single instruction.

Ironically, gcc misses the * 45 case, and uses imul, while clang uses 2 leas. Guess it's time to file some missed-optimization bug reports. If 2 LEAs are better than 1 IMUL, they should be used wherever possible.
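To see which multipliers are candidates for this treatment at all, a brute-force check is easy to write. The following is a rough sketch under stated assumptions: x is the only input, each LEA computes base + index*scale with scale in {1, 2, 4, 8}, the base may be omitted, and constant displacements are ignored since they do not help with multiplication. The function name `two_lea_can_multiply` is invented for this example.

```c
#include <stdbool.h>

/* Returns true if some pair of LEAs can compute m*x from x alone,
   under the assumptions described in the text above. */
bool two_lea_can_multiply(unsigned m)
{
    static const unsigned scales[] = {1, 2, 4, 8};

    for (int i = 0; i < 4; i++) {
        for (unsigned b1 = 0; b1 <= 1; b1++) {
            /* First LEA: (no base or x) + x*scale, giving k1*x. */
            unsigned k1 = b1 + scales[i];
            /* Multiples of x available to the second LEA.
               Slot 0 stands for "no base"; slots 1 and 2 are x and k1*x. */
            unsigned vals[] = {0, 1, k1};

            for (int j = 0; j < 4; j++) {
                for (int b = 0; b < 3; b++) {          /* base: none, x or k1*x */
                    for (int ix = 1; ix < 3; ix++) {   /* index: x or k1*x */
                        if (vals[b] + vals[ix] * scales[j] == m)
                            return true;
                    }
                }
            }
        }
    }
    return false;
}
```

Under this model 15, 21, 37 and 45 are all reachable, while a multiplier such as 23 is not, so an imul (or a longer sequence) is the only option there.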

Older clang (3.7 and older) uses imul unless a single lea will do the trick. I haven't looked up the changelog to see if they did benchmarks to decide to favour latency over throughput.

Related: "Using LEA on values that aren't addresses / pointers?" is the canonical answer about why LEA uses memory-operand syntax and machine encoding, even though it's a shift+add instruction (and runs on an ALU, not an AGU, in most modern microarchitectures).

answered Sep 23 '22 07:09

Peter Cordes
