Assembly 8086 - Implementing any multiplication and division without MUL and DIV instruction

Tags:

I would like to know if there is a way to perform any multiplication or division without use of MUL or DIV instruction because they require a lot of CPU cycles. Can I exploit SHL or SHR instructions for this target? How can I implement the assembly code?

552

asked Jan 13 '15 12:01

Develobeer

2 Answers

Just like everything else in assembly there are many ways to do multiplication and division.

Do division by multiplying by the reciprocal value.
Use shifts and adds/subs instead of multiplication.
Use the address calculation options of lea (multiplication only).

Myth busting

because they require a lot of CPU cycles

MUL and IMUL are blazingly fast on modern CPU's, see: http://www.agner.org/optimize/instruction_tables.pdf
DIV and IDIV are and always have been exceedingly slow.

An example for Intel Skylake (page 217):

MUL, IMUL r64: Latency 3 cycles, reciprocal throughput 1 cycle.

Note that this is the maximum latency to multiply two 64 ! bit values.
The CPU can complete one of these multiplications every CPU cycle if all it's doing is multiplications.
If you consider that the above example using shifts and adds to multiply by 7 has a latency of 4 cycles (3 using lea). There is no real way to beat a plain multiply on a modern CPU.

Multiplication by the reciprocal

According to Agner Fog's asm lib instruction page 12:

Division is slow on most microprocessors. In floating point calculations, we can do multiple divisions with the same divisor faster by multiplying with the reciprocal, for example:
float a, b, d;  
a /= d; b /= d;   
can be changed to:
float a, b, d, r;   
r = 1.0f / d;   
a *= r; b *= r;   
If we want to do something similar with integers then we have to scale the reciprocal divisor by 2n and then shift n places to the right after the multiplication.

Multiplying by the reciprocal works well when you need to divide by a constant or if you divide by the same variable many times in a row.
You can find really cool assembly code demonstrating the concept in Agner Fog's assembly library.

Shifts and adds/subs
A shift right is a divide by two shr - (Reduce).
A shift left is a multiply by two shl - (Larger).
You can add and substract to correct for non-powers of two along the way.

//Multiply by 7
mov ecx,eax
shl eax,3    //*8
sub eax,ecx  //*7

Division other than by powers of 2 using this method gets complex quickly.
You may wonder why I'm doing the operations in a weird order, but I'm trying to make the dependency chain as short as possible to maximize the number of instructions that can be executed in parallel.

Using Lea
Lea is an instruction to calculate address offsets.
It can calculate multiples of 2,3,4,5,8, and 9 in a single instruction.
Like so:

                      //Latency on AMD CPUs (K10 and later, including Jaguar and Zen)
                      //On Intel all take 1 cycle.
lea eax,[eax+eax]     //*2     1 cycle      
lea eax,[eax*2+eax]   //*3     2 cycles
lea eax,[eax*4]       //*4     2 cycles   more efficient: shl eax,2 (1 cycle)
lea eax,[eax*4+eax]   //*5     2 cycles 
lea eax,[eax*8]       //*8     2 cycles   more efficient: shl eax,3 (1 cycle)
lea eax,[eax*8+eax]   //*9     2 cycles

Note however that lea with a multiplier (scale factor) is considered a 'complex' instruction on AMD CPUs from K10 to Zen and has a latency of 2 CPU cycles. On earlier AMD CPUs (k8), lea always has 2-cycle latency even with a simple [reg+reg] or [reg+disp8] addressing mode.

AMD
Agner Fog's instruction tables are wrong for AMD Zen: 3-component or scaled-index LEA is still 2 cycles on Zen (with only 2 per clock throughput instead of 4) according to InstLatx64 (http://instlatx64.atw.hu/). Also, like earlier CPUs, in 64-bit mode lea r32, [r64 + whatever] has 2 cycle latency. So it's actually faster to use lea rdx, [rax+rax] instead of lea edx, [rax+rax] on AMD CPUs, unlike Intel where truncating the result to 32 bits is free.

The *4 and *8 can be done faster using shl because a simple shift takes only a single cycle.

On the plus side, lea does not alter the flags and it allows a free move to another destination register. Because lea can only shift left by 0, 1, 2, or 3 bits (aka multiply by 1, 2, 4, or 8) these are the only breaks you get.

Intel
On Intel CPUs (Sandybridge-family), any 2-component LEA (only one +) has single-cycle latency. So lea edx, [rax + rax*4] has single-cycle latency, but lea edx, [rax + rax + 12] has 3 cycle latency (and worse throughput). An example of this tradeoff is discussed in detail in C++ code for testing the Collatz conjecture faster than hand-written assembly - why?.

answered Sep 28 '22 10:09

Johan

Things like SHL/SHR, SAL/SAR, ADD/SUB are faster than MUL and DIV, but MUL and DIV work better for dynamic numbers. For example, if you know that you just need to divide by two, then it's a single-bit shift right. But if you don't know in advance the number, then you might be tempted to repeatedly SUB the values. For example, To determine AX divided by BX, you could just constantly subtract BX from AX until BX is > AX, keeping track of the count. But if you were dividing by 200, by 1 that would mean 200 loops and SUB operations.

MUL and DIV will work better in most cases when the numbers involved aren't hard-coded and known in advance. The only exceptions I can think of is when you know it's something like a multiple/divide by 2, 4, 8, etc. where the Shift operators will work fine.

answered Sep 28 '22 08:09

SomeNYCGuy

Related questions
                            
                                Dereferencing a label in x86 assembly
                            
                                How does assembly language interact with something like the Internet?
                            
                                Are extended instruction sets (SSE, MMX) used in Linux kernel?
                            
                                gcc argument register spilling on x86-64
                            
                                Confused with CMPSB instruction
                            
                                Adding floating point/double numbers in assembly
                            
                                $zero on MIPS really hardware zero?
                            
                                Segmentation Fault in Assembly Language
                            
                                Porting Assembler x86 CPU ID code to AMD64
                            
                                why clear interrput flag cause segmentation fault in C?
                            
                                stp aarch64 instruction must be used with "non-contiguous pair of registers"
                            
                                Call C kernel from assembly bootloader
                            
                                Loading a byte from a DB 1, but printf integer show a large number
                            
                                Why is QtSPIM telling me "Label is defined for the second time"?
                            
                                How does CPU reorder instructions
                            
                                How Switch Statement Works
                            
                                Why does main() compiled by D have a 32-bit return value on a 64-bit machine?
                            
                                x64 nasm: pushing memory addresses onto the stack & call function
                            
                                Far jump in ntdll.dll's internal ZwCreateUserProcess
                            
                                How can assembly work on an Operating System?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Assembly 8086 - Implementing any multiplication and division without MUL and DIV instruction

Tags:

assembly

division

multiplication

cpu-usage

Develobeer

People also ask

2 Answers

Johan

SomeNYCGuy

Recent Activity

Donate For Us