Focussing on worst case cycle count, I've coded integer multiplication routines for Atmel's AVR architecture.
In one particular implementation, I'm stuck with 2+1 worst cases, for each of which I seek a faster implementation. These multiply multiplicands with an even number of bytes with known values of an 8-bit part of the multiplier:
* 11001011 (203 decimal)
* 10101011 (171 decimal)
* 10101101 (173 decimal)
GCC (4.8.1) computes these as *29*7, *19*9, and *(43*4+1) - a nice fit for a 3-address machine, which the tinyAVR isn't (quite: most have register pair move twice as fast as add). For a two byte multiplicand & product, this uses 9+2, 10+2, and 11+2 additions(&subtractions) and moves, respectively, for 20, 22, and 24 cycles. Radix-4 Booth would use 11+1 additions (under not exactly comparable conditions) and 23 cycles.
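Those factorizations check out; as a quick sanity check, here is a throwaway Python sketch (the particular shift-and-add expansion of 29*7 is my illustration, not necessarily what GCC emits):

```python
def times203(x):
    # 203 = 29*7; one plausible shift-and-add expansion of each factor
    y = (x << 5) - (x << 1) - x   # 29*x = 32*x - 2*x - x
    return (y << 3) - y           # 203*x = 8*(29*x) - 29*x

assert 29 * 7 == 203 and 19 * 9 == 171 and 43 * 4 + 1 == 173
assert times203(5) == 203 * 5
```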
For reasons beyond this question, I have 16*multiplicand precomputed (a5:a4, 7 cycles including move); both original and shifted multiplicand might be used later on (but for the MSByte). And the product is initialised to the multiplicand for the following assembler code snippets (in which I use a Booth-style recoding notation: . for NOP, +, and -. owing is a label "one instruction before done", performing a one-cycle fix-up):
locb:; .-..+.++ ..--.-.- ++..++.- ++.+.-.- ..--.-.- 29*7
; WC?!? 11001011 s 18
add p0, p0;15 n 16 a4 15 s 16 n 15 s0 17
adc p1, p1
sub p0, a4;13 d 13 z 13 s0 15 s4 12 d 15
sbc p1, a5
add p0, p0;11 s4 11 d 12 z 13 z 10 s0 13
adc p1, p1
add p0, a0; 9 d 9 aZ 10 a4 12 s0 9 z 11
adc p1, a1
add p0, p0; 7 s4 7 d 8 a4 10 d 7 d 10
adc p1, p1
add p0, a0; 5 s0 5 d 6 d 8 az 5 aZ 8
adc p1, a1
rjmp owing; 3 owi 3 s0 4 d 6 owi 3 d 6
; aZ 4 aZ 4
(The comments are, from left to right, a cycle count ("backwards from reaching done"), and further code sequences using the recodings in the same column in "the label line", using a shorthand of n for negate, d for double (partial) product, a0/a4/s0/s4 for adding or subtracting the multiplicand shifted 0 or 4 bits to the left, z for storing the product in ZL:ZH, and aZ/sZ for using that.)
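The recoding strings are easy to check mechanically. A small Python evaluator (my own sketch, assuming the digits are read MSB first) confirms that all five recodings on the locb label line are congruent to 203 modulo 256:

```python
def recode_value(s):
    """Evaluate a Booth-style recoding string (MSB first):
    '+' adds 1, '-' subtracts 1, '.' is a NOP at that bit position."""
    digit = {'+': 1, '-': -1, '.': 0}
    v = 0
    for c in s:
        v = 2 * v + digit[c]
    return v

# the five recodings on the locb label line
for s in [".-..+.++", "..--.-.-", "++..++.-", "++.+.-.-", "..--.-.-"]:
    assert recode_value(s) % 256 == 203
```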
The other worst cases using macros (and the above-mentioned shorthand):
loab:; .-.-.-.- +.++.-.- +.+.+.++ .-.-+.++ +.+.++.-
; WC 10101011
negP ;15 set4 ;16 add4 ;15 d 16 a4 16
sub4 ;12 doubleP ;15 pp2Z ;13 s4 14 d 14
pp2Z ;10 subA ;13 doubleP ;12 z 12 a0 12
doubleP ; 9 pp2Z ;11 doubleP ;10 d 11 d 10
doubleP ; 7 doubleP ;10 addZ ; 8 d 9 a4 8
addZ ; 5 doubleP ; 8 doubleP ; 6 aZ 7 d 6
owing ; 3 addZ ; 6 addA ; 4 a0 5 s0 4
; add4 ; 4
loac:; +.+.++.. ++-.++.. .-.-.-..
load: ; .-.-..-- .-.-.-.+ .--.++.+ 0.9 1.8 0.8 (avg)
; WC 10101101
negP ;15 -1 negP ;16 a4 a0 a0 17
sub4 ;12 -16-1 sub4 ;13 d s4 a0
pp2Z ;10 -16-1 doubleP ;11 a0 Z s4
sub4 ; 9 -32-1 doubleP ; 9 d d d
doubleP ; 7 -64-2 sub4 ; 7 a4 aZ d
addZ ; 5 -80-3 addA ; 5 d d a0
owing ; 3 owing ; 3 a0 a0 s4
(I'm not looking for more than one of the results at any single time/as the result of any single invocation - but if you have a way to get two in less than 23 cycles or all three in less than 26, please let me know. To substantiate my claim to know about CSE, (re)using the notations [rl]sh16 and add16 introduced by vlad_tepesch:
movw C, x ; 1
lsh16(C) ; 3 C=2x
movw A, C ; 4
swap Al ; 5
swap Ah ; 6
cbr Ah, 15 ; 7
add Ah, Al ; 8
cbr Al, 15 ; 9
sub Ah, Al ;10 A=32x
movw X, A ;11
add16(X, C) ;13 X=34x
movw B, X ;14
lsh16(X) ;16
lsh16(X) ;18 X=136X
add16(B, X) ;20 B=170X
add16(B, x) ;22 B=171X
add16(A, B) ;24 A=203X
add16(C, B) ;26 C=173X
- notice that the 22 cycles to the first result are just the same old 20 cycles, plus two register pair moves. The sequence of actions beside those is that of the third column/alternative following the label loab above.)
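A quick Python model of the register values through that sequence (arithmetic mod 2^16; the step comments are mine) confirms all three products:

```python
# Mirror the register contents after each step of the sequence above,
# for an arbitrary 16-bit multiplicand x (arithmetic mod 2**16).
M = 1 << 16
x = 0x1234
C = (2 * x) % M          # movw + lsh16:       C = 2*x
A = (32 * x) % M         # swap/cbr/add/sub:   A = 32*x
X = (A + C) % M          # add16(X, C):        X = 34*x
B = X
X = (4 * X) % M          # two lsh16:          X = 136*x
B = (B + X) % M          # add16(B, X):        B = 170*x
B = (B + x) % M          # add16(B, x):        B = 171*x
A = (A + B) % M          # add16(A, B):        A = 203*x
C = (C + B) % M          # add16(C, B):        C = 173*x
assert A == (203 * x) % M and B == (171 * x) % M and C == (173 * x) % M
```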
While 20 cycles (15 - 2 (rjmp) + 7 (*16)) doesn't look that bad, these are the worst cases. On an AVR CPU without a mul instruction, how can one multiply real fast by each of 203, 171, and 173?
(Moving one case to just before done or owing, making the other two of these faster, would shorten the critical path/improve the worst-case cycle count.)
I am not very familiar with AVR assembly, but I know the AVRs quite well, so I'll give it a try.
If you need these products in the same place, you could use common intermediate results and try adding powers-of-two multiples:
(a) 203: +128 +64 +8 +2 +1 = +128 +32 +8 +2 +1 +32
(b) 171: +128 +32 +8 +2 +1
(c) 173: +128 +32 +8 +4 +1 = +128 +32 +8 +2 +1 +2
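These decompositions are easy to verify (a throwaway Python check; note that all three share the 171 = 128+32+8+2+1 prefix):

```python
# the shared prefix is exactly 171
assert 128 + 32 + 8 + 2 + 1 == 171
# (a) 203 two ways: direct, and as the 171 prefix plus 32
assert 128 + 64 + 8 + 2 + 1 == 128 + 32 + 8 + 2 + 1 + 32 == 203
# (c) 173 two ways: direct, and as the 171 prefix plus 2
assert 128 + 32 + 8 + 4 + 1 == 128 + 32 + 8 + 2 + 1 + 2 == 173
```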
The key is that the 16-bit right shift and 16-bit addition have to be efficient. I do not know if I have overlooked something, but:
rsh16 (X):
LSR Xh
ROR Xl
and
add16 (Y, X):
ADD Yl, Xl
ADC Yh, Xh
Both take 2 cycles.
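In Python terms (my model of the AVR LSR/ROR and ADD/ADC semantics), each pair of instructions implements a plain 16-bit shift or add:

```python
def rsh16(x):
    """LSR Xh then ROR Xl: one-bit logical right shift of a 16-bit value."""
    xh, xl = x >> 8, x & 0xFF
    carry = xh & 1                 # LSR shifts bit 0 of Xh into the carry
    xh >>= 1
    xl = (xl >> 1) | (carry << 7)  # ROR shifts the carry into bit 7 of Xl
    return (xh << 8) | xl

def add16(y, x):
    """ADD Yl,Xl then ADC Yh,Xh: 16-bit addition modulo 2**16."""
    lo = (y & 0xFF) + (x & 0xFF)
    hi = (y >> 8) + (x >> 8) + (lo >> 8)  # ADC propagates the carry
    return ((hi & 0xFF) << 8) | (lo & 0xFF)

assert rsh16(0xCB00) == 0xCB00 >> 1
assert add16(0xFFFF, 1) == 0
```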
One register pair holds the current x*2^n value (Xh, Xl), and 3 other pairs (Ah, Al, Bh, Bl, Ch, Cl) hold the results.
1. Xh <- x; Xl <- 0 (X = 256*x)
2. rsh16(X) (X = 128*x)
3. B = X (B = 128*x)
4. rsh16(X); rsh16(X) (X = 32*x)
5. add16(B, X) (B = 128*x + 32*x)
6. A = X (A = 32*x)
7. rsh16(X); rsh16(X) (X = 8*x)
8. add16(B, X) (B = 128*x + 32*x + 8*x)
9. rsh16(X); rsh16(X) (X = 2*x)
10. add16(B, X) (B = 128*x + 32*x + 8*x + 2*x)
11. C = X (C = 2*x)
12. CLR Xh (clear Xh so only the carry is added below)
ADD Bl, x
ADC Bh, Xh (B = 128*x + 32*x + 8*x + 2*x + x = 171*x)
13. add16(A, B) (A = 32*x + B = 203*x)
14. add16(C, B) (C = 2*x + B = 173*x)
If I am correct, this sums up to 32 cycles for all three multiplications and requires 9 registers (1 in, 6 out, 2 temporary).
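The whole sequence can be simulated in a few lines of Python (my sketch of the steps above, for an 8-bit input x; the shifts lose no bits since 256*x fits in 16 bits):

```python
M = 1 << 16

def rsh16(v):            # LSR high byte, ROR low byte: v >> 1
    return v >> 1

def add16(y, v):         # ADD/ADC pair: 16-bit add modulo 2**16
    return (y + v) % M

def mul_203_171_173(x):
    """Model the 14 steps above for an 8-bit multiplicand x."""
    X = (x << 8) % M     # 1.  X = 256*x
    X = rsh16(X)         # 2.  X = 128*x
    B = X                # 3.
    X = rsh16(rsh16(X))  # 4.  X = 32*x
    B = add16(B, X)      # 5.  B = 160*x
    A = X                # 6.  A = 32*x
    X = rsh16(rsh16(X))  # 7.  X = 8*x
    B = add16(B, X)      # 8.  B = 168*x
    X = rsh16(rsh16(X))  # 9.  X = 2*x
    B = add16(B, X)      # 10. B = 170*x
    C = X                # 11. C = 2*x
    B = add16(B, x)      # 12. B = 171*x
    A = add16(A, B)      # 13. A = 203*x
    C = add16(C, B)      # 14. C = 173*x
    return A, B, C

assert all(mul_203_171_173(x) == (203 * x % M, 171 * x % M, 173 * x % M)
           for x in range(256))
```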