I am working on an assembly language program for a 6502 cpu, and am finding that I need a fast-as-possible divide-by-seven routine, in particular one which could take a 16-bit dividend.
I am familiar with the routines found here, but generalizing the divide-by-seven routine found there is quite complicated, and a cursory examination of the general algorithm (using integer division)
x/7 ~= (x + x/8 + x/64 ... )/8
indicates that to handle a 16-bit range, it would likely take over 100 cycles to complete because of the 6502's single accumulator register and the relative slowness of individual memory bit shifts on the 6502.
I thought that a lookup table might help, but on the 6502, I am of course restricted to lookup tables that are 256 bytes or fewer. To that end one could assume the existence of two 256-byte lookup tables, xdiv7 and xmod7, which, when using an unsigned one-byte value as an index into the table, can quickly obtain the result of a byte divided by 7 or modulo 7, respectively. I am not sure how I could utilize these to find the values for the full 16-bit range, however.
In parallel, I also need a modulo 7 algorithm, although ideally whatever solution that can be worked out with the division will produce a mod7 result as well. If additional precomputed tables are needed, I am amenable to adding those, as long as the total memory requirements for all tables does not exceed about 3k.
Although I ultimately need a signed division algorithm, an unsigned one would suffice, because I can generalize that to a signed routine as needed.
Any assistance would be greatly appreciated.
A different way to do this would be to convert the division into a multiplication.
To work out the multiplication factor, we are interested in taking the reciprocal. Essentially we do:
d = n*(1/7)
To make things more accurate we multiply up by by a convenient power of 2. 2^16 works well:
d = floor(n*floor(65536/7)/65536)
The multiplication factor is: floor(65536/7) which is 9362. The size of the result will be:
ceiling(log2(65535*9362)) = 30 bits (4 bytes rounded up)
We can then discard the lower two bytes to divide by 65536, or simply just use the upper 2 bytes for the end result.
To work out the actual rotates and adds we need, we examine the binary representation of the factor 9362:
10010010010010
Note the repetition of the bit pattern. So an efficient scheme would be to calculate:
((n*9*256/4 + n*9)*8 + n)*2 = 9362*n
Calculating n*9 only needs ceiling(log2(65535*9)) = 20 bits (3 bytes).
In pseudo assembly this is:
LDA number ; lo byte
STA multiply_nine
LDA number+1 ; high byte
STA multiply_nine+1
LDA #0
STA multiply_nine+2 ; 3 byte result
ASL multiply_nine ; multiply by 2
ROL multiply_nine+1
ROL mulitply_nine+2
ASL multiply_nine ; multiply by 2 (4)
ROL multiply_nine+1
ROL mulitply_nine+2
ASL multiply_nine ; multiply by 2 (8)
ROL multiply_nine+1
ROL mulitply_nine+2
CLC ; not really needed as result is only 20 bits, carry always zero
LDA multiply_nine
ADC number
STA multiply_nine
LDA multiply_nine+1
ADC number+1
STA multiply_nine+1
LDA multiply_nine+2
ADC #0
STA multiply_nine+2 ; n*9
The rest of the exercise I leave to the OP. Note it's not necessary to multiply by 256, as this is just a whole byte shift up.
Note: As @Damien_The_Unbeliever pointed out in the comments, the upperHigh
and lowerLow
tables are identical. So they can be combined into a single table. However, that optimization will make the code harder to read, and the explanation harder to write, so combining the tables is left as an exercise for the reader.
The code below shows how to generate the quotient and remainder when dividing a 16-bit unsigned value by 7. The simplest way to explain the code (IMO) is with an example, so let's consider dividing 0xa732
by 7. The expected result is:
quotient = 0x17e2
remainder = 4
We start by considering the input as two 8-bit values, the upper
byte and the lower
byte. The upper
byte is 0xa7
and the lower
byte is 0x32
.
We compute a quotient and remainder from the upper
byte:
0xa700 / 7 = 0x17db
0xa700 % 7 = 3
So we need three tables:
upperHigh
stores the high byte of the quotient: upperHigh[0xa7] = 0x17
upperLow
stores the low byte of the quotient: upperLow[0xa7] = 0xdb
upperRem
stores the remainder: upperRem[0xa7] = 3
And we compute the quotient and remainder from the lower
byte:
0x32 / 7 = 0x07
0x32 % 7 = 1
So we need two tables:
lowerLow
stores the low byte of the quotient: lowerLow[0x32] = 0x07
lowerRem
stores the remainder: lowerRem[0x32] = 1
Now we need to assemble the final answer. The remainder is the sum of two remainders. Since each remainder is in the range [0,6] the sum is in the range [0,12]. So we can use two 13 byte lookups to convert the sum into a final remainder and a carry.
The low byte of the quotient is the sum of that carry and the values from the lowerLow
and upperLow
tables. Note that the sum may generate a carry into the high byte.
The high byte of the quotient is the sum of that carry and the value from the upperHigh
table.
So, to complete the example:
remainder = 1 + 3 = 4 // simple add (no carry in)
lowResult = 0x07 + 0xdb = 0xe2 // add with carry from remainder
highResult = 0x17 // add with carry from lowResult
The assembly code to implement this consists of 7 table lookups, an add-without-carry instruction and two add-with-carry instructions.
#include <stdio.h>
#include <stdint.h>
uint8_t upperHigh[256]; // index:(upper 8 bits of the number) value:(high 8 bits of the quotient)
uint8_t upperLow[256]; // index:(upper 8 bits of the number) value:(low 8 bits of the quotient)
uint8_t upperRem[256]; // index:(upper 8 bits of the number) value:(remainder when dividing the upper bits by 7)
uint8_t lowerLow[256]; // index:(lower 8 bits of the number) value:(low 8 bits of the quotient)
uint8_t lowerRem[256]; // index:(lower 8 bits of the number) value:(remainder when dividing the lower bits by 7)
uint8_t carryRem[13] = { 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1 };
uint8_t combinedRem[13] = { 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5 };
void populateLookupTables(void)
{
for (uint16_t i = 0; i < 256; i++)
{
uint16_t upper = i << 8;
upperHigh[i] = (upper / 7) >> 8;
upperLow[i] = (upper / 7) & 0xff;
upperRem[i] = upper % 7;
uint16_t lower = i;
lowerLow[i] = lower / 7;
lowerRem[i] = lower % 7;
}
}
void divideBy7(uint8_t upperValue, uint8_t lowerValue, uint8_t *highResult, uint8_t *lowResult, uint8_t *remainder)
{
uint8_t temp = upperRem[upperValue] + lowerRem[lowerValue];
*remainder = combinedRem[temp];
*lowResult = upperLow[upperValue] + lowerLow[lowerValue] + carryRem[temp];
uint8_t carry = (upperLow[upperValue] + lowerLow[lowerValue] + carryRem[temp]) >> 8; // Note this is just the carry flag from the 'lowResult' calcaluation
*highResult = upperHigh[upperValue] + carry;
}
int main(void)
{
populateLookupTables();
uint16_t n = 0;
while (1)
{
uint8_t upper = n >> 8;
uint8_t lower = n & 0xff;
uint16_t quotient1 = n / 7;
uint16_t remainder1 = n % 7;
uint8_t high, low, rem;
divideBy7(upper, lower, &high, &low, &rem);
uint16_t quotient2 = (high << 8) | low;
uint16_t remainder2 = rem;
printf("n=%u q1=%u r1=%u q2=%u r2=%u", n, quotient1, remainder1, quotient2, remainder2);
if (quotient1 != quotient2 || remainder1 != remainder2)
printf(" **** failed ****");
printf("\n");
n++;
if (n == 0)
break;
}
}
At Unsigned Integer Division Routines for 8-bit division by 7:
;Divide by 7 (From December '84 Apple Assembly Line)
;15 bytes, 27 cycles
sta temp
lsr
lsr
lsr
adc temp
ror
lsr
lsr
adc temp
ror
lsr
lsr
The estimate of about 100 cycles with shifts was pretty accurate: 104 cycles to the last ror, 106 cycles total not including rts
, 112 cycles for the whole function.
NOTE: after assembly for C64 and using VICE emulator for C64 I found the algorithm fails, for example 65535 gives 9343 and the correct answer is 9362.
; for 16 bit division by 7
; input:
; register A is low byte
; register X is high byte
; output
; register A is low byte
; register X is high byte
;
; memory on page zero
; temp is on page zero, 2 bytes
; aHigh is on page zero, 1 byte
--
sta temp
stx temp+1
stx aHigh
--
lsr aHigh
ror a
lsr aHigh
ror a
lsr aHigh
ror a
---
adc temp
tax
lda aHigh
adc temp+1
sta aHigh
txa
--
ror aHigh
ror a
lsr aHigh
ror a
lsr aHigh
ror a
--
adc temp
tax
lda aHigh
adc temp+1
sta aHigh
txa
--
ror aHigh
ror a
lsr aHigh
ror a
lsr aHigh
ror a -- 104 cycles
;-------
ldx aHigh ; -- 106
rts ; -- 112 cycles
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With