Fast signed 16-bit divide by 7 for 6502

Question

I am working on an assembly language program for a 6502 cpu, and am finding that I need a fast-as-possible divide-by-seven routine, in particular one which could take a 16-bit dividend.

I am familiar with the routines found here, but generalizing the divide-by-seven routine found there is quite complicated, and a cursory examination of the general algorithm (using integer division)

x/7 ~= (x + x/8 + x/64 ... )/8

indicates that to handle a 16-bit range, it would likely take over 100 cycles to complete because of the 6502's single accumulator register and the relative slowness of individual memory bit shifts on the 6502.

I thought that a lookup table might help, but on the 6502, I am of course restricted to lookup tables that are 256 bytes or fewer. To that end one could assume the existence of two 256-byte lookup tables, xdiv7 and xmod7, which, when using an unsigned one-byte value as an index into the table, can quickly obtain the result of a byte divided by 7 or modulo 7, respectively. I am not sure how I could utilize these to find the values for the full 16-bit range, however.

In parallel, I also need a modulo 7 algorithm, although ideally whatever solution that can be worked out with the division will produce a mod7 result as well. If additional precomputed tables are needed, I am amenable to adding those, as long as the total memory requirements for all tables does not exceed about 3k.

Although I ultimately need a signed division algorithm, an unsigned one would suffice, because I can generalize that to a signed routine as needed.

Any assistance would be greatly appreciated.

Guillermo Phillips · Accepted Answer

A different way to do this would be to convert the division into a multiplication.

To work out the multiplication factor, we are interested in taking the reciprocal. Essentially we do:

d = n*(1/7)

To make things more accurate we multiply up by by a convenient power of 2. 2^16 works well:

d = floor(n*floor(65536/7)/65536)

The multiplication factor is: floor(65536/7) which is 9362. The size of the result will be:

ceiling(log2(65535*9362)) = 30 bits (4 bytes rounded up)

We can then discard the lower two bytes to divide by 65536, or simply just use the upper 2 bytes for the end result.

To work out the actual rotates and adds we need, we examine the binary representation of the factor 9362:

10010010010010

Note the repetition of the bit pattern. So an efficient scheme would be to calculate:

((n*9*256/4 + n*9)*8 + n)*2 = 9362*n

Calculating n*9 only needs ceiling(log2(65535*9)) = 20 bits (3 bytes).

In pseudo assembly this is:

LDA number          ; lo byte
STA multiply_nine
LDA number+1        ; high byte
STA multiply_nine+1
LDA #0              
STA multiply_nine+2 ; 3 byte result

ASL multiply_nine   ; multiply by 2
ROL multiply_nine+1
ROL mulitply_nine+2
ASL multiply_nine   ; multiply by 2 (4)
ROL multiply_nine+1
ROL mulitply_nine+2
ASL multiply_nine   ; multiply by 2 (8)
ROL multiply_nine+1
ROL mulitply_nine+2

CLC                 ; not really needed as result is only 20 bits, carry always zero
LDA multiply_nine
ADC number
STA multiply_nine
LDA multiply_nine+1
ADC number+1
STA multiply_nine+1
LDA multiply_nine+2
ADC #0
STA multiply_nine+2 ; n*9

The rest of the exercise I leave to the OP. Note it's not necessary to multiply by 256, as this is just a whole byte shift up.

user3386109 · Answer

Note: As @Damien_The_Unbeliever pointed out in the comments, the upperHigh and lowerLow tables are identical. So they can be combined into a single table. However, that optimization will make the code harder to read, and the explanation harder to write, so combining the tables is left as an exercise for the reader.

The code below shows how to generate the quotient and remainder when dividing a 16-bit unsigned value by 7. The simplest way to explain the code (IMO) is with an example, so let's consider dividing 0xa732 by 7. The expected result is:

quotient = 0x17e2
remainder = 4

We start by considering the input as two 8-bit values, the upper byte and the lower byte. The upper byte is 0xa7 and the lower byte is 0x32.

We compute a quotient and remainder from the upper byte:

0xa700 / 7 = 0x17db
0xa700 % 7 = 3

So we need three tables:

upperHigh stores the high byte of the quotient: upperHigh[0xa7] = 0x17
upperLow stores the low byte of the quotient: upperLow[0xa7] = 0xdb
upperRem stores the remainder: upperRem[0xa7] = 3

And we compute the quotient and remainder from the lower byte:

0x32 / 7 = 0x07
0x32 % 7 = 1

So we need two tables:

lowerLow stores the low byte of the quotient: lowerLow[0x32] = 0x07
lowerRem stores the remainder: lowerRem[0x32] = 1

Now we need to assemble the final answer. The remainder is the sum of two remainders. Since each remainder is in the range [0,6] the sum is in the range [0,12]. So we can use two 13 byte lookups to convert the sum into a final remainder and a carry.

The low byte of the quotient is the sum of that carry and the values from the lowerLow and upperLow tables. Note that the sum may generate a carry into the high byte.

The high byte of the quotient is the sum of that carry and the value from the upperHigh table.

So, to complete the example:

remainder = 1 + 3 = 4              // simple add (no carry in)
lowResult = 0x07 + 0xdb = 0xe2     // add with carry from remainder
highResult = 0x17                  // add with carry from lowResult

The assembly code to implement this consists of 7 table lookups, an add-without-carry instruction and two add-with-carry instructions.

#include <stdio.h>
#include <stdint.h>

uint8_t upperHigh[256];  // index:(upper 8 bits of the number)  value:(high 8 bits of the quotient)
uint8_t upperLow[256];   // index:(upper 8 bits of the number)  value:(low  8 bits of the quotient)
uint8_t upperRem[256];   // index:(upper 8 bits of the number)  value:(remainder when dividing the upper bits by 7)
uint8_t lowerLow[256];   // index:(lower 8 bits of the number)  value:(low  8 bits of the quotient)
uint8_t lowerRem[256];   // index:(lower 8 bits of the number)  value:(remainder when dividing the lower bits by 7)
uint8_t carryRem[13]    = { 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1 };
uint8_t combinedRem[13] = { 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5 };

void populateLookupTables(void)
{
    for (uint16_t i = 0; i < 256; i++)
    {
        uint16_t upper = i << 8;
        upperHigh[i] = (upper / 7) >> 8;
        upperLow[i] = (upper / 7) & 0xff;
        upperRem[i] = upper % 7;

        uint16_t lower = i;
        lowerLow[i] = lower / 7;
        lowerRem[i] = lower % 7;
    }
}

void divideBy7(uint8_t upperValue, uint8_t lowerValue, uint8_t *highResult, uint8_t *lowResult, uint8_t *remainder)
{
    uint8_t temp = upperRem[upperValue] + lowerRem[lowerValue];
    *remainder = combinedRem[temp];
    *lowResult = upperLow[upperValue] + lowerLow[lowerValue] + carryRem[temp];
    uint8_t carry = (upperLow[upperValue] + lowerLow[lowerValue] + carryRem[temp]) >> 8;  // Note this is just the carry flag from the 'lowResult' calcaluation
    *highResult = upperHigh[upperValue] + carry;
}

int main(void)
{
    populateLookupTables();

    uint16_t n = 0;
    while (1)
    {
        uint8_t upper = n >> 8;
        uint8_t lower = n & 0xff;

        uint16_t quotient1  = n / 7;
        uint16_t remainder1 = n % 7;

        uint8_t high, low, rem;
        divideBy7(upper, lower, &high, &low, &rem);
        uint16_t quotient2 = (high << 8) | low;
        uint16_t remainder2 = rem;

        printf("n=%u q1=%u r1=%u q2=%u r2=%u", n, quotient1, remainder1, quotient2, remainder2);
        if (quotient1 != quotient2 || remainder1 != remainder2)
            printf(" **** failed ****");
        printf("
");

        n++;
        if (n == 0)
            break;
    }
}

alvalongo · Answer

At Unsigned Integer Division Routines for 8-bit division by 7:

;Divide by 7 (From December '84 Apple Assembly Line)
;15 bytes, 27 cycles
  sta  temp
  lsr
  lsr
  lsr
  adc  temp
  ror
  lsr
  lsr
  adc  temp
  ror
  lsr
  lsr

The estimate of about 100 cycles with shifts was pretty accurate: 104 cycles to the last ror, 106 cycles total not including rts, 112 cycles for the whole function.
NOTE: after assembly for C64 and using VICE emulator for C64 I found the algorithm fails, for example 65535 gives 9343 and the correct answer is 9362.

   ; for 16 bit division  by 7
   ; input:
  ;   register A is low byte
  ;   register X is high byte
  ; output 
  ;   register A is low byte
  ;   register X is high byte
  ;
  ; memory on page zero
  ; temp     is on page zero, 2 bytes
  ; aHigh    is on page zero, 1 byte
  --
  sta temp
  stx temp+1
  stx aHigh
  --
  lsr aHigh
  ror a
  lsr aHigh
  ror a
  lsr aHigh
  ror a
  ---
  adc temp
  tax
  lda aHigh
  adc temp+1
  sta aHigh
  txa
  --
  ror aHigh
  ror a
  lsr aHigh
  ror a
  lsr aHigh
  ror a
  --
  adc temp
  tax
  lda aHigh
  adc temp+1
  sta aHigh
  txa
  --
  ror aHigh
  ror a
  lsr aHigh
  ror a
  lsr aHigh
  ror a     -- 104 cycles
  ;-------
  ldx aHigh  ; -- 106
  rts        ; -- 112 cycles

Fast signed 16-bit divide by 7 for 6502

Tags:

assembly

integer-division

division

micro-optimization

6502

markt1964

3 Answers

Guillermo Phillips

user3386109

alvalongo

Recent Activity

Donate For Us

Fast signed 16-bit divide by 7 for 6502

Tags:

assembly

integer-division

division

micro-optimization

6502

markt1964

3 Answers

Guillermo Phillips

user3386109

alvalongo

Related questions

Recent Activity

Donate For Us