Most efficient small-word-size multiply for processors without a hardware multiplier

Question

I'm hoping to use the CH32V003 (an RV32EC processor) to run ColorChord, which makes extensive use of multiply-adds to perform DFTs. It can operate at very low bit depths, with 16- or even 8-bit multiplies. However, the RV32EC core in the CH32V003 doesn't support the RISC-V multiply (M) extension.

I've tried exploring options in Godbolt (https://godbolt.org/z/zqTEaeecr) to see what the compiler would do in these situations, but it seems to only emit a call to __mulsi3, which performs a naive 32-bit multiply: https://github.com/gcc-mirror/gcc/blob/master/libgcc/config/epiphany/mulsi3.c
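For context, the kind of shift-and-add loop that a generic software multiply routine typically uses can be sketched as below, narrowed to 8-bit operands. This is an illustrative sketch and not the libgcc source; the name `mul8_shift_add` is hypothetical. With 8-bit inputs the loop runs at most 8 times, versus up to 32 for a full-width routine.

```c
#include <stdint.h>

/* Hypothetical helper (not from libgcc): 8x8 -> 16-bit unsigned multiply
   using shift-and-add. Loops once per bit of `a` up to its highest set
   bit, so at most 8 iterations. */
static uint16_t mul8_shift_add(uint8_t a, uint8_t b)
{
    uint16_t acc = 0;        /* running product */
    uint16_t addend = b;     /* b shifted left by the current bit index */

    while (a) {
        if (a & 1u) {
            acc = (uint16_t)(acc + addend);   /* this bit of a is set */
        }
        addend = (uint16_t)(addend << 1);     /* next bit weighs twice as much */
        a >>= 1;
    }
    return acc;
}
```

For example, `mul8_shift_add(13, 17)` walks the four significant bits of 13 and adds 17, 68 and 136 to the accumulator.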

What I'm hoping is that there's some ultra-efficient route to do something like a combined multiply-and-shift for different situations.

Is there a good guide or discussion on performing extremely efficient multiplies for special combinations of bit widths and signedness on architectures that don't have a hardware multiplier?

Nikola Anderbaum · Accepted Answer

You've got 16kB of flash available. Why don't you use 1kB for storing a "squares/4" table such as...

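/* Sqr_4[n] = n*n/4 (rounded down) for n = 0..510; 511 two-byte entries is about 1 kB. */
const uint16_t Sqr_4[511] = {0/4, 1/4, 4/4, 9/4, 16/4, 25/4, ..., 260100/4};

/* 8x8 -> 16-bit unsigned multiply: x*y = Sqr_4[x+y] - Sqr_4[|x-y|]. */
uint16_t umul8b(uint8_t x, uint8_t y) {
    return Sqr_4[(uint16_t)x + y] - ((x > y) ? Sqr_4[x - y] : Sqr_4[y - x]);
}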
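The lookup relies on the quarter-square identity x*y = floor((x+y)^2/4) - floor((x-y)^2/4). It is exact because x+y and x-y always have the same parity, so the two rounding errors cancel. The sketch below is one way to check this on a host machine: it fills the table at run time using additions only, then multiplies through it. The names `sqr4`, `sqr4_init` and `umul8_qs` are hypothetical, and on the CH32V003 itself the table would more likely be a precomputed `const` array in flash, as in the snippet above, rather than a RAM array.

```c
#include <stdint.h>

/* Quarter-square table: sqr4[n] = floor(n*n / 4) for n = 0..510.
   The largest entry is 510*510/4 = 65025, which fits in 16 bits. */
static uint16_t sqr4[511];

/* Fill the table without any multiply, using (n+1)^2 = n^2 + 2n + 1. */
static void sqr4_init(void)
{
    uint32_t sq = 0;                          /* running value of n*n */
    for (uint16_t n = 0; n < 511; n++) {
        sqr4[n] = (uint16_t)(sq >> 2);        /* floor(n*n / 4) */
        sq += ((uint32_t)n << 1) + 1u;        /* step to (n+1)^2 */
    }
}

/* 8x8 -> 16-bit unsigned multiply via two table lookups and a subtract.
   sqr4_init() must have been called first. */
static uint16_t umul8_qs(uint8_t x, uint8_t y)
{
    uint16_t sum  = (uint16_t)(x + y);                              /* 0..510 */
    uint16_t diff = (x > y) ? (uint16_t)(x - y) : (uint16_t)(y - x); /* |x-y| */
    return (uint16_t)(sqr4[sum] - sqr4[diff]);
}
```

After calling `sqr4_init()` once, `umul8_qs(x, y)` can be compared against `x * y` for all 65536 input pairs on a machine that does have a multiplier.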

Tags:

assembly

embedded

multiplication

riscv

Charles Lohr
