
Most efficient small-word-size multiply for processors without a hardware multiplier

I'm hoping to use the CH32V003 (an RV32EC processor) to run ColorChord, which makes extensive use of multiply-adds to perform DFTs. ColorChord can operate at very low bit depths, using 16- or even 8-bit multiplies. However, the RV32EC core in the CH32V003 doesn't support the RISC-V multiply extension (M).

I've tried exploring options in godbolt (see https://godbolt.org/z/zqTEaeecr) to see what the compiler would do in these situations, but it seems to only call __mulsi3, which performs a naive 32-bit multiply: https://github.com/gcc-mirror/gcc/blob/master/libgcc/config/epiphany/mulsi3.c

What I'm hoping is that there's some ultra-efficient route to do something like a combined multiply-and-shift for different situations.

Is there a good guide or discussion on performing extremely efficient multiplies for special combinations of bit width and signedness on architectures that don't have a hardware multiplier?

Charles Lohr asked Dec 05 '25 20:12



1 Answer

You've got 16kB of flash available. Why don't you use 1kB for storing a "squares/4" table such as...

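#include <stdint.h>

/* Sqr_4[n] = n*n/4 (rounded down) for n = 0..510: 511 entries * 2 bytes = 1022 bytes. */
const uint16_t Sqr_4[511] = {0/4, 1/4, 4/4, 9/4, 16/4, 25/4, ..., 260100/4};

/* Quarter-square multiply: x*y = (x+y)^2/4 - (x-y)^2/4, using two lookups and one subtract. */
uint16_t umul8b(uint8_t x, uint8_t y) {
    return Sqr_4[(uint16_t)x + y] - ((x > y) ? Sqr_4[x - y] : Sqr_4[y - x]);
}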
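
To check the idea on a host machine (or to generate the table initializer), something like the following should work. It's only a sketch: the names sqr4_table, sqr4_init and umul8_qs are made up for illustration, and it fills the table in RAM at startup instead of keeping it const in flash as suggested above. The table is built by adding successive odd numbers, so no multiply is needed to create it either. The result is exact because x+y and x-y always have the same parity, so the rounding in the two table entries cancels.

```c
#include <stdint.h>

/* Hypothetical names, for illustration only. */
uint16_t sqr4_table[511];           /* sqr4_table[n] = floor(n*n / 4), n = 0..510 */

/* Fill the table without multiplying: n*n is accumulated from odd numbers. */
void sqr4_init(void) {
    uint32_t square = 0;            /* running value of n*n */
    for (uint32_t n = 0; n < 511; n++) {
        sqr4_table[n] = (uint16_t)(square >> 2);
        square += (n << 1) + 1;     /* (n+1)^2 = n^2 + 2n + 1 */
    }
}

/* Unsigned 8x8 -> 16 multiply: x*y = floor((x+y)^2/4) - floor((x-y)^2/4). */
uint16_t umul8_qs(uint8_t x, uint8_t y) {
    uint16_t sum  = (uint16_t)(x + y);                              /* 0..510 */
    uint16_t diff = (x > y) ? (uint16_t)(x - y) : (uint16_t)(y - x); /* 0..255 */
    return (uint16_t)(sqr4_table[sum] - sqr4_table[diff]);
}
```

Call sqr4_init() once before the first umul8_qs() call. On the actual chip you would presumably dump the table values into a const array so it lives in flash rather than RAM.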
Nikola Anderbaum answered Dec 08 '25 23:12



