In Hacker's delight there is an algorithm to calculate the double word product of two (signed) words.
The function muldws1
uses four multiplications and five additions to calculate
the double word from two words.
Towards the end of that code there is a line commented out
/* w[1] = u*v; // Alternative. */
This alternative uses five multiplications and four addition, i.e. it exchanges an addition for a multiplication.
But I think this alternative method can be improved. I have not said anything about hardware yet. Let's assume a hypothetical CPU which can calculate the lower word of the product of two words but not the upper word (e.g. for 32-bit words 32x32 to lower 32). In this case it seems to me that this algorithm can be improved. Here is what I have come up with assuming 32-bit words (the same concept would work for 64-bit words).
void muldws1_improved(int w[], int32_t x, int32_t y) {
uint16_t xl = x; int16_t xh = x >> 16;
uint16_t yl = y; int16_t yh = y >> 16;
uint32 lo = x*y;
int32_t t = xl*yh + xh*yl;
uint16_t tl = t; int16_t th = t >>16;
uint16_t loh = lo >> 16;
int32_t cy = loh<tl; //carry
int32_t hi = xh*yh + th + cy;
w[0] = hi; w[1] = lo;
}
This uses four multiplications, three additions, and one comparison. This is a smaller improvement then I had hoped for.
Can this be improved? Is there a better way to determine the carry flag? I should point out I am also assuming the hardware has no carry flag (e.g. no ADDC instruction) but words can be compared (e.g. word1<word
).
Edit: as Sander De Dycker pointed out my function fails the unit tests. Here is a version which passes the unit tests but it's less efficient. I think it can be improved.
void muldws1_improved_v2(int w[], int32_t x, int32_t y) {
uint16_t xl = x; int16_t xh = x >> 16;
uint16_t yl = y; int16_t yh = y >> 16;
uint32_t lo = x*y;
int32_t t2 = xl*yh;
int32_t t3 = xh*yl;
int32_t t4 = xh*yh;
uint16_t t2l = t2; int16_t t2h = t2 >>16;
uint16_t t3l = t3; int16_t t3h = t3 >>16;
uint16_t loh = lo >> 16;
uint16_t t = t2l + t3l;
int32_t carry = (t<t2l) + (loh<t);
int32_t hi = t4 + t2h + t3h + carry;
w[0] = hi; w[1] = lo;
}
This uses four multiplications, five additions, and two comparisons which is worse that the original function.
There were two problems with my muldws1_improved
function in my question. One of them is that it missed a carry when I did xl*yh + xh*yl
. This is why it failed the unit tests. But the other is that there are signed*unsigned products which require a lot more machine logic than is seen in the C code. (see my edit below). I found a better solution which is to optimized the unsigned product function muldwu1 first and then do
muldwu1(w,x,y);
w[0] -= ((x<0) ? y : 0) + ((y<0) ? x : 0);
to correct for the sign.
Here is my attempt at improving the muldwu1
using the lower word lo = x*y
(yes this function passes the unit tests from Hacker's delight).
void muldwu1_improved(uint32_t w[], uint32_t x, uint32_t y) {
uint16_t xl = x; uint16_t xh = x >> 16;
uint16_t yl = y; uint16_t yh = y >> 16;
uint32_t lo = x*y; //32x32 to 32
uint32_t t1 = xl*yh; //16x16 to 32
uint32_t t2 = xh*yl; //16x16 to 32
uint32_t t3 = xh*yh; //16x16 to 32
uint32_t t = t1 + t2;
uint32_t tl = 0xFFFF & t;
uint32_t th = t >> 16;
uint32_t loh = lo >> 16;
uint32_t cy = ((t<t1) << 16) + (loh<tl); //carry
w[1] = lo;
w[0] = t3 + th + cy;
}
This uses one less addition than the original function from Hacker's delight but it has to do two comparisons
1 mul32x32 to 32
3 mul16x16 to 32
4 add32
5 shift logical (or shuffles)
1 and
2 compare32
***********
16 operations
Edit:
I was bothered by a statement in Hacker's Delight (2nd Edition) which says in regards to the mulhs and mulhu algorithm.
The algorithm requires 16 basic RISC instructions in either the signed or unsigned version, four of which are multiplications.
I implemented the unsigned algorithm in only 16 SSE instructions but my signed version required more instructions. I figured out why and I can now answer my own question.
The reason I failed to find a better version that in Hacker's Delight is that their hypothetical RISC processor has an instruction which calculates the lower word of the product of two words. In other words, their algorithm is already optimized for this case and so it's unlikely there is a better version than the one they already have.
The reason they list an alternative is because they assume multiplication (and division) may be more expensive than other instructions and so they left the alternative as a case to optimize on.
So the C code does not hide significant machine logic. It assumes the machine can do word * word to lower word.
Why does this matter? In their algorithm they do first
u0 = u >> 16;
and later
t = u0*v1 + k;
if u = 0x80000000
u0 = 0xffff8000
. However, if your CPU can only take half word products to get a full word the upper half word of u0
is ignored and you get the wrong signed result.
In this case you should calculate the unsigned upper word and then correct using hi -= ((x<0) ? y : 0) + ((y<0) ? x : 0);
as I already stated.
The reason I am interested in this is that Intel's SIMD instruction (SSE2 through AVX2) do not have an instruction which does 64x64 to 64, they only have 32x32 to 64. That's why my signed version requires more instructions.
But AVX512 has a 64x64 to 64 instruction. Therefore with AVX512 the signed version should take the same number of instructions as the unsigned. However, since the 64x64 to 64 instruction may be much slower than the 32x32 to 64 instruction it may make more sense to do the unsigned version anyway and then correct.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With