Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient computation of the high order bits of a 32 bit integer multiplication

Many CPUs have single assembly opcodes for returning the high order bits of a 32 bit integer multiplication. Normally multiplying two 32 bit integers produces a 64 bit result, but this is truncated to the low 32 bits if you store it in a 32 bit integer.

For example, on PowerPC, the mulhw opcode returns the high 32 bits of the 64 bit result of a 32x32 bit multiply in one clock. This is exactly what I'm looking for, but more portably. There's a similar opcode, umulhi(), in NVidia CUDA.

In C/C++, is there an efficient way to return the high order bits of the 32x32 multiply? Currently I compute it by casting to 64 bits, something like:

unsigned int umulhi32(unsigned int x, unsigned int y)
{
  unsigned long long xx=x;
  xx*=y;
  return (unsigned int)(xx>>32);
}

but this is over 11 times slower than a regular 32 by 32 multiply because I'm using overkill 64 bit math even for the multiply.

Is there a faster way to compute the high order bits?

This is clearly not best solved with a BigInteger library (which is overkill and will have huge overhead).

SSE seems to have PMULHUW, a 16x16 -> top 16 bit version of this, but not a 32x32 -> top 32 version like I'm looking for.

like image 509
SPWorley Avatar asked Sep 08 '09 23:09

SPWorley


People also ask

How many integers is 32 bits?

How many integers can be represented in 32 bits? Using 32 bits up to 4,294,967,296 pieces of unique data can be stored. As signed integers, the range is -2,147,483,648 to 2,147,483,647.

How many numbers can you store in 32 bits?

A 32-bit register can store 232 different values.

What is 32-bit integer in C?

32-bit signed integer type is used to store negativ or pozitiv whole number. 32-bit integer and his value range: from -2147483648 to 2147483647.


3 Answers

gcc 4.3.2, with -O1 optimisation or higher, translated your function exactly as you showed it to IA32 assembly like this:

umulhi32:
        pushl   %ebp
        movl    %esp, %ebp
        movl    12(%ebp), %eax
        mull    8(%ebp)
        movl    %edx, %eax
        popl    %ebp
        ret

Which is just doing a single 32 bit mull and putting the high 32 bits of the result (from %edx) into the return value.

That's what you wanted, right? Sounds like you just need to turn up the optimisation on your compiler ;) It's possible you could push the compiler in the right direction by eliminating the intermediate variable:

unsigned int umulhi32(unsigned int x, unsigned int y)
{
  return (unsigned int)(((unsigned long long)x * y)>>32);
}
like image 143
caf Avatar answered Oct 18 '22 21:10

caf


I don't think there's a way to do this in standard C/C++ better than what you already have. What I'd do is write up a simple assembly wrapper that returned the result you want.

Not that you're asking about Windows, but as an example even though Windows has an API that sounds like it does what you want (a 32 by 32 bit multiply while obtaining the full 64 bit result), it implements the multiply as a macro that does what you're doing:

#define UInt32x32To64( a, b ) (ULONGLONG)((ULONGLONG)(DWORD)(a) * (DWORD)(b))
like image 45
Michael Burr Avatar answered Oct 18 '22 21:10

Michael Burr


On 32 bit intel, a multiply affects two registers for the output. That is, the 64 bits are fully available, whether you want it or not. Its just a function of whether the compiler is smart enough to take advantage of it.

Modern compilers do amazing things, so my suggestion is to experiment with optimization flags some more, at least on Intel. You would think that the optimizer might know that the processor produces a 64 bit value from 32 by 32 bits.

That said, at some point I tried to get the compiler to use the modulo as well as the dividend on a division result, but the old Microsoft compiler from 1998 was not smart enough to realize the same instruction produced both results.

like image 2
Matthias Wandel Avatar answered Oct 18 '22 22:10

Matthias Wandel