Fast integer matrix multiplication with bit-twiddling hacks

Tags: c++, performance, algorithm, matrix-multiplication

I am asking whether integer matrix multiplication can be sped up considerably with bitwise operations. The matrices are small, and the elements are small nonnegative integers (small means at most 20).

To keep us focused, let's be extremely specific, and say that I have two 3x3 matrices, with integer entries 0 <= x < 15.

The following naive C++ implementation takes around 1 s for a million multiplications, measured with the Linux time command.

#include <random>

int main() {
    // Random number generator
    std::random_device rd;
    std::mt19937 eng(rd());
    // Bounds are inclusive, so (0, 14) matches the stated range 0 <= x < 15
    std::uniform_int_distribution<> distr(0, 14);

    int A[3][3];
    int B[3][3];
    int C[3][3];
    // Strict < so the loop body runs exactly one million times
    for (int trials = 0; trials < 1000000; ++trials) {
        // Set up A[] and B[], and clear C[]
        for (int i = 0; i < 3; ++i) {
            for (int j = 0; j < 3; ++j) {
                A[i][j] = distr(eng);
                B[i][j] = distr(eng);
                C[i][j] = 0;
            }
        }
        // Compute C[] = A[] * B[]
        for (int i = 0; i < 3; ++i) {
            for (int j = 0; j < 3; ++j) {
                for (int k = 0; k < 3; ++k) {
                    C[i][j] += A[i][k] * B[k][j];
                }
            }
        }
    }
    return 0;
}

Notes:

1. The matrices are not necessarily sparse.
2. Strassen-like comments do not help here.
3. Let's try not to use the circumstantial observation that in this specific problem the matrices A[] and B[] can each be encoded as a single 64-bit integer. Think of what would happen for just slightly larger matrices.
4. Computation is single-threaded.

Related: Binary matrix multiplication bit twiddling hack and What is the optimal algorithm for the game 2048?

581

asked May 08 '16 10:05

Matsmath

1 Answer

The question you linked is about a matrix where every element is a single bit. For one-bit values a and b, a * b is exactly equivalent to a & b.
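
As a minimal sketch of that equivalence (the helper name mul_packed_bits is made up for illustration, and packing 64 one-bit elements per uint64_t is an assumption), a single AND multiplies 64 pairs of one-bit elements at once:

```cpp
#include <cstdint>

// Hypothetical helper: element-wise product of 64 one-bit values packed
// into each word. For single bits, a * b and a & b give the same result.
inline std::uint64_t mul_packed_bits(std::uint64_t a, std::uint64_t b) {
    return a & b;
}
```

This only covers the multiply step; summing the products is a separate problem once results need more than one bit.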

For adding 2-bit elements, it might be plausible (and faster than unpacking) to add basically from scratch, with XOR (carryless-add), then generate the carry with AND, shift, and mask off carry across element boundaries.
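
A rough sketch of such a from-scratch adder, assuming 32 two-bit fields packed into a uint64_t and wrap-around (mod 4) results; the name add2_packed and the layout are illustrative assumptions, not something from the linked question:

```cpp
#include <cstdint>

// Hypothetical helper: adds 32 independent 2-bit fields, each result mod 4.
inline std::uint64_t add2_packed(std::uint64_t a, std::uint64_t b) {
    // Mask selecting the low bit of every 2-bit field.
    constexpr std::uint64_t kLowBits = 0x5555555555555555ULL;
    // XOR is a carryless add of every bit position.
    const std::uint64_t sum = a ^ b;
    // A carry is generated where both low bits are set. Masking before the
    // shift keeps only low-bit carries, so nothing crosses a field boundary.
    const std::uint64_t carry = (a & b & kLowBits) << 1;
    // Fold the carry into the high bit of each field; a carry out of the
    // high bit is simply dropped.
    return sum ^ carry;
}
```

For example, add2_packed(0x6, 0x7) adds the fields (2, 1) and (3, 1) and gives 0x9, i.e. the fields (1, 2).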

A 3rd bit would require detecting when adding the carry produces yet another carry. I don't think it would be a win to emulate even a 3-bit adder or multiplier, compared to using SIMD. Without SIMD (i.e. in pure C with uint64_t) it might make sense. For add, you might try using a normal add and then try to undo the carry between element boundaries, instead of building an adder yourself out of XOR/AND/shift operations.
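
One well-known variant of the normal-add idea, sketched here for 4-bit fields in a uint64_t, clears the top bit of every field before adding so that a carry can never cross a boundary, rather than undoing it afterwards. The name add4_packed and the 4-bit field width are assumptions for illustration:

```cpp
#include <cstdint>

// Hypothetical helper: adds 16 independent 4-bit fields, each result mod 16.
inline std::uint64_t add4_packed(std::uint64_t a, std::uint64_t b) {
    // Mask selecting the top bit of every 4-bit field.
    constexpr std::uint64_t kTopBits = 0x8888888888888888ULL;
    // Ordinary add of the low 3 bits of each field. Any carry lands in the
    // (cleared) top bit of the same field and goes no further.
    const std::uint64_t low = (a & ~kTopBits) + (b & ~kTopBits);
    // Add the top bits carrylessly with XOR, so no carry leaves the field.
    return low ^ ((a ^ b) & kTopBits);
}
```

For example, add4_packed(0x9F, 0x11) gives 0xA0: the low field wraps from 0xF to 0x0 without disturbing the field next to it.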

packed vs. unpacked-to-bytes storage formats

If you have very many of these tiny matrices, storing them in memory in compressed form (e.g. packed 4-bit elements) can help with cache footprint / memory bandwidth. 4-bit elements are fairly easy to unpack so that each one occupies a separate byte element of a vector.
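
A scalar sketch of the packed storage format, assuming a 3x3 matrix stored row-major with two 4-bit elements per byte (low nibble first). The names pack4 and unpack4 and this layout are assumptions; a SIMD version would do the same job with shifts and masks on whole vectors:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kElems = 9;                       // 3x3 matrix
constexpr std::size_t kPackedBytes = (kElems + 1) / 2;  // two elements per byte

using Unpacked = std::array<std::uint8_t, kElems>;       // one element per byte
using Packed = std::array<std::uint8_t, kPackedBytes>;   // 4 bits per element

// Hypothetical helper: keeps the low 4 bits of each element.
inline Packed pack4(const Unpacked& m) {
    Packed p{};
    for (std::size_t i = 0; i < kElems; ++i) {
        const unsigned nibble = m[i] & 0x0Fu;
        // Even indices go to the low nibble, odd indices to the high nibble.
        p[i / 2] = static_cast<std::uint8_t>(p[i / 2] | (nibble << ((i % 2) * 4)));
    }
    return p;
}

// Hypothetical helper: inverse of pack4.
inline Unpacked unpack4(const Packed& p) {
    Unpacked m{};
    for (std::size_t i = 0; i < kElems; ++i) {
        m[i] = static_cast<std::uint8_t>((p[i / 2] >> ((i % 2) * 4)) & 0x0Fu);
    }
    return m;
}
```

With this layout a 3x3 matrix takes 5 bytes in storage instead of 9, and unpack4(pack4(m)) returns m as long as every element fits in 4 bits.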

Otherwise, store them with one matrix element per byte. From there, you can easily unpack them to 16bit or 32bit per element if needed, depending on what element sizes the target SIMD instruction set provides. You might keep some matrices in local variables in unpacked format to reuse across multiplies, but pack them back into 4bits per element for storage in an array.

Compilers suck at this with uint8_t in scalar C code for x86. See comments on @Richard's answer: gcc and clang both like to use mul r8 for uint8_t, which forces them to move data into eax (the implicit input/output for a one-operand multiply), rather than using imul r32, r32 and ignoring the garbage that leaves outside the low 8 bits of the destination register.

The uint8_t version actually runs slower than the uint16_t version, even though it has half the cache footprint.

You're probably going to get best results from some kind of SIMD.

Intel SSSE3 has a vector byte multiply (pmaddubsw), but only with adding of adjacent elements: it multiplies unsigned bytes by signed bytes and adds adjacent pairs of products into 16-bit results. Using it would require unpacking your matrix into a vector with some zeros between rows or something, so you don't get data from one row mixed with data from another row. Fortunately, pshufb can zero elements as well as copy them around.

More likely to be useful is SSE2 PMADDWD, if you unpack to each matrix element in a separate 16bit vector element. So given a row in one vector, and a transposed-column in another vector, pmaddwd (_mm_madd_epi16) is one horizontal add away from giving you the dot-product result you need for C[i][j].

Instead of doing each of those adds separately, you can probably pack multiple pmaddwd results into a single vector so you can store C[i][0..2] in one go.
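
A rough SSE2 sketch of the pmaddwd idea, not a tuned implementation. It assumes the matrices are already unpacked to 16-bit elements; the type aliases, the name matmul3x3_madd, and the zero-padded lane layout are all choices made for this illustration. It needs an x86 target for emmintrin.h:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics; x86/x86-64 only
#include <array>
#include <cstddef>
#include <cstdint>

using Mat16 = std::array<std::array<std::int16_t, 3>, 3>;  // unpacked inputs
using Mat32 = std::array<std::array<std::int32_t, 3>, 3>;  // 32-bit results

// Hypothetical helper: C = A * B using pmaddwd (_mm_madd_epi16).
inline Mat32 matmul3x3_madd(const Mat16& A, const Mat16& B) {
    // Columns of B laid out as rows, each padded to 4 lanes with a zero.
    const __m128i b01 = _mm_setr_epi16(B[0][0], B[1][0], B[2][0], 0,
                                       B[0][1], B[1][1], B[2][1], 0);
    const __m128i b2 = _mm_setr_epi16(B[0][2], B[1][2], B[2][2], 0, 0, 0, 0, 0);
    Mat32 C{};
    for (std::size_t i = 0; i < 3; ++i) {
        // Row i of A, duplicated so it lines up with both columns in b01.
        const __m128i a = _mm_setr_epi16(A[i][0], A[i][1], A[i][2], 0,
                                         A[i][0], A[i][1], A[i][2], 0);
        // Each 32-bit lane is the sum of two adjacent 16-bit products:
        // xj = a0*b0j + a1*b1j and yj = a2*b2j + 0.
        const __m128i p01 = _mm_madd_epi16(a, b01);  // [x0, y0, x1, y1]
        const __m128i p2 = _mm_madd_epi16(a, b2);    // [x2, y2, 0, 0]
        // Regroup so all x lanes and all y lanes line up, then add once.
        const __m128i s01 = _mm_shuffle_epi32(p01, _MM_SHUFFLE(3, 1, 2, 0));  // [x0, x1, y0, y1]
        const __m128i s2 = _mm_shuffle_epi32(p2, _MM_SHUFFLE(3, 1, 2, 0));    // [x2, 0, y2, 0]
        const __m128i sum = _mm_add_epi32(_mm_unpacklo_epi64(s01, s2),   // [x0, x1, x2, 0]
                                          _mm_unpackhi_epi64(s01, s2));  // [y0, y1, y2, 0]
        // One store covers C[i][0..2]; the fourth lane is padding.
        std::int32_t row[4];
        _mm_storeu_si128(reinterpret_cast<__m128i*>(row), sum);
        C[i][0] = row[0];
        C[i][1] = row[1];
        C[i][2] = row[2];
    }
    return C;
}
```

In a real loop over many matrices you would presumably keep the operands in vector form between multiplies instead of rebuilding them with _mm_setr_epi16 each time, since that set-up is likely to cost more than the two pmaddwd instructions per row.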

183

answered Nov 05 '22 17:11

Peter Cordes
