Binary matrix multiplication bit twiddling hack

#Abstract

Hi, suppose you have two independent 8x8 binary matrices A and T, each packed into a 64-bit integer with one row per byte. T is stored in transposed form; using the transposed matrix lets the multiplication operate on T's rows rather than its columns, which is very convenient for bitwise arithmetic. You want to multiply these matrices, with one twist: the result is itself truncated to a 64-bit binary matrix, so a cell of the result contains 1 if the ordinary product yields a value of 1 or greater in that cell, and 0 otherwise.
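
One way to write this rule down, as a sketch: let A_{ik} be bit k of row i of A and T_{jk} be bit k of row j of the transposed matrix T (this indexing is an assumption; the question does not fix one). Each result bit is then an OR of eight ANDs, which is the ordinary dot product clamped to 1:

```latex
R_{ij} \;=\; \bigvee_{k=0}^{7} \left( A_{ik} \wedge T_{jk} \right)
       \;=\; \min\!\left( 1,\; \sum_{k=0}^{7} A_{ik}\, T_{jk} \right),
\qquad 0 \le i, j \le 7 .
```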

#Example

   A        T
00000001 01111101 
01010100 01100101 
10010111 00010100 
10110000 00011000 <-- This matrix is transposed
11000100 00111110 
10000011 10101111 
11110101 11000100 
10100000 01100010

Binary and traditional multiplication results:

 Binary  Traditional
11000100  11000100
11111111  32212121
11111111  32213421
11111111  21112211
11101111  22101231
11001111  11001311
11111111  54213432
11001111  11001211

#Question

How do you multiply these matrices in the way described above in the most efficient manner?

#P.S

I was trying to take advantage of the bitwise AND (i.e. the & operator) instead of multiplying separate bits. To do that, I had to prepare the data by copying one row of T into every byte of a 64-bit value:

ulong u;

u = T & 0xFF;
u = (u << 00) + (u << 08) + (u << 16) + (u << 24)
  + (u << 32) + (u << 40) + (u << 48) + (u << 56);

Now a bitwise AND of the two integers A and u yields the following:

   A        u        R        C
00000001 01111101 00000001    1
01010100 01111101 01010100    3
10010111 01111101 00010101    3
10110000 01111101 00110000    2
11000100 01111101 01000100    2
10000011 01111101 00000001    1
11110101 01111101 01110101    5
10100000 01111101 00100000    1

In the example above, R contains the result of ANDing the bits of A with the bits of u; to obtain the final value we must sum all the bits in each row. Notice that column C contains the same values as the first column of the Traditional multiplication result above. The problem is that at this step I have to operate on separate bits, which I think is a sub-optimal approach. I've read through http://graphics.stanford.edu/~seander/bithacks.html looking for a way to do this in parallel, but no luck. If anyone has an idea of how to "flatten" and "merge" the values in the R column into the resulting 64-bit matrix, I would appreciate it if you dropped me a few lines.
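
For reference, the broadcast-and-AND step described above can be sketched in C. This is only an illustration: the function names `broadcast_low_byte` and `byte_popcount` are made up here, and the packing (row 0 in the lowest byte) is an assumption taken from the `T & 0xFF` line. Multiplying by 0x0101010101010101 is an equivalent way to write the eight shifted additions.

```c
#include <stdint.h>

/* Copy the low byte of t into all eight bytes of a 64-bit word.
 * Equivalent to summing the byte shifted by 0, 8, ..., 56 bits. */
uint64_t broadcast_low_byte(uint64_t t) {
    return (t & 0xFF) * UINT64_C(0x0101010101010101);
}

/* Count the set bits in byte `row` of x (column C in the table above).
 * This is the bit-by-bit step the question wants to avoid. */
int byte_popcount(uint64_t x, int row) {
    unsigned b = (unsigned)((x >> (8 * row)) & 0xFF);
    int n = 0;
    while (b != 0) {
        n += (int)(b & 1u);
        b >>= 1;
    }
    return n;
}
```

With the example matrices packed as A = 0xA0F583C4B0975401 and T = 0x62C4AF3E1814657D (row 0 in the lowest byte), calling `byte_popcount(A & broadcast_low_byte(T), row)` for rows 0 to 7 gives 1, 3, 3, 2, 2, 1, 5, 1, matching column C.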

Thank you,

#Edit

With many thanks to David Eisenstat, the final algorithm looks like this:

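var A = ...;
var T = ...; // T == transpose(t), where t is the original matrix; the algorithm works with the transposed matrix

ulong U;
ulong r = 0; // accumulates the resulting matrix
var D = 0x8040201008040201UL; // diagonal mask: bit i of byte i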

U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D); T = (T << 8) | (T >> 56); D = (D << 8) | (D >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D); T = (T << 8) | (T >> 56); D = (D << 8) | (D >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D); T = (T << 8) | (T >> 56); D = (D << 8) | (D >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D); T = (T << 8) | (T >> 56); D = (D << 8) | (D >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D); T = (T << 8) | (T >> 56); D = (D << 8) | (D >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D); T = (T << 8) | (T >> 56); D = (D << 8) | (D >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D); T = (T << 8) | (T >> 56); D = (D << 8) | (D >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D);

The following piece of code:

    public static void Main (string[] args){
        ulong U;
        var Random = new Xor128 ();

        var timer = DateTime.Now;

        var A = Random.As<IUniformRandom<UInt64>>().Evaluate();
        var T = Random.As<IUniformRandom<UInt64>>().Evaluate();

        var steps = 10000000;

        for (var i = 0; i < steps; i++) {
            ulong r = 0;

            var d = 0x8040201008040201UL;

            U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d); T = (T << 8) | (T >> 56); d = (d << 8) | (d >> 56);
            U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d); T = (T << 8) | (T >> 56); d = (d << 8) | (d >> 56);
            U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d); T = (T << 8) | (T >> 56); d = (d << 8) | (d >> 56);
            U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d); T = (T << 8) | (T >> 56); d = (d << 8) | (d >> 56);
            U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d); T = (T << 8) | (T >> 56); d = (d << 8) | (d >> 56);
            U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d); T = (T << 8) | (T >> 56); d = (d << 8) | (d >> 56);
            U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d); T = (T << 8) | (T >> 56); d = (d << 8) | (d >> 56);
            U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d);
        }

        Console.WriteLine (DateTime.Now - timer);


        var m1 = new Int32[8,8];
        var m2 = new Int32[8,8];
        var m3 = new Int32[8,8];

        for (int row = 0; row < 8; row++) {
            for (int col = 0; col < 8; col++) {
                m1 [row, col] = Random.As<IUniformRandom<Int32>> ().Evaluate(0, 1);
                m2 [row, col] = Random.As<IUniformRandom<Int32>> ().Evaluate(0, 1);
                m3 [row, col] = Random.As<IUniformRandom<Int32>> ().Evaluate(0, 1);
            }
        }

        timer = DateTime.Now;

        for (int i = 0; i < steps; i++) {
            for (int row = 0; row < 8; row++) {
                for (int col = 0; col < 8; col++) {
                    var sum = 0;

                    for (int temp = 0; temp < 8; temp++) {
                        sum += m1 [row, temp] * m2 [temp, col];
                    }

                    m3 [row, col] = sum;
                }
            }
        }

        Console.WriteLine (DateTime.Now - timer);
    }

prints the following timings (bit-twiddling version first, naive triple loop second):

00:00:02.4035870
00:00:57.5147150

That is roughly a 24x performance improvement under Mac OS X / Mono. Thanks, everyone!

asked Aug 26 '13 15:08

Lu4

3 Answers

I'm not sure about most efficient, but here's something to try. The following sequence of instructions computes the main diagonal of the product A * T'. Rotate both T and D by 8 bits and repeat for 7 more iterations.

// uint64_t A, T;
uint64_t D = UINT64_C(0x8040201008040201);
uint64_t P = A & T;
// test whether each byte is nonzero
P |= P >> 1;
P |= P >> 2;
P |= P >> 4;
P &= UINT64_C(0x0101010101010101);
// fill each nonzero byte with ones
P *= 255;  // or P = (P << 8) - P;
// leave only the current diagonal
P &= D;
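
Putting the diagonal step and the eight rotations together, a complete routine might look like the sketch below. The names `rotl8` and `bmm8x8` are made up for this example, and it assumes row i of each matrix is stored in byte i with column j in bit j of that byte; other packings would need a different diagonal mask.

```c
#include <stdint.h>

/* Rotate a 64-bit word left by one byte. */
uint64_t rotl8(uint64_t x) {
    return (x << 8) | (x >> 56);
}

/* Boolean product of two 8x8 bit matrices.
 * Assumed packing: row i lives in byte i, column j in bit j of that byte.
 * T must already hold the transpose of the right-hand matrix. */
uint64_t bmm8x8(uint64_t A, uint64_t T) {
    uint64_t D = UINT64_C(0x8040201008040201); /* bit i of byte i */
    uint64_t R = 0;
    for (int k = 0; k < 8; k++) {
        uint64_t P = A & T;
        /* collect the OR of each byte into its lowest bit */
        P |= P >> 1;
        P |= P >> 2;
        P |= P >> 4;
        P &= UINT64_C(0x0101010101010101);
        /* fill each nonzero byte with ones */
        P *= 255;
        /* keep only the diagonal handled in this iteration */
        R |= P & D;
        /* line up row i of A with the next row of T */
        T = rotl8(T);
        D = rotl8(D);
    }
    return R;
}
```

In iteration k, byte i of the rotated T holds row (i - k) mod 8, and the rotated mask selects bit (i - k) mod 8 of byte i, so after eight iterations every cell has been written exactly once. A compiler may well unroll the loop on its own; unrolling by hand is an optimization choice, not a requirement.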

answered Oct 13 '22 00:10

David Eisenstat

If you are looking for a way to do dense matrix multiplication in parallel, partition your result matrix into blocks and compute each block in parallel.

http://en.wikipedia.org/wiki/Block_matrix#Block_matrix_multiplication
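
As a sketch of how the block identity from the linked article carries over to Boolean matrices, consider a 16x16 matrix split into four 8x8 blocks (the size and the block names are illustrative assumptions). Each block product is a Boolean 8x8 product and each sum becomes a bitwise OR:

```latex
\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
\begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}
=
\begin{pmatrix}
A_{11}B_{11} \vee A_{12}B_{21} & A_{11}B_{12} \vee A_{12}B_{22} \\
A_{21}B_{11} \vee A_{22}B_{21} & A_{21}B_{12} \vee A_{22}B_{22}
\end{pmatrix}
```

The four result blocks do not depend on each other, so each could be computed on a separate thread.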

answered Oct 13 '22 01:10

evenex_code

It is not clear what data structure you are using, which language (yes, I know you said 'any language'), or what you are trying to optimize (speed? memory?). All of these may have a profound impact on your solution.

Some examples:

- Say this was C/C++, and your matrices are contiguous bits in memory. Each row/column maps to a UINT8. In this case, multiplying a row by a column reduces to an 8-bit bitwise & and a check that the result is greater than 0 (no need to sum the bits). This takes two processor instructions.
- If you are forced to do bit-by-bit operations, use the bitwise 'or' (|) instead of +. Some languages may evaluate this lazily, stopping at the first '1' they encounter.
- If you can multi-thread, you could speed up the calculations.
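
The first suggestion can be sketched as a straightforward reference routine. The name `bmm8x8_naive` is made up for this example, and it assumes both matrices are packed into a 64-bit word with row i in byte i and column j in bit j, with T already transposed as in the question.

```c
#include <stdint.h>

/* Reference Boolean product: cell (i, j) is 1 when row i of A and
 * row j of T share at least one set bit.
 * Assumed packing: row i in byte i, column j in bit j of that byte. */
uint64_t bmm8x8_naive(uint64_t A, uint64_t T) {
    uint64_t R = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t a = (uint8_t)(A >> (8 * i));
        for (int j = 0; j < 8; j++) {
            uint8_t t = (uint8_t)(T >> (8 * j));
            /* one AND plus a nonzero test per cell, no bit counting */
            if ((a & t) != 0)
                R |= UINT64_C(1) << (8 * i + j);
        }
    }
    return R;
}
```

It performs 64 AND-and-test steps per product, which makes it a convenient baseline for checking a faster bit-parallel version.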

BTW, I'm assuming you have a lot of matrices to process; otherwise I would use direct, readable code. My guess is that even with a lot of matrices, the gain in performance would be negligible.

answered Oct 13 '22 00:10

bavaza

Tags: algorithm, bit-manipulation, matrix, matrix-multiplication, pseudocode
