I want to find an as fast as possible way of multiplying two small boolean matrices, where small means, 8x8, 9x9 ... 16x16. This routine will be used a lot, so it needs to be very efficient, so please don't suggest that the straightforward solution should be fast enough.
For the special cases 8x8, and 16x16 I already have fairly efficient implementations, based on the solution found here, where we treat the entire matrix as an uint64_t
or uint64_t[4]
respectively. On my machine this is roughly 70-80 times faster than the straightforward implementation.
However, in the case of 8 < k < 16, I don't really know how I can leverage any reasonable representation in order to enable such clever tricks as above.
So basically, I'm open for any suggestions using any kind of representation (of the matrices) and function signature. You may assume that this targets either a 32-bit or 64-bit architecture (pick what best suits your suggestion)
Given two 4x4 matrices a= 0010,0100,1111,0001, b=1100,0001,0100,0100, one could first calculate the transpose b' = 1000,1011,0000,0100.
Then the resulting matrix M(i,j)=a x b mod 2 == popcount(a[i]&b[j]) & 1; // or parity
From that one can notice that the complexity only grows in n^2, as long as the bitvector fits a computer word.
This can be speed up for 8x8 matrices at least, provided that some special permutation and bit selection operations are available. One can iterate exactly N times with NxN bits in a vector. (so 16x16 is pretty much the limit).
Each step consists of accumulating i.e. Result(n+1) = Result(n) XOR A(n) .& B(n), where Result(0) = 0, A(n) is A <<< n, and '<<<' == columnwise rotation of elements and where B(n) copies diagonal elements from the matrix B:
a b c a e i d h c g b f
B= d e f B(0) = a e i B(1) = d h c B(2) = g b f
g h i a e i d h c g b f
And after thinking it a bit further, a better option is to ^^^
(row wise rotate) matrix B and select A(n) == column copied diagonals from A:
a b c a a a b b b c c c
A= d e f A(0) = e e e , A(1) = f f f, A(2) = d d d
g h i i i i g g g h h h
EDIT To benefit later readers, I'd propose the full solution for W<=16 bit matrix multiplications in portable C.
#include <stdint.h>
void matrix_mul_gf2(uint16_t *a, uint16_t *b, uint16_t *c)
{
// these arrays can be read in two successive xmm registers or in a single ymm
uint16_t D[16]; // Temporary
uint16_t C[16]={0}; // result
uint16_t B[16];
uint16_t A[16];
int i,j;
uint16_t top_row;
// Preprocess B (while reading from input)
// -- "un-tilt" the diagonal to bit position 0x8000
for (i=0;i<W;i++) B[i]=(b[i]<<i) | (b[i]>>(W-i));
for (i=0;i<W;i++) A[i]=a[i]; // Just read in matrix 'a'
// Loop W times
// Can be parallelized 4x with MMX, 8x with XMM and 16x with YMM instructions
for (j=0;j<W;j++) {
for (i=0;i<W;i++) D[i]=((int16_t)B[i])>>15; // copy sign bit to rows
for (i=0;i<W;i++) B[i]<<=1; // Prepare B for next round
for (i=0;i<W;i++) C[i]^= A[i]&D[i]; // Add the partial product
top_row=A[0];
for (i=0;i<W-1;i++) A[i]=A[i+1];
A[W-1]=top_row;
}
for (i=0;i<W;i++) c[i]=C[i]; // return result
}
How about padding it out to the next "clever" (e.g. 8 or 16) size, with all '1' on the diagonal?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With