Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How would you transpose a binary matrix?

I have binary matrices in C++ that I repesent with a vector of 8-bit values.

For example, the following matrix:

1 0 1 0 1 0 1
0 1 1 0 0 1 1
0 0 0 1 1 1 1

is represented as:

const uint8_t matrix[] = {
    0b01010101,
    0b00110011,
    0b00001111,
};

The reason why I'm doing it this way is because then computing the product of such a matrix and a 8-bit vector becomes really simple and efficient (just one bitwise AND and a parity computation, per row), which is much better than calculating each bit individually.

I'm now looking for an efficient way to transpose such a matrix, but I haven't been able to figure out how to do it without having to manually calculate each bit.

Just to clarify, for the above example, I'd like to get the following result from the transposition:

const uint8_t transposed[] = {
    0b00000000,
    0b00000100,
    0b00000010,
    0b00000110,
    0b00000001,
    0b00000101,
    0b00000011,
    0b00000111,
};

NOTE: I would prefer an algorithm that can calculate this with arbitrary-sized matrices but am also interested in algorithms that can only handle certain sizes.

like image 699
Venemo Avatar asked Jul 31 '15 09:07

Venemo


1 Answers

Here is the text of Jay Foad's email to me regarding fast Boolean matrix transpose:

The heart of the Boolean transpose algorithm is a function I'll call transpose8x8 which transposes an 8x8 Boolean matrix packed in a 64-bit word (in row major order from MSB to LSB). To transpose any rectangular matrix whose width and height are multiples of 8, break it down into 8x8 blocks, transpose each one individually and store them at the appropriate place in the output. To load an 8x8 block you have to load 8 individual bytes and shift and OR them into a 64-bit word. Same kinda thing for storing.

A plain C implementation of transpose8x8 relies on the fact that all the bits on any diagonal line parallel to the leading diagonal move the same distance up/down and left/right. For example, all the bits just above the leading diagonal have to move one place left and one place down, i.e. 7 bits to the right in the packed 64-bit word. This leads to an algorithm like this:

transpose8x8(word) {

  return
    (word & 0x0100000000000000) >> 49 // top right corner

  | (word & 0x0201000000000000) >> 42

  | ...

  | (word & 0x4020100804020100) >> 7 // just above diagonal

  | (word & 0x8040201008040201) // leading diagonal

  | (word & 0x0080402010080402) << 7 // just below diagonal

  | ...
  | (word & 0x0000000000008040) << 42

  | (word & 0x0000000000000080) << 49; // bottom left corner

}

This runs about 10x faster than the previous implementation, which copied each bit individually from the source byte in memory and merged it into the destination byte in memory.

Alternatively, if you have PDEP and PEXT instructions you can implement a perfect shuffle, and use that to do the transpose as mentioned in Hacker's Delight. This is significantly faster (but I don't have timings handy):

shuffle(word) {
    return pdep(word >> 32, 0xaaaaaaaaaaaaaaaa) | pdep(word, 0x5555555555555555);
} // outer perfect shuffle

transpose8x8(word) { return shuffle(shuffle(shuffle(word))); }

POWER's vgbbd instruction effectively implements the whole of transpose8x8 in a single instruction (and since it's a 128-bit vector instruction it does it twice, independently, on the low 64 bits and the high 64 bits). This gave about 15% speed-up over the plain C implementation. (Only 15% because, although the bit twiddling is much faster, the overall run time is now dominated by the time it takes to load 8 bytes and assemble them into the argument to transpose8x8, and to take the result and store it as 8 separate bytes.)

like image 148
Robert Bernecky Avatar answered Sep 28 '22 10:09

Robert Bernecky