
Binary vectors and matrix manipulation in C

I'm trying to implement in C a data structure that would allow me to efficiently manipulate a **binary** matrix (containing only 1s and 0s). I'll explain what operations I have to apply to this matrix, and would like to know what the best possible data structure to use would be.

The operations are done in the field F_2 (which means 1+1 = 0; the other operations remain unchanged). I have one k*n matrix (k < n) called H. At most, k = 2325 and n = 3009.

The operations that I will have to do over this matrix are:

I will partially diagonalize it using only row swaps and row additions. Once that is done, I will no longer use row operations, and will perform a lot (!) of column additions over this matrix (by "a lot" I mean about ((n-k)/2)³ column additions).

Data structure I was considering for the matrix:

For the matrix coefficients, I was thinking about storing sequences of multiple bits at once in one single unsigned integer. For instance, I could store the bit sequence (11001011) as the uint8_t value 203 (converting from binary to decimal).

  • Is it a good idea?

If I do so, I have two options:

I could use uint16_t or uint64_t coefficients to split my matrix H into many 4*4 or 8*8 submatrices.

  • Is this a good option (in terms of time efficiency), and if it is, is it better to use uint16_t or uint64_t?

Otherwise, I was thinking about storing every row in multiple uint32_t or uint64_t values, then performing my partial diagonalization. Next, I would switch to a structure that encodes the matrix as n column vectors to handle the remaining operations.

  • Do you think this is more efficient?

Whatever method I use, I will have to efficiently access the n-th bit of an unsigned integer (uint16_t, uint32_t, or uint64_t). How do I do that?

asked Jul 31 '14 by Celerio

1 Answer

For best performance, use an array of row pointers for the row swaps and row additions. Use <stdint.h>, and a fast unsigned integer type of the minimum supported word size -- I recommend uint_fast32_t, unless you intend to run this on 16- or 8-bit processors.

When all the row swaps and row additions are done, transpose the array. Although this operation is "slow", the following column operations will be fast enough to more than offset the transpose cost.

Consider the following:

#include <stdint.h>
#include <limits.h>

typedef uint_fast32_t  word;
#define WORD_BITS      (CHAR_BIT * sizeof (word))

typedef struct {
    int    rows;  /* Number of row vectors */
    int    cols;  /* Number of defined bits in each row */
    int    words; /* Number of words per row vector */
    word **row;   /* Array of pointers */
} row_matrix;

typedef struct {
    int    rows;  /* Number of defined bits in each column */
    int    cols;  /* Number of col vectors */
    int    words; /* Number of words per col vector */
    word **col;   /* Array of pointers */
} col_matrix;

Although you could use a single type to describe the two matrix forms, using separate types makes the code and functions easier to maintain. You'll end up having some duplicated code, but that's a minor issue compared to having clear, intuitive types.

On 32-bit systems, uint_fast32_t is typically a 32-bit type. On 64-bit systems, it is typically 64-bit. The WORD_BITS macro expands to the number of bits in a word -- it is NOT always 32!
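For illustration, here is one way a row_matrix could be allocated -- the function name row_matrix_new and the error handling are just my own sketch, not anything canonical:

#include <stdlib.h>

/* Sketch: allocate a rows-by-cols row_matrix with all bits zero.
   Returns NULL on allocation failure. */
row_matrix *row_matrix_new(int rows, int cols)
{
    row_matrix *rm;
    int         r;

    rm = malloc(sizeof *rm);
    if (!rm)
        return NULL;

    rm->rows  = rows;
    rm->cols  = cols;
    rm->words = (int)((cols + WORD_BITS - 1) / WORD_BITS); /* Round up */

    rm->row = malloc(rows * sizeof *rm->row);
    if (!rm->row) {
        free(rm);
        return NULL;
    }

    for (r = 0; r < rows; r++) {
        rm->row[r] = calloc(rm->words, sizeof (word)); /* Zeroed */
        if (!rm->row[r]) {
            while (r-- > 0)
                free(rm->row[r]);
            free(rm->row);
            free(rm);
            return NULL;
        }
    }
    return rm;
}

Using calloc() means every bit starts out cleared, so you only need to set the 1 bits when filling in H.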

The easiest way to number the bits is to designate the leftmost bit in the matrix as bit 0, packing the bits starting from the least significant bit of each word. If you have row_matrix *rm, then the bit at row row, column col is

!!(rm->row[row][col / WORD_BITS] & ((word)1U << (col % WORD_BITS)))

The !! is the not-not operator: if the argument is nonzero, it yields 1, otherwise it yields 0. Because we mask a single bit from the word, the "bit is set" value would otherwise be some power of two (1, 2, 4, 8, 16, 32, 64, and so on).

To set the bit, use

rm->row[row][col / WORD_BITS] |= (word)1U << (col % WORD_BITS);

To clear a bit, you binary-AND with a mask in which every bit except the target bit is 1. This is easy to construct using the binary NOT operator ~:

rm->row[row][col / WORD_BITS] &= ~((word)1U << (col % WORD_BITS));

The corresponding operations for the col_matrix *cm are

!!(cm->col[col][row / WORD_BITS] & ((word)1U << (row % WORD_BITS)))
cm->col[col][row / WORD_BITS] |= (word)1U << (row % WORD_BITS);
cm->col[col][row / WORD_BITS] &= ~((word)1U << (row % WORD_BITS));
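If you find the raw expressions noisy, you can wrap them in tiny static inline accessors -- the names below are just mine, for illustration:

/* Sketch: convenience accessors for row_matrix bits. */
static inline int rm_get(const row_matrix *rm, int row, int col)
{
    return !!(rm->row[row][col / WORD_BITS] & ((word)1U << (col % WORD_BITS)));
}

static inline void rm_set(row_matrix *rm, int row, int col)
{
    rm->row[row][col / WORD_BITS] |= (word)1U << (col % WORD_BITS);
}

static inline void rm_clear(row_matrix *rm, int row, int col)
{
    rm->row[row][col / WORD_BITS] &= ~((word)1U << (col % WORD_BITS));
}

Any decent compiler will inline these, so they cost nothing over the raw expressions.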

Although division / and modulus (or remainder) % are generally slow-ish (compared to addition, subtraction, and even multiplication), here WORD_BITS will be a power-of-two compile-time constant on all widely used architectures. All compilers I know of will turn the above into fast bit shifts and binary-AND operations. For instance, with 32-bit words, col / WORD_BITS compiles down to col >> 5, and col % WORD_BITS to col & 31.

To add row srcrow to row dstrow, you simply do a binary exclusive-or on all the words:

{
    const word *const src = rm->row[srcrow];
    word *const       dst = rm->row[dstrow];
    int               w = rm->words;
    while (w-->0)
        dst[w] ^= src[w];
}

and analogously, for the column matrix,

{
    const word *const src = cm->col[srccol];
    word *const       dst = cm->col[dstcol];
    int               w = cm->words;
    while (w-->0)
        dst[w] ^= src[w];
}

Note that if you combine more than two rows, you can do so very efficiently; it'll be MUCH faster than doing the additions consecutively. Intel and AMD CPUs are very good at predicting the above access pattern, so you can just use more than one source row/column per pass. Also, the destination does not have to participate in the result, although if I'm guessing correctly which algorithm you're implementing, you'll want it to.
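For example, here is a sketch of adding two source rows into a destination in a single pass (srcrow1 and srcrow2 are hypothetical indices, assumed distinct from dstrow):

{
    const word *const src1 = rm->row[srcrow1];
    const word *const src2 = rm->row[srcrow2];
    word *const       dst  = rm->row[dstrow];
    int               w = rm->words;
    while (w-->0)
        dst[w] ^= src1[w] ^ src2[w]; /* Both rows added in one pass */
}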

If you know the target architectures have SSE2 or better, or even AVX, you can use the emmintrin.h or immintrin.h header files, respectively, for compiler built-in types and operators that allow you to XOR 128 bits and 256 bits, respectively, at once; sometimes giving you quite a bit of a boost.

Since the vector types require what the C standards call "excess alignment", you'll then also need to include mm_malloc.h, and use _mm_malloc() and _mm_free() to allocate the row/column vectors for the word data -- and obviously round words up so you can access the row/column as a suitable integer word type (__m128i for SSE*, __m256i for AVX).
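As a rough illustration, an SSE2 version of the row addition might look like this -- it assumes the row data was allocated with _mm_malloc() at 16-byte alignment, and that rm->words was rounded up so each row is a whole number of 16-byte vectors:

#include <emmintrin.h>

{
    const __m128i *src = (const __m128i *)rm->row[srcrow];
    __m128i       *dst = (__m128i *)rm->row[dstrow];
    /* Number of 16-byte vectors per row; assumes the row length in
       bytes is a multiple of sizeof (__m128i). */
    int v = (int)(rm->words * sizeof (word) / sizeof (__m128i));
    while (v-->0)
        _mm_store_si128(&dst[v],
                        _mm_xor_si128(_mm_load_si128(&dst[v]),
                                      _mm_load_si128(&src[v])));
}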

Personally, I always implement the unvectorized version first, then some "nasty" test cases for unit testing, and only then see about vectorizing it. This has the benefit that you can give the unvectorized version as a preliminary version for those who will be using it, and you can compare the test case results between the vectorized and unvectorized cases, to see if one or the other has a bug in it.

The transpose operation is quite straightforward, although I do recommend using a triple loop: the innermost loop iterating over the bits in a single word. Also, you'll probably want to check which order -- row or column major -- works best for the outer loop; depending on the matrix size, you may see a huge difference. (This is due to cache behaviour: you want the CPU to be able to predict the access pattern, and not have to reload the same cachelines. In the best case, on recent AMD and Intel x86-64 processors, you can get near cache speeds if both matrices fit in the cache.)
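Here is a rough, unvectorized transpose sketch -- the function name is mine, and it assumes cm was allocated with cm->rows == rm->rows and cm->cols == rm->cols, with all words cleared to zero beforehand:

/* Sketch: scatter the set bits of rm into a zeroed col_matrix cm. */
void transpose(const row_matrix *rm, col_matrix *cm)
{
    int r, w, b;
    for (r = 0; r < rm->rows; r++)
        for (w = 0; w < rm->words; w++) {
            const word bits = rm->row[r][w];
            if (!bits)
                continue; /* Skip all-zero words */
            for (b = 0; b < (int)WORD_BITS; b++)
                if (bits & ((word)1U << b)) {
                    const int c = w * (int)WORD_BITS + b;
                    if (c < rm->cols)
                        cm->col[c][r / WORD_BITS] |= (word)1U << (r % WORD_BITS);
                }
        }
}

Swapping the roles of the loops (iterating over cm and gathering bits from rm) gives the other traversal order worth benchmarking.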

All of the above can be implemented in a single header file -- even including the vectorized versions if the target architecture supports SSE2/AVX --, so it should not be too difficult to implement.

Questions?

answered Sep 20 '22 by Nominal Animal