Minimal representation of sparse matrix

I have a large binary sparse matrix (any cell can hold 0 or 1 as its value). From time to time I want to take a snapshot of the entire matrix. The snapshot must be as minimal as possible.

The matrix represents a 2D map and events that take place in an area, so a snapshot is more likely to look like Example A than like Example B (they both have the same number of "1"s), although I need to support both examples in the algorithm.

Example A:
000000000000
000000011000
001100111100
001111111100
000000111100
000001111100
000000000000

Example B:
010010010010
001000001000
010010100100
000100101010
001000010010
010010001100
001000010000

Since the data can vary from a single "1" cell to 100% of the cells being "1" (in very rare cases), I think I need to use more than one algorithm, and when loading the data, to load it with the same algorithm that stored it.

For example, when there is only one "1" cell I will store just its index (plus an identifier for the "index" algorithm), and when 99% of the matrix is "1" I will store it as a bitmap (plus an identifier for the "bitmap" algorithm).

So I get to a general algorithm like this:

  1. For every representation algorithm, represent the matrix
  2. Select the smallest representation
  3. Store the data with the smallest representation
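The three steps above can be sketched directly. A minimal illustration in Python; encode_bitmap and encode_index are hypothetical stand-ins for whatever representations you actually implement:

```python
def encode_bitmap(matrix):
    # Algorithm 0: pack one bit per cell, row-major, zero-padded to bytes.
    bits = [cell for row in matrix for cell in row]
    out = bytearray()
    for i in range(0, len(bits), 8):
        chunk = bits[i:i + 8]
        byte = 0
        for b in chunk:
            byte = (byte << 1) | b
        out.append(byte << (8 - len(chunk)))
    return bytes(out)

def encode_index(matrix):
    # Algorithm 1: one (row, col) byte pair per 1-cell (toy sizes only).
    return bytes(b for r, row in enumerate(matrix)
                   for c, v in enumerate(row) if v
                   for b in (r, c))

def snapshot(matrix):
    # Try every representation, prepend its one-byte identifier,
    # and keep whichever candidate came out smallest.
    encoders = {0: encode_bitmap, 1: encode_index}
    return min((bytes([i]) + f(matrix) for i, f in encoders.items()),
               key=len)
```

On a 7x12 matrix with a single 1, the index form wins (3 bytes vs. 12); on an all-ones matrix, the bitmap wins.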

My questions

  1. What algorithms can I use - other than storing Index/bitmap?
  2. Is there a good reference that handles this issue?
  3. How can I prove that my solution is the minimal possible?

Bottom line: how can I store a bitmap matrix in a minimal way?

EDIT Use case: I have a sparse matrix that I need to transfer over a very low-bandwidth medium, so the transmitted matrix should contain as few bits as possible, assuming that the computation power on both sides of the medium is strong.

yossico asked Aug 02 '15 11:08



2 Answers

Data compression is a large field (you can start here), and there's no definitive answer to your question. If we knew how to compress data at an optimal rate, there wouldn't be new compression algorithms every year ;)

That being said, what you propose is a good idea: select the best algorithm out of a collection and identify it in the header. In fact, most compression formats use this kind of scheme.

To answer your questions:

What algorithms:

  • Use existing formats: PNG, GIF, or 7zip come to mind.
  • No compression: a bitmap, obviously, but just to make sure we're speaking of the same thing, you need to encode 1 bit (and not 1 byte) per element. Bit access is not easily available in most languages (most bool/boolean types are stored on a full byte), so you must use bitwise operations, specialized language features such as bitfields/bitsets, or libraries. For example in C++, std::vector<bool> implements a true bitmap (a plain bool[] does not).
  • Entropy coding: the idea is to encode the most probable configurations using less storage than the less probable ones.
    • Sparse storage, what you called index; but contrary to what you implied, it's useful not only when there are lots of zeros but also when there are lots of ones (just negate the matrix!). The cases of a single 0 and a single 1 are symmetric and should be handled the same way. The worst-case scenario for sparse storage is having as many ones as zeros.
    • Block coding, meaning you can store 1111 in fewer than 4 bits if you know it's a likely pattern. But this means that another (less likely) 4-bit pattern will be stored using more than 4 bits, so be careful in selecting your patterns. There are many ways to do this: the words can be fixed-size or variable-size, and the encoding can be static or data-dependent. Some examples: RLE, Huffman, and the LZ variants (the dictionary-based algorithms used in your favorite compression program).
    • Arithmetic coding is based on the same idea of adapting the storage space per symbol/block to its probability, but it encodes the whole data in one pass instead of splitting it into blocks, which often leads to better encoding.
    • Since your data is two-dimensional, it makes sense to process it in 2D blocks instead of 1D blocks; for example, you could encode 8x8 tiles. JPEG, for example, does that.
  • Data preparation: before applying your entropy-coding algorithm (the one that actually reduces the data size), it's often worthwhile to apply a bijection (an invertible transform that doesn't reduce the data size by itself) to encode your knowledge of the data model. In your case, you say your data represents events, and those are often spatially clustered: is there a way to represent them as a list of events plus the way they propagated? You could also use image-compression transforms such as the Fourier transform or some kind of wavelet transform. Or you could simply use the contour matrix (one when the value changes between two cells, zero when it is constant), which may increase the sparsity of your matrix (it does in your Example A).
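To illustrate that last point, here is a sketch in Python (my naming) of the horizontal contour transform and its inverse; on Example A it cuts the number of ones from 25 to 12, i.e. it increases sparsity exactly as claimed:

```python
def contour(matrix):
    # Mark horizontal value changes: each cell XORed with its left
    # neighbour (the leftmost cell is kept as-is).  Long runs of equal
    # values collapse to their two boundaries.
    return [[row[c] ^ (row[c - 1] if c else 0) for c in range(len(row))]
            for row in matrix]

def uncontour(matrix):
    # Inverse transform: running XOR along each row.
    out = []
    for row in matrix:
        acc, rebuilt = 0, []
        for cell in row:
            acc ^= cell
            rebuilt.append(acc)
        out.append(rebuilt)
    return out
```

The transform is a bijection: it costs nothing in size, and any of the entropy coders above can then be applied to the sparser result.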

Minimality proof:

There's no generally optimal compression algorithm, since compression really depends on the probability distribution of your data. Indeed, if you know your matrix is always the same, you don't even need to encode it. If you know it's one of two possible matrices, just one bit is enough to encode which one it is.

In the general case, Shannon's entropy estimates the theoretical minimum number of bits needed to encode a message:

min = E( -log2(P(message)) )

where E is the expectation over all possible messages and P is the probability function. In practice though, you don't know P, so the best you can do is beat the previous best algorithm ;)
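For intuition, here is a sketch of that bound under the simplest possible model (my simplification): every cell is an independent coin flip with known density p.

```python
from math import log2

def iid_bound_bits(ones, cells):
    # Shannon entropy of a Bernoulli(p) cell, summed over all cells.
    # This is a lower bound only under the i.i.d. model; clustered
    # data has lower true entropy than this model predicts.
    p = ones / cells
    if p in (0.0, 1.0):
        return 0.0          # a constant matrix needs no bits at all
    return cells * (-p * log2(p) - (1 - p) * log2(1 - p))
```

For Example A (25 ones in 84 cells) this yields roughly 74 bits, which is more than the 45-bit block encoding worked out below; there is no contradiction, because the i.i.d. model simply ignores the 2D clustering that the block scheme exploits.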

In general though, the more you try to compress your data, the more expensive your program becomes, both in terms of run-time resources (CPU cycles and memory) and development effort.

An example

Just to put it in practice on your Example A, to give you inspiration -- even if it's a very bad idea to design from one example, because a single example hardly gives you representative probabilities!

Initial data (I will always omit the size header, as it will remain unchanged) -- bitmap, 84 bits (25 ones):

000000000000
000000011000
001100111100
001111111100
000000111100
000001111100
000000000000

In your index scheme, you output a list of positions of ones. If you generalize, you can work with lists of position + pattern. For example, a pattern may be the number of consecutive ones, so you don't need to output multiple positions for blocks of ones. Let's go further, and say we encode 2D patterns, defined by:

  • a 3-bit size (0 to 7)
  • a 1-bit shape:

    • 0 means a row - for example 1111 is a row of size 4 ;
    • 1 means a square - for example:

      11
      11
      

      is a square of size 2.

Then let's take relative instead of absolute positions, meaning each position encodes how much you advanced since the previous position. For your example, I chose 5-bit positions. Here is how decoding works:

-- block #1
00001 position +1
      absolute position from top-left, since this is the first block
001   pattern-size 1
0     pattern-shape is row
      for size 1, square and row are the same
      (which means my coding isn't optimal)
-- block #2
00100 position +4
      relative to last position 1, this means position 5
010   pattern-size 2
1     pattern-shape is square
-- decoded output below
0100011000...
0000011000...
0000000000...
.............

So with this scheme you can encode your data using 45 bits:

10011 (position [0]+19 from top-left)
010 0 (size: 2, shape: row)
00111 (position [19]+7)
010 1 (size: 2, shape: square)
00100 (position +4)
100 1 (size: 4, shape: square)
01000 (position +8)
100 0 (size: 4, shape: row)
11011 (position +27)
001 0 (size: 1, shape: row)

Notes: Usually you'd want to store a header with the number of blocks (5 in this case), but you can deduce it from the file/network-payload size. Also, in this example we only need 3 pattern sizes (1, 2, 4), so we could store the exponent on two bits (size = 2^exponent), making the payload 40 bits, but that makes the scheme too specialized. Finally, it's necessary to have at least one meaningless shape (here there are two: 000/0 and 000/1) for cases where the position field doesn't have enough bits to encode a large "hole" between patterns. For example, this 3 x 20 matrix:

10000000000000000000
00000000000000000000
00000000000000000001

has ones at positions 0 and 59. Since 59 doesn't fit in a 5-bit jump (the maximum is 31), we have to jump twice, and we'll encode it as:

00000 (+0)  001 0 (a single 1)
11111 (+31) 000 0 (nothing)
11100 (+28) 001 0 (a single 1)
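A sketch decoder for this position/size/shape scheme (Python, my naming). Note that with 0-indexed absolute positions, as in the block-decoding walkthrough above, the deltas that reproduce Example A are 19, 7, 4, 8, 27:

```python
def decode(bits, width, height):
    # Stream of 9-bit blocks: 5-bit position delta, 3-bit size, 1-bit shape.
    grid = [[0] * width for _ in range(height)]
    pos = 0
    for i in range(0, len(bits) - len(bits) % 9, 9):
        pos += int(bits[i:i + 5], 2)        # relative jump
        size = int(bits[i + 5:i + 8], 2)
        square = bits[i + 8] == '1'         # 0 = row, 1 = square
        if size == 0:
            continue                        # meaningless shape: pure jump
        r, c = divmod(pos, width)
        for dr in range(size if square else 1):
            for dc in range(size):
                grid[r + dr][c + dc] = 1
    return grid
```

Feeding it the 45-bit payload for Example A, or the 27-bit double-jump payload for the 3 x 20 matrix, reconstructs the original bitmaps.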
Antoine answered Sep 27 '22 23:09


You've mentioned some obvious ways to store these - bitmaps and indexes of 1 cells. The index idea could easily be extended by encoding the identifier for the 'indices' method and number of 1 cells, followed by that many pairs of coordinates. You could even try compressing by grouping by row (or column) like

rowNumber colNumber1 colNumber2 ... -1

using -1 or some other flag value to indicate the end of a row. This could save a huge amount of space for large matrices with only a few entries. You'd need to store the matrix size as well.

For Example A, you'd get (using 0 indexing, whitespace just for readability)

7 12 //dimensions
1 7 8 -1 
2 2 3 6 7 8 9 -1
3 2 3 4 5 6 7 8 9 -1
4 6 7 8 9 -1
5 5 6 7 8 9 -1
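A sketch of this encoding in Python (my naming; -1 is the end-of-row flag, and all-zero rows are simply omitted):

```python
def encode_by_row(matrix):
    # Header: the two dimensions.  Then, for each non-empty row: its
    # index, the column of every 1-cell, and a -1 end-of-row flag.
    out = [len(matrix), len(matrix[0])]
    for r, row in enumerate(matrix):
        cols = [c for c, v in enumerate(row) if v]
        if cols:
            out += [r] + cols + [-1]
    return out
```

Run on Example A, it produces exactly the number list shown above.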

For your case, another example of a way to store it might be as follows -- you'd have to experiment and see how successful it is at 'compressing' the information. The idea for the algorithm comes from observing that in your Example A, almost all rows have just a single large connected section of 1s, surrounded by 0s.

Assuming your matrix can have any dimensions, the first step would be encoding that. So for an n * m matrix, simply storing those two integers.

Following this, for each row, how about the following:

  • store the number of connected 0s at the left.
  • store the number of connected 1s following this.
  • repeat until the end of the row.

To decode, you'd simply have to follow a process like this:

  • for each row (in the given row count)
    • let count be the number of matrix entries read so far. Initialise to 0
    • let next be a boolean, representing what value the next number encodes. Initialise to false.
    • read the next number
      • insert that number of values equal to next into the appropriate row.
      • flip next
      • increment count by that number
    • loop until count == m (if count > m something has gone wrong)
  • loop until all rows have been read

I hope I've explained the idea fairly well. As I said, I don't know how well it would perform, but it arose from an observation about your example. To really squash it down, you could ensure that each number entry required only as many bits as necessary to count up to (and no higher than) the row width (m in my pseudocode.)

Example A would come out something like this (whitespace just for readability):

7 12 //dimensions
12
7 2 3
2 2 2 4 2
2 8 2
6 4 2
5 5 2
12

You'd only need 4 bits for each of the numbers, if you optimised that.
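The encode/decode procedures above can be sketched as follows (Python, my naming); on Example A the encoder reproduces exactly the run lists shown:

```python
def encode_runs(matrix):
    # Per row: run lengths, always starting with the count of leading 0s
    # (so a row beginning with 1 starts with an explicit 0).
    encoded = []
    for row in matrix:
        runs, value = [0], 0
        for cell in row:
            if cell == value:
                runs[-1] += 1
            else:
                runs.append(1)
                value ^= 1
        encoded.append(runs)
    return encoded

def decode_runs(encoded, width):
    matrix = []
    for runs in encoded:
        row, value = [], 0
        for count in runs:
            row += [value] * count
            value ^= 1              # runs alternate between 0s and 1s
        assert len(row) == width    # the count == m sanity check above
        matrix.append(row)
    return matrix
```

Capping each stored number at 4 bits (enough to count to 12) is then a simple bit-packing pass over these lists.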

I won't try to do Example B, but it would perform much worse.

Oly answered Sep 27 '22 21:09