Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to compress a sequence of non-repeated number size N bits?

Im trying to compress a sequence of non-negative numbers, where:

  • The value range of each number is from 0 to 2^N-1
  • Each number appears once only (it means there is total 2^N numbers )

    Example for N = 4:

    [14, 1, 8, 2, 12, 6, 0, 10, 4, 13, 5, 7, 15, 9, 3, 11]

So normally each number will cost 4 bits and for 16 numbers we will have to use 16x4 = 64 bits to store them.

Currently I have just thought of compressing them as below:

  • For the first 8 numbers --> use 4 bits to store each of them.
  • For the next 4 numbers ---> only 3 bits/each
  • For the next 2 numbers ---> only 2 bits/each
  • For the next 1 numbers ---> only 1 bit for it.
  • For the last one, actually its not necesssary to be stored (Obviously we should know what the last number is if we know all other 15 numbers )

So the compressed data size will be:

Z = 8 * 4 + 4 * 3 + 2 * 2 + 1 * 1 + 1 * 0 = 49 bits 

The compression ratio is about 76%, which is pretty good (I think).

But for larger values of N, the ratio seems to be decreased (For N = 2048, the ratio is only 91% )

So I would like to hear your suggestions for better compressing.

Thank you.

like image 514
UenX Avatar asked May 08 '15 17:05

UenX


3 Answers

As is pointed out in comments, the optimal encoding -- if all permutations are equally probable -- is to replace the entire permutation with its index in the enumeration of permutations. Since there are n! possible permutations, the index requires log2n! bits, and therefore the compression ratio from the naive encoding using log2n bits for each element is (log n!)/(n log n).

Using Stirling's approximation, we can rewrite that as (n log n - n + O(log n))/(n log n), which is 1 - 1/(log n) + O(1/n) which evidently asymptotically approaches 1 as n grows. So it is inevitable that the compression ratio will decrease for larger n.

It is not possible to achieve better compression unless not all permutations are equally probable (and you have some information about the probability distribution).

like image 156
rici Avatar answered Nov 14 '22 02:11

rici


Currently you are using N*2^N bits.

Basically what you have is a permutation of the numbers, and each permutation is unique, and for permutation you can calculate a unique identifier. Since there are (2^N)! permutations, you will only need ceil(log2((2^N)!)) bits. For you example, this is 45 bits.

like image 41
Karoly Horvath Avatar answered Nov 14 '22 02:11

Karoly Horvath


For this specific problem, the most efficient encoding is to view the permutation of [0 .. 2^N-1] as a numeral in the factorial number system and store the Lehmer code for that permutation.

This gives a requirement of ceil(log2((2^N)!)) bits. For N = 4, this uses 45 bits (70.3%); for N = 11 (2^N = 2048), 19581 bits (86.9%).

The compression ratio worsens as N increases; using the simple approximation log x! >= (x log x) - x + 1 we attain a minimum for log2((2^N)!) / (N 2^N) of 1 - ((2^N - 1)/(2^N))*(1 / (N * log(2))), which approaches 1 as N tends to infinity.

Given this absolute bound on compression ratio, any approach you can find which is reasonably efficient is worth going for; for values as small as N = 15 it's impossible to do better than 90%.

like image 2
ecatmur Avatar answered Nov 14 '22 01:11

ecatmur