How to compress a sequence of non-repeated number size N bits?

Question

Im trying to compress a sequence of non-negative numbers, where:

The value range of each number is from 0 to 2^N-1
Each number appears once only (it means there is total 2^N numbers )

Example for N = 4:

[14, 1, 8, 2, 12, 6, 0, 10, 4, 13, 5, 7, 15, 9, 3, 11]

So normally each number will cost 4 bits and for 16 numbers we will have to use 16x4 = 64 bits to store them.

Currently I have just thought of compressing them as below:

For the first 8 numbers --> use 4 bits to store each of them.
For the next 4 numbers ---> only 3 bits/each
For the next 2 numbers ---> only 2 bits/each
For the next 1 numbers ---> only 1 bit for it.
For the last one, actually its not necesssary to be stored (Obviously we should know what the last number is if we know all other 15 numbers )

So the compressed data size will be:

Z = 8 * 4 + 4 * 3 + 2 * 2 + 1 * 1 + 1 * 0 = 49 bits

The compression ratio is about 76%, which is pretty good (I think).

But for larger values of N, the ratio seems to be decreased (For N = 2048, the ratio is only 91% )

So I would like to hear your suggestions for better compressing.

Thank you.

rici · Accepted Answer

As is pointed out in comments, the optimal encoding -- if all permutations are equally probable -- is to replace the entire permutation with its index in the enumeration of permutations. Since there are n! possible permutations, the index requires log₂n! bits, and therefore the compression ratio from the naive encoding using log₂n bits for each element is (log n!)/(n log n).

Using Stirling's approximation, we can rewrite that as (n log n - n + O(log n))/(n log n), which is 1 - 1/(log n) + O(1/n) which evidently asymptotically approaches 1 as n grows. So it is inevitable that the compression ratio will decrease for larger n.

It is not possible to achieve better compression unless not all permutations are equally probable (and you have some information about the probability distribution).

Karoly Horvath · Answer

Currently you are using N*2^N bits.

Basically what you have is a permutation of the numbers, and each permutation is unique, and for permutation you can calculate a unique identifier. Since there are (2^N)! permutations, you will only need ceil(log2((2^N)!)) bits. For you example, this is 45 bits.

ecatmur · Answer

For this specific problem, the most efficient encoding is to view the permutation of [0 .. 2^N-1] as a numeral in the factorial number system and store the Lehmer code for that permutation.

This gives a requirement of ceil(log2((2^N)!)) bits. For N = 4, this uses 45 bits (70.3%); for N = 11 (2^N = 2048), 19581 bits (86.9%).

The compression ratio worsens as N increases; using the simple approximation log x! >= (x log x) - x + 1 we attain a minimum for log2((2^N)!) / (N 2^N) of 1 - ((2^N - 1)/(2^N))*(1 / (N * log(2))), which approaches 1 as N tends to infinity.

Given this absolute bound on compression ratio, any approach you can find which is reasonably efficient is worth going for; for values as small as N = 15 it's impossible to do better than 90%.

How to compress a sequence of non-repeated number size N bits?

Tags:

c++

algorithm

compression

UenX

3 Answers

rici

Karoly Horvath

ecatmur

Recent Activity

Donate For Us

How to compress a sequence of non-repeated number size N bits?

Tags:

c++

algorithm

compression

UenX

3 Answers

rici

Karoly Horvath

ecatmur

Related questions

Recent Activity

Donate For Us