I need help analyzing this programming technique for compressing an array

I hope readers are aware of Shannon's information theory, which says that the information content associated with an event a of probability p(a) is -log(p(a)). In layman's terms: if you need to represent a number in the range 0-7, then you require at least -log(1/8) = log(8) = 3 bits (all logarithms here are base 2).
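
A quick sanity check of that arithmetic in Python (the helper name info_bits is my own):

```python
import math

def info_bits(p):
    """Information content, in bits, of an event with probability p."""
    return -math.log2(p)

print(info_bits(1 / 8))  # 3.0 -> a uniform value in the range 0-7 needs 3 bits
```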

Suppose there is an array of integers ranging from 0 to 255. Instead of storing the array as 8-bit numbers, I will first sort the array in ascending order (keeping a backup, of course). Instead of encoding every array element as an 8-bit integer, I will output its position in the sorted array. Now the problem is to let the decoder or receiver know this sorted array. I will output the first (least) integer value as an 8-bit number, then the increment to be added to this number, and so on: first all of the sorted array, followed by the order of the elements, i.e. the position values.

Ex: original array -> 231, 3, 45, 0, 23, 32, 78

sorted array -> 0, 3, 23, 32, 45, 78, 231

The encoded info is 0 (the first element of the sorted array, as an 8-bit number), then 3 (the increment over 0), then 20, then 9, then 13, then 33, then 153.

After sending the first number and the successive deltas, I will send the order. Since there are 7 integers here, I will need a three-bit number for each position: 3 (the position of 0 in the original array), then 1 (position of 3), then 4 (position of 23), then 5 (position of 32), then 2 (position of 45), then 6 (position of 78), then 0 (position of 231).

i.e. the position values are now 3, 1, 4, 5, 2, 6, 0
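
Here is a minimal Python sketch of the scheme as described so far (function names are mine; it produces the deltas and position values above and ignores the bit-level packing):

```python
def encode(arr):
    """Sort, then emit (first value + successive deltas) and, for each
    sorted element, its position in the original array."""
    order = sorted(range(len(arr)), key=lambda i: arr[i])
    svals = [arr[i] for i in order]
    deltas = [svals[0]] + [b - a for a, b in zip(svals, svals[1:])]
    return deltas, order

def decode(deltas, order):
    """Rebuild the sorted values by running sum, then scatter them back."""
    svals, total = [], 0
    for d in deltas:
        total += d
        svals.append(total)
    arr = [0] * len(order)
    for v, pos in zip(svals, order):
        arr[pos] = v
    return arr

deltas, order = encode([231, 3, 45, 0, 23, 32, 78])
print(deltas)                 # [0, 3, 20, 9, 13, 33, 153]
print(order)                  # [3, 1, 4, 5, 2, 6, 0]
print(decode(deltas, order))  # [231, 3, 45, 0, 23, 32, 78]
```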

Analysis to see if this scheme will compress:

first number -> 8 bits (it may actually require fewer bits, since it is the smallest)

next 6 numbers -> 5 bits each (the problem is that we can encode 3, 20, 9, and 13 with 5 bits, but not 33 and 153, which we would have to clamp to 31, the maximum for 5 bits)

7 positions of 3 bits each -> 21 bits

total -> 8 + 6*5 + 21 = 59 bits, which is more than the 56 bits we would have needed to encode 7 numbers of 8 bits each. We have achieved expansion rather than compression, and our scheme is lossy, since we have not been able to represent some of the large numbers properly.

Let us add some complexity to this scheme.

I will encode the first value 0 as an 8-bit number, immediately followed by the code for the last number, 231. Then I will send the code for 3 (the next increment over 0), then the code for 153 (the decrement from 231), then 20, then 33, then 9, then 13.

i.e. I am sending the deltas in a different order: instead of 0, 3, 20, 9, 13, 33, 153 I will send 3, 153, 20, 33, 9, 13 (after the two endpoints).

What I gain by this is a successive reduction in dynamic range. Observe that we have sent 0, then 231, then 3, then 153; by this point the range of possible values has shrunk. I mean that the next increment over 3 (which will be 20) cannot push the value past the second-largest number, 78: the increment cannot exceed 75, because if it did, the third value (3 + 76, say) would be greater than 78, clearly violating our sorting assumption.
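
Here is a small sketch of my reading of this alternating scheme, showing how the allowed range, and hence the bit width, shrinks as values are consumed from both ends (in this reading, once only two adjacent values remain the final gap of 13 is already implied):

```python
def alternating_delta_ranges(sorted_vals):
    """Consume values alternately from the low and high ends; at each step
    the delta can be at most the span between the current two anchors."""
    lo, hi = 0, len(sorted_vals) - 1
    from_low = True
    while hi - lo > 1:
        span = sorted_vals[hi] - sorted_vals[lo]  # largest delta possible now
        if from_low:
            delta = sorted_vals[lo + 1] - sorted_vals[lo]
            lo += 1
        else:
            delta = sorted_vals[hi] - sorted_vals[hi - 1]
            hi -= 1
        print(f"delta={delta:3d}  allowed range 0..{span}  -> {span.bit_length()} bits")
        from_low = not from_low

alternating_delta_ranges([0, 3, 23, 32, 45, 78, 231])
# delta=  3  allowed range 0..231  -> 8 bits
# delta=153  allowed range 0..228  -> 8 bits
# delta= 20  allowed range 0..75  -> 7 bits
# delta= 33  allowed range 0..55  -> 6 bits
# delta=  9  allowed range 0..22  -> 5 bits
```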

If you have understood the idea so far, I have a further improved scheme that uses the binary-search idea to reduce the dynamic range even more and put this technique on steroids. Here is the sorted array:

0, 3, 23, 32, 45, 78, 231

Observe that the sorted array has 7 numbers and the middle one is 32. So now we will encode this 32 with 8 bits, then send the deltas level by level, as in a binary search tree: the next number after 32 will be 3, which will be encoded as 29 (i.e. 32 - 3), and the next one will be 78, encoded as 46 (78 - 32); then 0, encoded as 3 (3 - 0); then 23, encoded as 20 (23 - 3); then 45, encoded as 33 (78 - 45); then the last one, 231, encoded as 153 (231 - 78).

Now notice that we can decide how many bits to use for each number on a case-by-case basis.

We will be sending the sorted array as 32 (range 0-255, so 8 bits), 29 (range 0-32, so 6 bits), 46 (range 32-255, so 8 bits), 3 (range 0-3, so 2 bits), 20 (range 3-32, so 5 bits), 33 (range 32-78, so 6 bits), 153 (range 78-255, so 8 bits).
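
To verify the 43-bit total mechanically, here is a sketch (structure and names are mine) that walks the sorted array as an implicit balanced search tree in the same breadth-first order, tightening the (low, high) bounds at each step:

```python
from collections import deque

def bst_delta_bits(sorted_vals, lo=0, hi=255):
    """Visit the middle element of each index range; it is an offset inside
    the (lo, hi) value bounds inherited from its ancestors, so it needs
    bit_length(hi - lo) bits. Children inherit tightened bounds."""
    queue = deque([(0, len(sorted_vals) - 1, lo, hi)])
    total = 0
    while queue:
        i, j, lo, hi = queue.popleft()
        if i > j:
            continue
        mid = (i + j) // 2
        bits = (hi - lo).bit_length()
        total += bits
        print(f"value {sorted_vals[mid]:3d} in ({lo}, {hi}) -> {bits} bits")
        queue.append((i, mid - 1, lo, sorted_vals[mid]))
        queue.append((mid + 1, j, sorted_vals[mid], hi))
    return total

print(bst_delta_bits([0, 3, 23, 32, 45, 78, 231]))  # 8+6+8+2+5+6+8 = 43
```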

So in total 8 + 6 + 8 + 2 + 5 + 6 + 8 = 43 bits, which is lossless but more than our initial estimate of 38 (8 bits + 6*5 bits). Adding the 7 position values of three bits each gives 43 + 21 = 64 bits, which is more than 56. Our scheme is still expanding.

What improvement can we make to the position numbers, which take 21 bits? Every time we send a position, the number of remaining positions decreases by one, so if we have 7 positions to send, the number of bits is log(7) + log(6) + log(5) + ..., which is log(7!) bits, where all logarithms are base 2.

Observe that I have used the identity log(a) + log(b) = log(ab).

This equals 12.299, which when added to 43 gives 55.299, a tad lower than 56. But this is not practical: with a whole number of bits per position we need at least 3 (range 7) + 3 (range 6) + 3 (range 5) + 2 (range 4) + 2 (range 3) + 1 (range 2) + 0 (range 1) = 14 bits, which when added to 43 gives 57, which is expansion.

The goal of this effort is to achieve at least a 1-bit reduction in data size. If we can compress 56 bits into 55 without any assumptions about the data, then we can take the 55-bit output and compress it again to 54 bits, and so on. This looks impossible; the idea is similar to perpetual motion machines. The task now is to see what stops us from compressing further.

I need to analyze a bigger example array to see whether the 43 bits for the sorted array can be reduced. Also, what is the advantage of splitting an array into many parts and encoding each part separately? Another goal is to find a formula for the number of bits required to represent a sorted array, i.e. given the array size and the range of the elements, how do we derive numbers like 43?

Let us take 3, 1, 4, 5, 2, 6, 0 as an unsorted array again and observe that this sequence is one of the 5040 permutations of the seven numbers 0 to 6. We can represent it as a single 13-bit number (12.299 bits, as the theory suggests).
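
One concrete way to produce that single number is the factorial number system, also known as the Lehmer code; a sketch, with names of my own choosing:

```python
import math

def permutation_to_index(perm):
    """Map a permutation of 0..n-1 to its index in 0..n!-1 (Lehmer code)."""
    remaining = list(range(len(perm)))
    index = 0
    for p in perm:
        pos = remaining.index(p)             # rank among the values still unused
        index = index * len(remaining) + pos
        remaining.pop(pos)
    return index

def index_to_permutation(index, n):
    """Inverse: recover the permutation from its Lehmer index."""
    digits = []
    for k in range(1, n + 1):                # extract mixed-radix digits, low first
        digits.append(index % k)
        index //= k
    digits.reverse()
    remaining = list(range(n))
    return [remaining.pop(d) for d in digits]

idx = permutation_to_index([3, 1, 4, 5, 2, 6, 0])
print(idx)                                        # 2343, which fits in 13 bits
print(math.ceil(math.log2(math.factorial(7))))    # 13 = ceil(log2(5040))
print(index_to_permutation(idx, 7))               # [3, 1, 4, 5, 2, 6, 0]
```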

I need to know whether it is possible to compress this array even more.

Muk Bo asked Oct 09 '22 04:10

2 Answers

If we can compress 56 bits into 55 without any assumptions about the data, then we can take the 55-bit output and compress it again to 54 bits, and so on. This looks impossible; the idea is similar to perpetual motion machines. The task now is to see what stops us from compressing further.

It is not possible to have a lossless compression algorithm, making no assumptions about the data, that is guaranteed to decrease the size of all possible inputs. This follows from the pigeonhole principle: with n bits you can represent 2^n values, but with n-1 bits you can represent only 2^(n-1) values. Hence, once you have encoded half of the original values, the next value would have to be encoded using the same bits as one of the already-encoded values, and you lose information. Of course, if the original data uses fewer than 2^(n-1) distinct values, you can shrink it by a bit (or more), but that is already an assumption about the data, and you still won't be able to use that approach to recursively shrink the data without loss.

So you may find some way of compressing the array by a bit, but only if the current encoding uses at most half of the possible bit patterns. It may be an obscure encoding, though, and it will surely use more than half of the bit patterns of some k bits; that k is your threshold, and you won't be able to decrease the size any further.

Also, what is the advantage of splitting an array into many parts and encoding each part separately?

If you split the array into smaller parts, the local differences will be smaller, so you can use fewer bits to represent the differences between neighboring numbers. In an array like [1, 2, 3, 4, 2^30, 2^30+1, 2^30+2, 2^30+3] you can save some space this way. You will, however, have to spend more bits on the new absolute values; those could in turn be represented as distances from some arbitrary reference value to save space. But I am not sure it is worth all the effort you outlined just to maybe save 1 bit in some cases. A sketch of the trade-off follows.
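
A rough Python illustration of that trade-off; the 32-bit anchor and the fixed-width deltas sized to the largest gap in a block are simplifying assumptions of mine:

```python
def block_cost_bits(block, anchor_bits=32):
    """Cost of one block: an absolute anchor plus fixed-width deltas wide
    enough for the largest gap in the block (a deliberate simplification)."""
    deltas = [b - a for a, b in zip(block, block[1:])]
    width = max((d.bit_length() for d in deltas), default=0)
    return anchor_bits + width * len(deltas)

arr = [1, 2, 3, 4, 2**30, 2**30 + 1, 2**30 + 2, 2**30 + 3]
print(block_cost_bits(arr))                                 # 242: one huge gap forces 30-bit deltas everywhere
print(block_cost_bits(arr[:4]) + block_cost_bits(arr[4:]))  # 70: two anchors, but only 1-bit deltas
```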

To sum up: if you have an array like [2^30, 2^30+1, 2^30+2, 2^30+3], you can obviously save space by taking the differences between the numbers, but as you already noted, in some cases this increases the size of the data. Hence you cannot have a compression algorithm that stores any array of numbers (without making assumptions) in fewer than n bits, where n is the sum of the ceilings of the logarithms of the numbers in the array.

Laky answered Oct 12 '22 17:10


The goal of this effort is to achieve at least a 1-bit reduction in data size.

That is not possible over all inputs. You can waste a great deal of effort trying to properly count the bits in various representations, making mistakes, fixing them, and so on, when all you really need to do is count how many cases there are.

There are 2^k possible inputs, where k is the number of bits in the input. Say you believe you have a (k-1)-bit representation of every single input. Then there are only 2^(k-1) possible representations. If you feed every one of those 2^(k-1) representations to your decompressor, you will obviously get only 2^(k-1) results; the other 2^(k-1) possible inputs are missing in action. There is no way to generate those missing inputs from your representation, which means your representation cannot in fact cover all 2^k inputs. At least half of them are not covered.
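
A toy illustration of this counting argument (the truncating "compressor" is just an arbitrary example):

```python
from itertools import product

inputs = list(product((0, 1), repeat=4))  # all 2^4 = 16 possible 4-bit inputs

def compress(bits):
    """Toy 'compressor': drop the last bit, leaving a 3-bit representation."""
    return bits[:-1]

def decompress(bits):
    """Any deterministic inverse can only pick one extension per code."""
    return bits + (0,)

recovered = {decompress(compress(x)) for x in inputs}
print(len(inputs), len(recovered))  # 16 8: half the inputs can never come back
```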

Mark Adler answered Oct 12 '22 19:10