I have a set of 2D points stored in an array. I need to compress it as much as I can. Speed would be nice, but it's not a deal breaker; compression ratio is the goal. The rules are:
Special properties:
What I've researched so far:
I only managed to implement Huffman and BWT, but neither gives me a good compression ratio (nor exploits the main property of my data set). I'm going to try the first option today.
I'm sure there are better ideas. Do you have any? Have you come across anything similar and implemented something really good?
Dataset example, in hex:
00 0A 00 77 00 55 00 80 00 2B 00 B9 00 7A 00 5B
00 F6 00 76 00 B4 00 25 00 47 00 D3 00 F6 00 7D
...
01 05 00 A9 01 B8 00 10 01 4F 00 6A 01 E6 00 DF
01 1F 00 F0 01 BE 00 C3 01 6C 00 87 01 CE 00 44
...
...
15 06 03 F4 15 1E 03 29 15 35 03 10 15 B9 03 22
15 67 03 73 15 EF 03 5C 15 29 03 B8 15 4C 03 2F
...
where e.g. the particle 15 67 03 73
(last row) means a particle at x = 15 + 67/256, y = 3 + 73/256. As you can see, the data is somewhat ordered, but the fractional parts are in total disarray.
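To make the layout above concrete, here is a minimal sketch of decoding that 8.8 fixed-point format: each point is four bytes, `x_int x_frac y_int y_frac`. The function name `decode_points` is my own, just for illustration.

```python
# Sketch: decode the 8.8 fixed-point layout described above.
# Each point is 4 bytes: x_int, x_frac, y_int, y_frac.
def decode_points(data: bytes):
    """Yield (x, y) pairs as floats from the raw byte stream."""
    for i in range(0, len(data), 4):
        xi, xf, yi, yf = data[i:i + 4]
        yield (xi + xf / 256.0, yi + yf / 256.0)

raw = bytes.fromhex("15670373")
print(list(decode_points(raw)))  # [(21.40234375, 3.44921875)]
```

So `15 67 03 73` decodes to x = 21 + 103/256, y = 3 + 115/256 in decimal.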
The first option from the OP is more appropriate, but it can be improved further.
If there is some pattern in the points' distribution, you could try more advanced compressors for step #4: LZ*, BWT, or PPM.
Here are the results of an experimental comparison of the methods used in step 4. A worst-case scenario is assumed: points are uniformly randomly distributed in the range 00.00 .. FF.FF (so the only compression opportunity is to discard information about their ordering). All results are computed for 250,000 points:
method compressed size
------ ---------------
Uncompressed: 1000000
Elias4: 522989
Elias3: 495371
Elias2: 505376
Golomb12: 479802
Golomb13: 472238
Golomb14: 479431
Golomb15: 501422
FSE1: 455367
FSE2: 454120
FSE3: 453862
I didn't try Huffman encoding. FSE is a method similar to arithmetic coding. The numbers after the method names are configuration parameters: for Elias coding, how many bits are used to encode each number's bit length; for Golomb coding, how many least significant bits are left uncompressed; for FSE, how many most significant bits are compressed (along with the bit length).
All the results were produced by this source: http://ideone.com/ulmPzO
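As an illustration of the "GolombK" rows above, here is a sketch of Rice coding (Golomb with a power-of-two divisor): the k least significant bits are stored verbatim and the remaining high bits in unary. This is my own toy version with bit strings, not the code behind the table.

```python
# Sketch of Rice coding (Golomb with power-of-two divisor): the k least
# significant bits are stored verbatim, the high bits are written in unary.
def rice_encode(n: int, k: int) -> str:
    """Return the Rice code of a nonnegative integer n as a bit string."""
    q, r = n >> k, n & ((1 << k) - 1)
    return "1" * q + "0" + format(r, f"0{k}b")  # unary quotient, then k raw bits

def rice_decode(bits: str, k: int) -> int:
    """Inverse of rice_encode for a single code word."""
    q = bits.index("0")                 # count leading 1s
    r = int(bits[q + 1:q + 1 + k], 2)   # k verbatim low bits
    return (q << k) | r

code = rice_encode(12345, 12)  # "111" + "0" + 12 low bits of 12345
assert rice_decode(code, 12) == 12345
```

The parameter k trades off between the unary and verbatim parts, which is why Golomb12..15 give different sizes in the table.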
Interleave the bits representing the X and Y coordinates of every point, sort and compress.
For instance, if you have the point (X, Y) represented by the two 16-bit numbers
(X15X14X13X12X11X10X9X8X7X6X5X4X3X2X1X0, Y15Y14Y13Y12Y11Y10Y9Y8Y7Y6Y5Y4Y3Y2Y1Y0)
convert it into the following 32-bit number:
X15Y15X14Y14X13Y13X12Y12X11Y11X10Y10X9Y9X8Y8X7Y7X6Y6X5Y5X4Y4X3Y3X2Y2X1Y1X0Y0
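The interleaving above (a Morton / Z-order code) can be computed with the usual bit-spreading trick; a sketch, with X in the odd (high) bit positions as in the diagram:

```python
# Sketch: interleave two 16-bit coordinates into one 32-bit Morton code,
# with X bits in the odd (higher) positions, matching the layout above.
def part1by1(v: int) -> int:
    """Spread the 16 bits of v so they occupy the even bit positions."""
    v &= 0xFFFF
    v = (v | (v << 8)) & 0x00FF00FF
    v = (v | (v << 4)) & 0x0F0F0F0F
    v = (v | (v << 2)) & 0x33333333
    v = (v | (v << 1)) & 0x55555555
    return v

def morton(x: int, y: int) -> int:
    """32-bit code X15 Y15 X14 Y14 ... X0 Y0."""
    return (part1by1(x) << 1) | part1by1(y)
```

For example, `morton(0xFFFF, 0)` puts all ones in the odd positions, giving `0xAAAAAAAA`.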
That would take advantage of any clustering that may appear in the data, as physically near points will appear in nearby positions on the sorted list and their representations will share their leading bits.
Update: The point is to have near points sort into near positions. If you mix the X and Y bits, you get that, resulting in long sequences of 32-bit integers with identical values in their leading bits. If you then take deltas, you will have smaller values than if you just sorted on X and then on Y (or vice versa).
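The sort-then-delta step can be sketched in a few lines; when neighboring codes share leading bits, the differences are small and entropy-code well. These helper names are mine, for illustration:

```python
# Sketch: sort the interleaved 32-bit codes and store first-order deltas;
# shared leading bits cancel, leaving small values for the entropy coder.
def delta_encode(codes: list) -> list:
    codes = sorted(codes)
    return [c - p for c, p in zip(codes, [0] + codes[:-1])]

def delta_decode(deltas: list) -> list:
    """Running sum recovers the sorted code list."""
    out, acc = [], 0
    for d in deltas:
        acc += d
        out.append(acc)
    return out

vals = [7, 3, 5, 1000]
assert delta_decode(delta_encode(vals)) == sorted(vals)
```

Note that sorting discards the original ordering of the points, which is exactly the information the OP said may be lost.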
The thing is that you can then treat the data as a k-d tree: every bit partitions the space (left/right or up/down). For the first levels, you can compress by just stating how many elements there are on one side, until you get down to nodes with just a few elements, which can be represented by stating their remaining few bits explicitly. For the best compression you will have to use arithmetic coding.