 

Compression of a 2D point set - ideas?

I have a set of 2D points stored in an array, and I need to compress it as much as I can. Speed is preferable but not a deal breaker; compression ratio is the goal. The rules are:

  • a point = a 32-bit structure, stored as (x,y), 2 bytes for each coordinate
  • a coordinate = a "float" with 8 bits integer part, 8 bits fractional part

Special properties:

  • I may change the order of the points as I see fit
  • I'm given the points ordered by the integer parts of their x and y coordinates, which I might be able to exploit, but the fractional parts look pretty much random from what I saw
  • the array I receive is contiguous (from a memory standpoint)

What I've researched so far:

  • consider them plain integers (32 bit), sort them (the order is mine to choose), then compress it as in this question.
  • consider my array as a plain char string, then apply a Burrows-Wheeler transform (BWT) with run-length encoding, or Huffman
  • treat my array as plain binary data, then apply LZW

I only managed to implement Huffman and BWT, but neither of them gives me a good compression ratio (or uses the main property of my data set). I'm going to try the first option today.

I'm sure there are better ideas. Do you have any? Did you come across anything similar and implemented something really good?

Dataset example, in hex:

00 0A 00 77 00 55 00 80 00 2B 00 B9 00 7A 00 5B 
00 F6 00 76 00 B4 00 25 00 47 00 D3 00 F6 00 7D
...
01 05 00 A9 01 B8 00 10 01 4F 00 6A 01 E6 00 DF
01 1F 00 F0 01 BE 00 C3 01 6C 00 87 01 CE 00 44
...
...
15 06 03 F4 15 1E 03 29 15 35 03 10 15 B9 03 22
15 67 03 73 15 EF 03 5C 15 29 03 B8 15 4C 03 2F
...

where e.g. the particle 15 67 03 73 (last row shown) means a particle at x = 15 + 67/256, y = 3 + 73/256. As you see, the data is somewhat ordered, but the fractional parts are in total disarray.
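
For reference, here is a minimal sketch of how one point can be decoded from the raw buffer, assuming the byte order shown above ([x integer byte][x fractional byte][y integer byte][y fractional byte]); it's an illustration, not my actual code:

    #include <cstddef>
    #include <cstdint>

    struct Point { uint16_t x, y; };   // 8.8 fixed point packed into 16 bits each

    // Decode the i-th point from the raw dump: [x_int][x_frac][y_int][y_frac]
    Point decode(const uint8_t* buf, size_t i) {
        const uint8_t* p = buf + 4 * i;
        Point pt;
        pt.x = static_cast<uint16_t>(p[0]) << 8 | p[1];  // e.g. 15 67 -> 0x1567 = 0x15 + 0x67/256
        pt.y = static_cast<uint16_t>(p[2]) << 8 | p[3];  // e.g. 03 73 -> 0x0373 = 0x03 + 0x73/256
        return pt;
    }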

asked Apr 20 '15 by webuster


2 Answers

The first option from the OP is the most appropriate, but it can be improved further.

  1. Reinterpret coordinates as 16-bit integers.
  2. Transform point positions into distances along a Hilbert curve (or any other space-filling curve).
  3. Sort the distances, then apply delta encoding (compute differences of adjacent distances).
  4. Depending on compression/speed preferences, (a) use something like Elias or Golomb codes (fastest), (b) use Huffman encoding, or (c) use something like arithmetic encoding (best compression rate).
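
A rough sketch of steps 1-3 (the Hilbert mapping is the classic xy2d routine; treat this as an illustration, not a tuned implementation):

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <utility>
    #include <vector>

    // Rotate/flip a quadrant appropriately (helper for the Hilbert mapping).
    static void hilbertRot(uint32_t n, uint32_t& x, uint32_t& y, uint32_t rx, uint32_t ry) {
        if (ry == 0) {
            if (rx == 1) { x = n - 1 - x; y = n - 1 - y; }
            std::swap(x, y);
        }
    }

    // Map (x, y) with 0 <= x, y < n (n a power of two) to its distance along
    // the Hilbert curve of side n. This is the classic xy2d algorithm.
    static uint64_t hilbertD(uint32_t n, uint32_t x, uint32_t y) {
        uint64_t d = 0;
        for (uint32_t s = n / 2; s > 0; s /= 2) {
            uint32_t rx = (x & s) ? 1 : 0;
            uint32_t ry = (y & s) ? 1 : 0;
            d += (uint64_t)s * s * ((3 * rx) ^ ry);
            hilbertRot(n, x, y, rx, ry);
        }
        return d;
    }

    // Steps 1-3: reinterpret the 8.8 fixed-point coordinates as 16-bit integers,
    // map each point to its Hilbert distance, sort, then delta-encode.
    std::vector<uint64_t> toDeltas(const std::vector<std::pair<uint16_t, uint16_t>>& pts) {
        std::vector<uint64_t> d(pts.size());
        for (size_t i = 0; i < pts.size(); ++i)
            d[i] = hilbertD(1u << 16, pts[i].first, pts[i].second);
        std::sort(d.begin(), d.end());
        for (size_t i = d.size(); i-- > 1; )       // differences of adjacent distances
            d[i] -= d[i - 1];
        return d;                                  // d[0] stays as the smallest distance
    }

The resulting deltas are small non-negative numbers, which is what the codes in step 4 are good at.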

If there is some pattern in points' distribution, you could try more advanced compressors for step #4: LZ*, BWT, or PPM.


Here are the results of an experimental comparison of the methods used in step 4. A worst-case scenario is assumed: points are uniformly randomly distributed in the range 00.00 .. FF.FF (so that the only compression opportunity is to discard information about their ordering). All results are computed for 250000 points:

method        compressed size
------        ---------------
Uncompressed: 1000000
Elias4:        522989
Elias3:        495371
Elias2:        505376
Golomb12:      479802
Golomb13:      472238
Golomb14:      479431
Golomb15:      501422
FSE1:          455367
FSE2:          454120
FSE3:          453862

I didn't try Huffman encoding. FSE (Finite State Entropy) is a method similar to arithmetic coding. The number after each method name is its configuration parameter: for Elias coding, how many bits are used to encode each number's bit length; for Golomb coding, how many least significant bits are left uncompressed; for FSE, how many most significant bits are compressed (along with the bit length).
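
For illustration, here is a minimal sketch of the Golomb variant measured above (a power-of-two divisor, i.e. Rice coding: the k least significant bits are stored verbatim and the quotient in unary). It is purely illustrative and is not the code that produced the table; that is at the link below:

    #include <cstdint>
    #include <vector>

    // Append one bit to a bit stream stored as a vector of bools (illustrative only).
    static void putBit(std::vector<bool>& out, bool b) { out.push_back(b); }

    // Rice-encode a value: quotient (v >> k) in unary, then the k low bits verbatim.
    void riceEncode(std::vector<bool>& out, uint64_t v, unsigned k) {
        uint64_t q = v >> k;
        for (uint64_t i = 0; i < q; ++i) putBit(out, 1);           // unary quotient: q ones...
        putBit(out, 0);                                            // ...terminated by a zero
        for (unsigned b = k; b-- > 0; ) putBit(out, (v >> b) & 1); // k low bits, MSB first
    }

    // Encoded size in bits for a whole delta sequence, e.g. to pick the best k
    // (the table above tries k = 12..15).
    uint64_t riceSizeBits(const std::vector<uint64_t>& deltas, unsigned k) {
        uint64_t bits = 0;
        for (uint64_t v : deltas) bits += (v >> k) + 1 + k;
        return bits;
    }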

All the results were produced by this source: http://ideone.com/ulmPzO

answered Nov 05 '22 by Evgeny Kluev


Interleave the bits representing the X and Y coordinates of every point, sort and compress.

For instance, suppose you have the point (X, Y) represented by the two 16-bit numbers

(X15 X14 X13 X12 X11 X10 X9 X8 X7 X6 X5 X4 X3 X2 X1 X0, Y15 Y14 Y13 Y12 Y11 Y10 Y9 Y8 Y7 Y6 Y5 Y4 Y3 Y2 Y1 Y0)

Convert it into the following 32-bit number:

X15 Y15 X14 Y14 X13 Y13 X12 Y12 X11 Y11 X10 Y10 X9 Y9 X8 Y8 X7 Y7 X6 Y6 X5 Y5 X4 Y4 X3 Y3 X2 Y2 X1 Y1 X0 Y0

That would take advantage of any clustering that may appear in the data, as physically near points will appear in nearby positions in the sorted list and their representations will share their leading bits.
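
A minimal sketch of the interleaving using the standard Morton-code bit tricks (the helper name part1by1 is just a common name for this trick, nothing from a library):

    #include <cstdint>

    // Spread the 16 bits of v so they occupy the even bit positions of the result.
    static uint32_t part1by1(uint32_t v) {
        v &= 0x0000FFFF;
        v = (v | (v << 8)) & 0x00FF00FF;
        v = (v | (v << 4)) & 0x0F0F0F0F;
        v = (v | (v << 2)) & 0x33333333;
        v = (v | (v << 1)) & 0x55555555;
        return v;
    }

    // Interleave X and Y into X15 Y15 X14 Y14 ... X0 Y0
    // (X ends up in the odd bit positions, Y in the even ones).
    uint32_t morton(uint16_t x, uint16_t y) {
        return (part1by1(x) << 1) | part1by1(y);
    }

Sort the resulting 32-bit codes and then compress them, e.g. with the delta encoding described in the other answer.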

Update: The point is to have nearby points sort into nearby positions. If you mix the X and Y bits, you get that, resulting in long sequences of 32-bit integers with identical values in their leading bits. If you then take deltas, you will get smaller values than if you just sort on X and then on Y (or vice versa).

The thing is that you can then view the data as a k-d tree, where every bit partitions the space (left/right or up/down). For the first levels you can compress by just saying how many elements there are on one side, until you reach cells with just a few elements, which can be represented by stating their remaining bits explicitly. For the best compression you will have to use arithmetic coding.
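
A rough sketch of that counting idea over the sorted interleaved codes (the emitted counts would still have to go through an entropy coder such as arithmetic coding, and the leaf cutoff of 4 elements is an arbitrary choice for illustration):

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // View the sorted Morton codes as an implicit k-d tree: at each node, record
    // how many elements fall on the "0" side of the current bit, then recurse.
    // Small leaves would store their remaining low bits explicitly instead.
    void emitCounts(const std::vector<uint32_t>& codes, size_t lo, size_t hi,
                    int bit, std::vector<uint32_t>& counts) {
        if (bit < 0 || hi - lo <= 4) return;   // leaf: store remaining bits verbatim
        // Within [lo, hi) all higher bits are equal, so the current bit is monotone
        // and the split point can be found with a binary search.
        size_t mid = std::partition_point(codes.begin() + lo, codes.begin() + hi,
                         [bit](uint32_t c) { return ((c >> bit) & 1u) == 0; })
                     - codes.begin();
        counts.push_back(static_cast<uint32_t>(mid - lo));   // elements on the "0" side
        emitCounts(codes, lo,  mid, bit - 1, counts);
        emitCounts(codes, mid, hi,  bit - 1, counts);
    }

    // Usage: emitCounts(sortedCodes, 0, sortedCodes.size(), 31, counts);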

answered Nov 05 '22 by salva