Better compression algorithm for vector data?

I need to compress some spatially correlated data records. Currently I am getting 1.2x-1.5x compression with zlib, but I figure it should be possible to get more like 2x. The data records have various fields, but for example, zlib seems to have trouble compressing lists of points.

The points represent a road network. They are pairs of fixed-point 4-byte integers of the form XXXXYYYY. Typically, if a single data block has 100 points, there will only be a few combinations of the top two bytes of X and Y (spatial correlation). But the bottom bytes are always changing and must look like random data to zlib.

Similarly, the records have 4-byte IDs which tend to have constant high bytes and variable low bytes.
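Conceptually, a record looks roughly like this (the field names and exact packing here are just illustrative, not the real format):

#include <cstdint>
#include <vector>

struct Point {
    int32_t x;   // fixed-point; the top two bytes barely change within a block
    int32_t y;   // fixed-point; the top two bytes barely change within a block
};

struct Record {
    uint32_t id;                // high bytes roughly constant, low bytes variable
    std::vector<Point> points;  // e.g. ~100 points per data block
    // ... other fields
};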

Is there another algorithm that would be able to compress this kind of data better? I'm using C++.

Edit: Please no more suggestions to change the data itself. My question is about automatic compression algorithms. If somebody has a link to an overview of all popular compression algorithms, I'll just accept that as the answer.

asked Feb 26 '23 by Qwertie
1 Answer

You'll likely get much better results if you try to compress the data yourself based on your knowledge of its structure.

General-purpose compression algorithms just treat your data as a bitstream. They look for commonly used sequences of bits and replace them with shorter dictionary indices.

But the duplicate data doesn't go away. The duplicated sequence gets shorter, but it's still duplicated just as often as it was before.

As I understand it, you have a large number of data points of the form XXxxYYyy, where the upper-case letters are very uniform. So factor them out.

Rewrite the list as something similar to this:

XXYY // a header describing the common first and third byte for all the subsequent entries
xxyy // the remaining bytes, which vary
xxyy
xxyy
xxyy
...
XXYY // next unique combination of the first and third byte
xxyy
xxyy
...

Now, each combination of the rarely varying bytes is listed only once, rather than duplicated for every entry it occurs in. That adds up to a significant space saving.

Basically, try to remove duplicate data yourself, before running it through zlib. You can do a better job of it because you have additional knowledge about the data.
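In C++, that restructuring might look roughly like the sketch below, assuming 4-byte coordinates split into 16-bit high and low halves; the group header and per-group count are illustrative choices, not a fixed format:

#include <cstdint>
#include <map>
#include <utility>
#include <vector>

struct Point { uint32_t x, y; };

// Group points by the high 16 bits of x and y, then emit one XXYY header
// per group followed by only the low 16 bits of each member.
std::vector<uint16_t> factorHighBytes(const std::vector<Point>& pts)
{
    std::map<std::pair<uint16_t, uint16_t>,
             std::vector<std::pair<uint16_t, uint16_t>>> groups;
    for (const Point& p : pts) {
        auto key = std::make_pair(uint16_t(p.x >> 16), uint16_t(p.y >> 16));
        groups[key].emplace_back(uint16_t(p.x & 0xFFFF), uint16_t(p.y & 0xFFFF));
    }

    std::vector<uint16_t> out;
    for (const auto& g : groups) {
        out.push_back(g.first.first);              // XX header
        out.push_back(g.first.second);             // YY header
        out.push_back(uint16_t(g.second.size()));  // entry count (assumes < 65536 per group)
        for (const auto& lo : g.second) {
            out.push_back(lo.first);               // xx
            out.push_back(lo.second);              // yy
        }
    }
    return out;  // feed this buffer to zlib afterwards
}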

Another approach: instead of storing these coordinates as absolute numbers, write them as deltas, i.e. offsets from some reference location chosen to be as close as possible to all the entries. The deltas will be smaller numbers, which can be stored using fewer bits.
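A minimal sketch of that idea, using the bounding-box minimum as the shared reference point (that choice, and the fixed 32-bit output width, are my assumptions; a variable-length encoding of the offsets would shrink them further):

#include <algorithm>
#include <cstdint>
#include <vector>

struct Point { int32_t x, y; };

// Encode points as offsets from a shared reference (here the bounding-box
// minimum, so every delta is non-negative). The reference is written once;
// the deltas are small numbers that need fewer significant bits.
std::vector<int32_t> deltaEncode(const std::vector<Point>& pts)
{
    std::vector<int32_t> out;
    if (pts.empty()) return out;

    int32_t refX = pts[0].x, refY = pts[0].y;
    for (const Point& p : pts) {
        refX = std::min(refX, p.x);
        refY = std::min(refY, p.y);
    }

    out.push_back(refX);  // shared reference, stored once
    out.push_back(refY);
    for (const Point& p : pts) {
        out.push_back(p.x - refX);  // small, non-negative offset
        out.push_back(p.y - refY);
    }
    return out;
}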

answered Mar 07 '23 by jalf