 

Best compression algorithm? (see below for definition of best)

I want to store a list of the following tuples in a compressed format and I was wondering which algorithm gives me

  • smallest compressed size
  • fastest de/compression
  • tradeoff optimum ("knee" of the tradeoff curve)

My data looks like this:

(<int>, <int>, <double>), 
(<int>, <int>, <double>), 
...
(<int>, <int>, <double>)

One of the two ints refers to a point in time, and it's very likely that the numbers ending up in one list are close to each other. The other int represents an abstract ID, and those values are less likely to be close, although they aren't going to be completely random, either. The double represents a sensor reading, and while there is some correlation between the values, it's probably not of much use.

asked Nov 10 '08 by Hanno Fietz


5 Answers

Since the "time" ints can be close to each other, try to only store the first and after that save the difference to the int before (delta-coding). You can try the same for the second int, too.

Another thing you can try is to reorganize the data from [int1, int2, double], [int1, int2, double]... to [int1, int1...], [int2, int2...], [double, double...].
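
For example, a sketch of that reorganization into a "column" (struct-of-arrays) layout; the type and member names are made up for illustration:

public readonly struct Reading
{
    public readonly int Time;
    public readonly int Id;
    public readonly double Value;

    public Reading(int time, int id, double value)
    {
        Time = time; Id = id; Value = value;
    }
}

public static class ColumnSplitter
{
    // Turn [time, id, value], [time, id, value]... into three
    // homogeneous arrays, so similar values sit next to each other.
    public static (int[] times, int[] ids, double[] values) Split(Reading[] rows)
    {
        var times = new int[rows.Length];
        var ids = new int[rows.Length];
        var values = new double[rows.Length];
        for (int i = 0; i < rows.Length; i++)
        {
            times[i] = rows[i].Time;
            ids[i] = rows[i].Id;
            values[i] = rows[i].Value;
        }
        return (times, ids, values);
    }
}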

To find out the compression range your result will be in, you can write your data into a file and try the compressor CCM by Christian Martelock. I found that it performs very well for such data collections; it uses a quite fast context-mixing algorithm. You can also compare it to other compressors like WinZip, or use a compression library like zlib, to see whether it is worth the effort.

answered by schnaader


Here is a common scheme used in most search engines: store the deltas of the values and encode each delta using a variable-byte encoding scheme, so that a delta below 128 can be encoded with only 1 byte. See vint in Lucene and varints in Protocol Buffers for details.
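
A sketch of such a variable-byte ("vint") scheme in C#; the zig-zag step is the Protocol Buffers trick for keeping small negative deltas short as well, and the names are illustrative:

using System.IO;

static class VarInt
{
    // Zig-zag maps signed values to unsigned ones so that values
    // near zero get short encodings: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4.
    public static uint ZigZag(int n) => (uint)((n << 1) ^ (n >> 31));

    // 7 payload bits per byte; the high bit marks "more bytes follow".
    // Values below 128 therefore take exactly one byte.
    public static void Write(Stream s, uint v)
    {
        while (v >= 0x80)
        {
            s.WriteByte((byte)(v | 0x80));
            v >>= 7;
        }
        s.WriteByte((byte)v);
    }

    public static uint Read(Stream s)
    {
        uint result = 0;
        int shift = 0;
        while (true)
        {
            int b = s.ReadByte();
            result |= (uint)(b & 0x7F) << shift;
            if ((b & 0x80) == 0)
                return result;
            shift += 7;
        }
    }
}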

This will not give you the best compression ratio, but it is usually the fastest in encoding/decoding throughput.

answered by ididak


If I'm reading the question correctly, you simply want to store the data efficiently. Obviously there are simple options like compressed XML, but there are more direct binary serialization methods. One that leaps to mind is Google's Protocol Buffers.

For example, in C# with protobuf-net, you can simply create a class to hold the data:

[ProtoContract]
public class Foo {
    [ProtoMember(1)]
    public int Value1 {get;set;}
    [ProtoMember(2)]
    public int Value2 {get;set;}
    [ProtoMember(3)]
    public double Value3 {get;set;}
}

Then just [de]serialize a List<Foo> or Foo[] etc. via the ProtoBuf.Serializer class.
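
For example, a file round trip might look like this (a sketch assuming the protobuf-net package is referenced and Foo is defined as above):

using System.Collections.Generic;
using System.IO;
using ProtoBuf;

var data = new List<Foo>
{
    new Foo { Value1 = 1226307600, Value2 = 42, Value3 = 21.5 }
};

// Serialize to a file...
using (var file = File.Create("readings.bin"))
{
    Serializer.Serialize(file, data);
}

// ...and read it back.
List<Foo> roundTripped;
using (var file = File.OpenRead("readings.bin"))
{
    roundTripped = Serializer.Deserialize<List<Foo>>(file);
}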

I'm not claiming it will be quite as space-efficient as rolling your own, but it'll be pretty darned close. The protocol buffer spec makes fairly good use of space (for example, using base-128 for integers, such that small numbers take less space). But it would be simple to try it out, without having to write all the serialization code yourself.

This approach, as well as being simple to implement, has the advantage of being easy to use from other architectures, since there are Protocol Buffers implementations for various languages. It also uses much less CPU than regular [de]compression (GZip/DEFLATE/etc.) or XML-based serialization.

answered by Marc Gravell


Sort as already proposed, then store the data as a transposed matrix:

(first ints) (second ints) (doubles)

Then compress it.
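
A sketch of that pipeline, with DeflateStream from System.IO.Compression standing in for whichever compressor you end up choosing (the names are illustrative):

using System.IO;
using System.IO.Compression;

static class ColumnCompressor
{
    // Write each column back-to-back through a single Deflate stream;
    // grouping like-typed, sorted values lets the compressor find
    // much more redundancy than the row-by-row layout would.
    public static byte[] Compress(int[] times, int[] ids, double[] values)
    {
        var buffer = new MemoryStream();
        using (var deflate = new DeflateStream(buffer, CompressionLevel.Optimal, leaveOpen: true))
        using (var writer = new BinaryWriter(deflate))
        {
            writer.Write(times.Length);
            foreach (var t in times) writer.Write(t);
            foreach (var i in ids) writer.Write(i);
            foreach (var v in values) writer.Write(v);
        }
        return buffer.ToArray();
    }
}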

answered by Gilles


Most compression algorithms will work equally badly on such data. However, there are a few preprocessing steps you can take to increase the compressibility of the data before feeding it to a gzip- or deflate-like algorithm. Try the following:

First, if possible, sort the tuples in ascending order, using the abstract ID first, then the timestamp. Assuming you have many readings from the same sensor, identical IDs will be placed close together.

Next, if the measurements are taken at regular intervals, replace each timestamp with the difference from the previous timestamp (except for the very first tuple for a sensor, of course). For example, if all measurements are taken at 5-minute intervals, the delta between two timestamps will usually be close to 300 seconds. The timestamp field will therefore be much more compressible, as most values are equal.

Then, assuming that the measured values are stable over time, replace all readings with a delta from the previous reading for the same sensor. Again, most values will be close to zero, and thus more compressible.

Also, floating-point values are very bad candidates for compression, due to their internal representation. Try to convert them to integers. For example, temperature readings most likely do not require more than two decimal digits: multiply the values by 100 and round to the nearest integer.
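
A sketch combining these four steps (sort, timestamp deltas, reading deltas, fixed-point quantization); the type names and the x100 scale factor are illustrative assumptions:

using System;
using System.Linq;

public record Sample(int SensorId, int Timestamp, double Reading);

public static class Preprocessor
{
    public static (int id, int tsDelta, long readingDelta)[] Transform(Sample[] samples)
    {
        // 1. Sort by sensor ID first, then by timestamp, so rows from
        //    the same sensor end up next to each other.
        var sorted = samples.OrderBy(s => s.SensorId)
                            .ThenBy(s => s.Timestamp)
                            .ToArray();

        var result = new (int id, int tsDelta, long readingDelta)[sorted.Length];
        int prevId = int.MinValue, prevTs = 0;
        long prevReading = 0;

        for (int i = 0; i < sorted.Length; i++)
        {
            var s = sorted[i];

            // 4. Quantize: two decimal digits suffice, so multiply by
            //    100 and round to the nearest integer.
            long quantized = (long)Math.Round(s.Reading * 100);

            // 2 + 3. Delta-code timestamp and reading per sensor; the
            //        first row of each sensor is stored verbatim.
            bool newSensor = s.SensorId != prevId;
            int tsDelta = newSensor ? s.Timestamp : s.Timestamp - prevTs;
            long readingDelta = newSensor ? quantized : quantized - prevReading;

            result[i] = (s.SensorId, tsDelta, readingDelta);
            prevId = s.SensorId;
            prevTs = s.Timestamp;
            prevReading = quantized;
        }
        return result;
    }
}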

answered by Stephan Leclercq