I am to compress location data (latitude,longitude, date,time). All the numbers are in fixed format. 2 of them (latitude,longitude) are with decimal format. Other 2 are integers.
Now these numbers are in fixed format string.
What are the algorithms for compressing numbers in fixed format? Is number only compressions (if there any) better than string compression? Should I directly compress string without converting it to numbers and then compress?
Thanks in advance.
This is one of these places where a little theory is helpful. You need to think about several things:
Let's say, just for example, that the resolution is 0.01°. Them you know that your values range from -180° to +180°, or 35900 different values. Lg(35900) ≈ 16 so you need 16 bits; 14 bits for -90°–+90°. Clearly, if you're storing this kind of value as floating-point, you can compress the data by half immediately.
Similarly with date time, what's the range; how many bits must you have?
Now, if the data is in some order (like, samples taken sequentially aboard a single ship) then all you need is a start value and a delta; that can make a big difference. With a ship traveling at 30 knots, the position can't change any more that about 0.03 degrees an hour or about 0.0000083 degrees a second. Those deltas are going to be very small values, so you can store them in a very few bits.
The point is that there are a number of things you can do, but you have to know more about the data than we do to make a recommendation.
Update: Oh, wait, fixed point strings?!
Okay, this is (relatively) easy. Just to start with, yes, you want to convert your strings into some binary representation. Just making up a data item, you might have
040.00105.0020090518212100Z
which you could convert to
| 4000 | short int, 16 bits | | 10500 | short int, 16 bits | | 20090518212100Z | 64 bits |
So that's 96 bits, 12 bytes versus 26 bytes.
Compression typically works on a byte stream. When a stream has a non-uniform distribution of byte values (for instance text, or numbers stored as text), the compression ratio you can achieve will be higher, since fewer bits are used to store the bytes which appear more frequently (in Huffman compression).
Typically, the data you are talking about will simply be stored as binary numbers (not text), and that's usually space and retrieval efficient.
I recommend you have a look at The Data Compression Book
What kind of data are you compressing? How is it distributed? Is it ordered in any way? All of these things can affect how well it compresses, and perhaps allow you to convert the data in to something more easily compressed, or simply smaller right out the gate.
Data compress works poorly on "random" data. If your data is within a smaller range, you may well be able to leverage that.
In truth, you should simply try running any of the common algorithms and see if the data is "compressed enough". If not, and you know more about the data than can be "intuited" by the compression algorithms, you should leverage that information.
An example is say that your data is not just Lat's and Long's, but they're assumed to be "close" to each other. Then you could probably store an "origin" Lat and Long, and the rest can be differential. Perhaps these differences are small enough to fit in to a single, signed byte.
That's just a simple example of things you can do with knowledge of the data vs what some generic algorithm may not be able to figure out.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With