I have a file that consists of a sequence of real-time unix timestamps with microsecond accuracy, i.e. the timestamps can never decrease. All the timestamps that need to be coded/decoded are from the same day. A sample entry in the file might be something like 1364281200.078739, which corresponds to 1364281200078739 usec since the epoch. The data is unevenly spaced and bounded.
I need to achieve a compression of around 10 bits/timestamp. Currently I am able to compress to an average of 31 bits/timestamp by calculating the difference between consecutive timestamps. How can I improve further?
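A minimal sketch of that differencing step might look like this (it assumes one timestamp per line with exactly six fractional digits, and the file name timestamps.txt is just a placeholder):

#include <stdint.h>
#include <fstream>
#include <iostream>
#include <string>
using namespace std;

// Sketch only: parse each "seconds.microseconds" line into integer microseconds
// and print the difference from the previous timestamp.
int main() {
    ifstream in("timestamps.txt");                       // placeholder file name
    string line;
    uint64_t prev = 0;
    bool first = true;
    while (getline(in, line)) {
        size_t dot = line.find('.');
        uint64_t usec = stoull(line.substr(0, dot)) * 1000000
                      + stoull(line.substr(dot + 1));    // assumes 6 fractional digits
        if (! first) cout << (usec - prev) << "\n";
        prev = usec;
        first = false;
    }
    return 0;
}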
Edit:
We are calculating the degree of compression as (size of encoded file in bytes) / (number of timestamps) * 8. I split each timestamp into two parts, before the '.' and after it. The integer part is quite constant, and the maximum difference between two integer parts is 32, so I encoded it using 0-8 bits. The fractional part is quite random, so I ignored the leading zero bits and wrote it to the file using 0-21 bits (its maximum is 999999). But the size of my encoded file comes out at 4007674 bytes, i.e. 71.05 bits/timestamp. I also write a '.' and a space between two timestamps so I can decode them later. How can I improve on the size of my encoded file?
Here is the link for a partial data set - http://pastebin.com/QBs9Bqv0
Here is the link for the differential timestamp values in microseconds - http://pastebin.com/3QJk1NDV The maximum difference between timestamps is 32594136 microseconds.
If you take the interval between every timestamp and the previous one and express it in microseconds (i.e. as integers), the distribution of values per bit depth in your sample file is as follows: 52.285% of the values are either 0 or 1, there are only a handful of other values below 64 (2~6 bits), 27.59% of the values are 7~12 bits, there's a fairly even distribution of around 2.1% per bit up to 20 bits, and only 3% are above 20 bits, with a maximum of 25 bits. Looking at the data, it's also obvious that there are many sequences of up to 6 consecutive zeros.
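A tally along these lines (just a sketch; the short list of deltas here is a stand-in for the real data) is enough to produce that kind of per-bit-depth breakdown:

#include <stdint.h>
#include <iostream>
#include <vector>
using namespace std;

// Sketch: count how many interval values need each bit depth.
int main() {
    vector<uint32_t> intervals = {0, 1, 1, 300, 78739, 0};   // stand-in deltas in microseconds
    uint64_t counts[33] = {0};
    for (uint64_t v : intervals) {
        int bits = 0;
        while (v >> bits) ++bits;          // bits needed to represent v (0 counts as 0 bits)
        ++counts[bits];
    }
    for (int b = 0; b <= 32; b++)
        if (counts[b]) cout << b << "-bit values: " << counts[b] << "\n";
    return 0;
}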
These observations gave me the idea of using a variable bit-size per value, something like this:
00 0xxxxx                                0     (xxxxx is the number of consecutive zeros)
00 1xxxxx                                1     (xxxxx is the number of consecutive ones)
01 xxxxxx xxxxxxxx                       2-14 bit values
10 xxxxxx xxxxxxxx xxxxxxxx              15-22 bit values
11 xxxxxx xxxxxxxx xxxxxxxx xxxxxxxx     23-30 bit values
A quick test showed that this resulted in a compression rate of 13.78 bits per timestamp, which isn't quite the 10 bits you were aiming for, but not a bad start for a simple scheme.
After analysing the sample data some more, I observed that there are a lot of short sequences of consecutive 0's and 1's, like 0 1 0, so I replaced the 1-byte scheme with this one:
00xxxxxx
    00     = identifies a one-byte value
    xxxxxx = index in the sequence table
The table of sequences:
index ~ seq    index ~ seq    index ~ seq    index ~ seq    index ~ seq    index ~ seq
  0     0        2     00       6     000      14    0000     30   00000     62  000000
  1     1        3     01       7     001      15    0001     31   00001     63  000001
                 4     10       8     010      16    0010     32   00010
                 5     11      ...    ...      ...    ...     ...    ...
                                11    101      27    1101     59   11101
                                12    110      28    1110     60   11110
                                13    111      29    1111     61   11111
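The index of a run is simply its bits read as a binary number plus 2^len - 2, where len is the run length; the encoder further down computes exactly this with seq += (1 << len) - 2. A tiny sanity check of that mapping:

#include <stdint.h>
#include <cassert>

// Sketch: table index of a run of `len` zeros/ones whose bits form `seq`
// (index = seq + 2^len - 2, matching the table above).
uint8_t sequence_index(uint8_t seq, uint8_t len) {
    return seq + (1 << len) - 2;
}

int main() {
    assert(sequence_index(0, 1) == 0);       // "0"
    assert(sequence_index(1, 2) == 3);       // "01"
    assert(sequence_index(1, 3) == 7);       // "001"
    assert(sequence_index(13, 4) == 27);     // "1101"
    assert(sequence_index(1, 6) == 63);      // "000001"
    return 0;
}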
For the example file with 451,210 timestamps, this brings down the encoded file size to 676,418 bytes, or 11.99 bits per timestamp.
Testing the above method revealed that there were 98,578 single zeros and 31,271 single ones between larger intervals. So I tried using 1 bit of each larger interval to store whether it was followed by a zero, which decreased the encoded size to 592,315 bytes. And when I used 2 bits to store whether larger intervals were followed by 0, 1 or 00 (the most common sequence), the encoded size decreased to 564,034 bytes, or 10.0004 bits per timestamp.
I then changed to storing the single 0's and 1's with the following large interval instead of the preceding one (purely for reasons of code simplicity) and found that this resulted in a file size of 563,884 bytes, or 9.997722 bits per timestamp!
So the complete method is:
Store the first timestamp (8 bytes), then store the intervals as either:

00 iiiiii                                sequences of up to 5 (or 6) zeros or ones
01 XXxxxx xxxxxxxx                       2-12 bit values (2 ~ 4,095)
10 XXxxxx xxxxxxxx xxxxxxxx              13-20 bit values (4,096 ~ 1,048,575)
11 XXxxxx xxxxxxxx xxxxxxxx xxxxxxxx     21-28 bit values (1,048,576 ~ 268,435,455)

iiiiii = index in sequence table (see above)
XX     = preceded by a zero (if XX=1), a one (if XX=2) or two zeros (if XX=3)
xxx... = 12, 20 or 28 bit value
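As a worked illustration of this layout: an interval of 300 microseconds that is preceded by a single zero interval fits the two-byte form, with XX = 1 and the 12-bit value 000100101100, giving the two bytes 01010001 00101100, i.e. 0x51 0x2C.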
Example of an encoder:
#include <stdint.h>
#include <iostream>
#include <fstream>
using namespace std;
// write a timestamp as 8 raw bytes, big-endian
void write_timestamp(ofstream& ofile, uint64_t timestamp) {
    uint8_t bytes[8];
    for (int i = 7; i >= 0; i--, timestamp >>= 8) bytes[i] = timestamp;
    ofile.write((char*) bytes, 8);
}

int main() {
    ifstream ifile ("timestamps.txt");
    if (! ifile.is_open()) return 1;
    ofstream ofile ("output.bin", ios::trunc | ios::binary);
    if (! ofile.is_open()) return 2;

    long double seconds;
    uint64_t timestamp;

    // the first timestamp is stored uncompressed
    if (ifile >> seconds) {
        timestamp = seconds * 1000000;
        write_timestamp(ofile, timestamp);
    }

    while (! ifile.eof()) {
        uint8_t bytesize = 0, len = 0, seq = 0, bytes[4];
        uint32_t interval;

        // collect 0/1 intervals into a pending bit sequence until a larger
        // interval (or a full-length sequence) forces output
        while (bytesize == 0 && ifile >> seconds) {
            interval = seconds * 1000000 - timestamp;
            timestamp += interval;
            if (interval < 2) {
                seq <<= 1; seq |= interval;                // append the 0 or 1
                if ((++len == 5 && seq > 0) || len == 6) bytesize = 1;
            } else {
                // find how many bytes the interval needs (12, 20 or 28 value bits)
                // and store it big-endian, with the byte count in the top 2 bits
                while (interval >> (++bytesize * 8 + 4));
                for (uint8_t i = 0; i <= bytesize; i++) {
                    bytes[i] = interval >> (bytesize - i) * 8;
                }
                bytes[0] |= (bytesize++ << 6);
            }
        }

        if (len) {
            if (bytesize > 1 && (len == 1 || (len == 2 && seq == 0))) {
                // a single 0, a single 1 or "00" preceding a larger interval
                // is folded into the XX bits of that interval
                bytes[0] |= (2 * len + seq - 1) << 4;
            } else {
                // otherwise write the pending sequence as its own table-index byte
                seq += (1 << len) - 2;
                ofile.write((char*) &seq, 1);
            }
        }
        if (bytesize > 1) ofile.write((char*) bytes, bytesize);
    }
    ifile.close();
    ofile.close();
    return 0;
}
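Note that the encoder writes to output.bin while the decoder below reads from timestamps.bin, so the encoded file has to be renamed (or the file names changed) when running the two back to back.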
Example of a decoder:
#include <stdint.h>
#include <iostream>
#include <fstream>
using namespace std;
// read the uncompressed first timestamp (8 raw bytes, big-endian)
uint64_t read_timestamp(ifstream& ifile) {
    uint64_t timestamp = 0;
    uint8_t byte;
    for (uint8_t i = 0; i < 8; i++) {
        ifile.read((char*) &byte, 1);
        if (ifile.fail()) return 0;
        timestamp <<= 8; timestamp |= byte;
    }
    return timestamp;
}

// read one encoded value; the top 2 bits of the first byte give the total byte count
uint8_t read_interval(ifstream& ifile, uint8_t *bytes) {
    uint8_t bytesize = 1;
    ifile.read((char*) bytes, 1);
    if (ifile.fail()) return 0;
    bytesize += bytes[0] >> 6;
    for (uint8_t i = 1; i < bytesize; i++) {
        ifile.read((char*) bytes + i, 1);
        if (ifile.fail()) return 0;
    }
    return bytesize;
}

void write_seconds(ofstream& ofile, uint64_t timestamp) {
    long double seconds = (long double) timestamp / 1000000;
    ofile << seconds << "\n";
}

// expand a sequence-table index into its 0/1 intervals and write the timestamps;
// returns the total interval covered by the sequence
uint8_t write_sequence(ofstream& ofile, uint8_t seq, uint64_t timestamp) {
    uint8_t interval = 0, len = 1, offset = 1;
    while (seq >= (offset <<= 1)) {        // recover the run length and its bits from the index
        seq -= offset;
        ++len;
    }
    while (len--) {
        interval += (seq >> len) & 1;
        write_seconds(ofile, timestamp + interval);
    }
    return interval;
}

int main() {
    ifstream ifile ("timestamps.bin", ios::binary);
    if (! ifile.is_open()) return 1;
    ofstream ofile ("output.txt", ios::trunc);
    if (! ofile.is_open()) return 2;
    ofile.precision(6); ofile << std::fixed;

    uint64_t timestamp = read_timestamp(ifile);
    if (timestamp) write_seconds(ofile, timestamp);

    while (! ifile.eof()) {
        uint8_t bytes[4], seq = 0, bytesize = read_interval(ifile, bytes);
        uint32_t interval;
        if (bytesize == 1) {
            // a 00-prefixed byte holds a sequence-table index
            timestamp += write_sequence(ofile, bytes[0], timestamp);
        }
        else if (bytesize > 1) {
            // the XX bits may hold a single 0, a single 1 or "00" that preceded the interval
            seq = (bytes[0] >> 4) & 3;
            if (seq) timestamp += write_sequence(ofile, seq - 1, timestamp);
            interval = bytes[0] & 15;                  // 4 value bits in the first byte...
            for (uint8_t i = 1; i < bytesize; i++) {
                interval <<= 8; interval += bytes[i];  // ...then 8 per following byte
            }
            timestamp += interval;
            write_seconds(ofile, timestamp);
        }
    }
    ifile.close();
    ofile.close();
    return 0;
}
Because of a long double output bug in the MinGW/gcc 4.8.1 compiler I'm using, I had to use this workaround (it shouldn't be necessary with other compilers):
void write_seconds(ofstream& ofile, uint64_t timestamp) {
    long double seconds = (long double) timestamp / 1000000;
    ofile << "1" << (double) (seconds - 1000000000) << "\n";
}
Note to future readers: this method is based on analysis of an example data file; it will not give the same compression rate if your data is different.
If you need lossless compression at microsecond accuracy, then note that 10 bits will let you count to 1024.
If the timing of your events is random and you actually need the microsecond accuracy you specified, this means your differential timestamps cannot spread over much more than about 1 millisecond without exceeding your 10 bit/event budget.
Based on a quick glance at your data, you probably can't quite make 10 bits/timestamp. However, your differencing is the right first step, and you can do better than 31 bits -- I would do statistics on a sample dataset, and choose a binary prefix encoding that reflects that distribution.
You should make sure your code has room to encode large gaps if necessary, so consider basing it on a universal code.
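An Elias gamma code is one example of such a universal code. A minimal sketch (encoding to a string of '0'/'1' characters for clarity rather than to packed bits, and shifting the deltas by one since gamma cannot encode zero):

#include <stdint.h>
#include <iostream>
#include <string>
using namespace std;

// Sketch of an Elias gamma code: a value n >= 1 is written as
// floor(log2 n) zero bits followed by n itself in binary, MSB first.
string elias_gamma(uint64_t n) {
    int bits = 0;
    for (uint64_t t = n; t; t >>= 1) ++bits;  // number of bits in n
    string out(bits - 1, '0');                // bits-1 leading zeros
    for (int i = bits - 1; i >= 0; i--)
        out += ((n >> i) & 1) ? '1' : '0';
    return out;
}

int main() {
    cout << elias_gamma(1) << "\n";           // "1"
    cout << elias_gamma(5) << "\n";           // "00101"
    cout << elias_gamma(300 + 1) << "\n";     // a 300 microsecond delta, shifted by 1: 17 bits
    return 0;
}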