Compressing unix timestamps with microsecond accuracy

I have a file that consists of a sequence of real-time Unix timestamps with microsecond accuracy, i.e. the timestamps can never decrease. All the timestamps that need to be coded/decoded are from the same day. A sample entry in the file might be something like 1364281200.078739, which corresponds to 1364281200078739 usecs since the epoch. The data is unevenly spaced and bounded.

I need to achieve a compression of around 10 bits/timestamp. Currently I am able to compress to an average of 31 bits/timestamp by calculating the difference between consecutive timestamps. How can I improve further?

Edit:

We are calculating the degree of compression as (size of the encoded file in bytes × 8) / (number of timestamps), i.e. bits per timestamp. I split each timestamp into two parts, before and after the '.'. The integer part is nearly constant and the maximum difference between two integer parts is 32, so I encoded it using 0-8 bits. The fractional part is quite random, so I ignored the leading bits and wrote it to the file using 0-21 bits (since it can be at most 999999). But the size of my encoded file comes out to 4007674 bytes, hence a compression of 71.05 bits/timestamp. I also write a '.' and a space between two timestamps so that I can decode them later. How can I improve on the size of my encoded file?

Here is a link to a partial data set: http://pastebin.com/QBs9Bqv0

Here is a link to the differential timestamp values in microseconds: http://pastebin.com/3QJk1NDV. The maximum difference between timestamps is 32594136 microseconds.
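
For reference, a minimal sketch of that differencing step, assuming the timestamps sit in a file named timestamps.txt with one value per line (the file name and helper code are illustrative, not part of the original post):

#include <cstdint>
#include <fstream>
#include <iostream>

int main() {
    std::ifstream in("timestamps.txt");   // hypothetical input, e.g. "1364281200.078739" per line
    long double seconds;
    uint64_t prev = 0, maxdelta = 0;
    bool first = true;
    while (in >> seconds) {
        // convert to whole microseconds; timestamps never decrease, so deltas are non-negative
        uint64_t usec = (uint64_t)(seconds * 1000000.0L + 0.5L);
        if (!first && usec - prev > maxdelta) maxdelta = usec - prev;
        prev = usec;
        first = false;
    }
    std::cout << "largest delta: " << maxdelta << " microseconds\n";
    return 0;
}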

asked Dec 24 '22 by learner

2 Answers

If you take the interval between every timestamp and the previous one and express it in microseconds (i.e. as integers), the distribution of values per bit depth in your sample file is:

[chart: interval bit-depth distribution]

So 52.285% of the values are either 0 or 1, there are only a handful of other values below 64 (2 to 6 bits), 27.59% of the values need 7 to 12 bits, there's a fairly even distribution of around 2.1% per bit up to 20 bits, and only 3% need more than 20 bits, with a maximum of 25 bits. Looking at the data, it's also obvious that there are many sequences of up to 6 consecutive zeros.
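
Such a bit-depth histogram can be computed from the integer intervals with something like the following sketch (the helper names, and the assumption that the intervals are already in a vector, are illustrative):

#include <cstdint>
#include <cstdio>
#include <vector>

// number of bits needed to represent v (0 and 1 both count as 1 bit)
static int bit_depth(uint32_t v) {
    int bits = 1;
    while (v >>= 1) ++bits;
    return bits;
}

// print what fraction of the intervals falls into each bit depth
void print_histogram(const std::vector<uint32_t>& intervals) {
    int count[33] = {0};
    for (uint32_t v : intervals) ++count[bit_depth(v)];
    for (int b = 1; b <= 32; b++)
        if (count[b])
            std::printf("%2d bits: %6.3f%%\n", b, 100.0 * count[b] / intervals.size());
}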

These observations gave me the idea of using a variable bit-size per value, something like this:

00 0xxxxx                             0 (xxxxx is the number of consecutive zeros)
00 1xxxxx                             1 (xxxxx is the number of consecutive ones)
01 xxxxxx xxxxxxxx                    2-14 bit values
10 xxxxxx xxxxxxxx xxxxxxxx           15-22 bit values
11 xxxxxx xxxxxxxx xxxxxxxx xxxxxxxx  23-30 bit values

A quick test showed that this resulted in a compression rate of 13.78 bits per timestamp, which isn't quite the 10 bits you were aiming for, but not a bad start for a simple scheme.
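
For illustration only (the helper name is mine), the byte-count selection in this first scheme boils down to:

#include <cstdint>

// Runs of consecutive 0/1 intervals are handled separately (one "00"-prefixed byte
// per run); this picks the form for a larger interval under the simple scheme above.
int encoded_bytes(uint32_t interval) {     // interval >= 2
    if (interval < (1u << 14)) return 2;   // "01" prefix + 14 value bits
    if (interval < (1u << 22)) return 3;   // "10" prefix + 22 value bits
    return 4;                              // "11" prefix + 30 value bits
}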


After analysing the sample data some more, I observed that there are a lot of short sequences of consecutive 0's and 1's, like 0 1 0, so I replaced the 1-byte scheme with this one:

00xxxxxx    00     = identifies a one-byte value
            xxxxxx = index in the sequence table

The table of sequences:

index ~ seq    index ~ seq    index ~ seq    index ~ seq    index ~ seq     index ~ seq
  0      0       2     00       6     000     14    0000     30   00000     62   000000
  1      1       3     01       7     001     15    0001     31   00001     63   000001
                 4     10       8     010     16    0010     32   00010    
                 5     11         ...            ...           ...
                               11     101     27    1101     59   11101    
                               12     110     28    1110     60   11110    
                               13     111     29    1111     61   11111    

For the example file with 451,210 timestamps, this brings down the encoded file size to 676,418 bytes, or 11.99 bits per timestamp.

Testing the above method revealed that there were 98,578 single zeros and 31,271 single ones between larger intervals. So I tried using 1 bit of each larger interval to store whether it was followed by a zero, which decreased the encoded size to 592,315 bytes. And when I used 2 bits to store whether larger intervals were followed by 0, 1 or 00 (the most common sequence), the encoded size decreased to 564,034 bytes, or 10.0004 bits per timestamp.
I then changed to storing the single 0's and 1's with the following large interval instead of the preceding one (purely for reasons of code simplicity) and found that this resulted in a file size of 563,884 bytes, or 9.997722 bits per timestamp!

So the complete method is:

Store the first timestamp (8 bytes), then store the intervals as either: 

00 iiiiii                             sequences of up to 5 (or 6) zeros or ones
01 XXxxxx xxxxxxxx                     2-12 bit values (2 ~ 4,095)
10 XXxxxx xxxxxxxx xxxxxxxx           13-20 bit values (4,096 ~ 1,048,575)
11 XXxxxx xxxxxxxx xxxxxxxx xxxxxxxx  21-28 bit values (1,048,576 ~ 268,435,455)

iiiiii = index in sequence table (see above)
XX     = preceded by a zero (if XX=1), a one (if XX=2) or two zeros (if XX=3)
xxx... = 12, 20 or 28 bit value
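
As a side note, the sequence-table index doesn't need an actual lookup table; it can be computed directly from the run length and bit pattern, which is what the encoder below does with seq += (1 << len) - 2. A small sketch of that mapping (the function name is mine):

#include <cstdint>

// Map a run of `len` 0/1 intervals (bit pattern `seq`, oldest interval in the most
// significant bit) to its index in the sequence table above.
uint8_t sequence_index(uint8_t seq, uint8_t len) {   // len in 1..6, seq < (1 << len)
    return seq + (1u << len) - 2;                    // e.g. len = 3, pattern 010 -> index 8
}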

Example of an encoder:

#include <stdint.h>
#include <iostream>
#include <fstream>
using namespace std;

void write_timestamp(ofstream& ofile, uint64_t timestamp) {    // big-endian
    uint8_t bytes[8];
    for (int i = 7; i >= 0; i--, timestamp >>= 8) bytes[i] = timestamp;
    ofile.write((char*) bytes, 8);
}

int main() {
    ifstream ifile ("timestamps.txt");
    if (! ifile.is_open()) return 1;
    ofstream ofile ("output.bin", ios::trunc | ios::binary);
    if (! ofile.is_open()) return 2;

    long double seconds;
    uint64_t timestamp;

    if (ifile >> seconds) {
        timestamp = seconds * 1000000;
        write_timestamp(ofile, timestamp);
    }

    while (! ifile.eof()) {
        uint8_t bytesize = 0, len = 0, seq = 0, bytes[4];
        uint32_t interval;

        while (bytesize == 0 && ifile >> seconds) {
            interval = seconds * 1000000 - timestamp;
            timestamp += interval;

            if (interval < 2) {            // interval is 0 or 1: extend the current run
                seq <<= 1; seq |= interval;
                // flush after 5 intervals, unless all are zeros, in which case a 6th fits
                if (++len == 5 && seq > 0 || len == 6) bytesize = 1;
            } else {
                // find how many extra bytes are needed (the value fits in bytesize*8+4 bits)
                while (interval >> ++bytesize * 8 + 4);
                for (uint8_t i = 0; i <= bytesize; i++) {
                    bytes[i] = interval >> (bytesize - i) * 8;   // big-endian payload
                }
                bytes[0] |= (bytesize++ << 6);   // length code in the top two bits of byte 0
            }
        }
        if (len) {   // a pending run of 0/1 intervals still needs to be written
            if (bytesize > 1 && (len == 1 || len == 2 && seq == 0)) {
                // a single 0, a single 1 or "00" before a larger interval:
                // embed it in the two spare (XX) bits of the first byte
                bytes[0] |= (2 * len + seq - 1) << 4;
            } else {
                seq += (1 << len) - 2;               // (len, seq) -> sequence-table index
                ofile.write((char*) &seq, 1);
            }
        }
        if (bytesize > 1) ofile.write((char*) bytes, bytesize);
    }
    ifile.close();
    ofile.close();
    return 0;
}

Example of a decoder:

#include <stdint.h>
#include <iostream>
#include <fstream>
using namespace std;

uint64_t read_timestamp(ifstream& ifile) {    // big-endian
    uint64_t timestamp = 0;
    uint8_t byte;
    for (uint8_t i = 0; i < 8; i++) {
        ifile.read((char*) &byte, 1);
        if (ifile.fail()) return 0;
        timestamp <<= 8; timestamp |= byte;
    }
    return timestamp;
}

uint8_t read_interval(ifstream& ifile, uint8_t *bytes) {
    uint8_t bytesize = 1;
    ifile.read((char*) bytes, 1);
    if (ifile.fail()) return 0;
    bytesize += bytes[0] >> 6;     // length code: number of extra bytes to read
    for (uint8_t i = 1; i < bytesize; i++) {
        ifile.read((char*) bytes + i, 1);
        if (ifile.fail()) return 0;
    }
    return bytesize;
}

void write_seconds(ofstream& ofile, uint64_t timestamp) {
    long double seconds = (long double) timestamp / 1000000;
    ofile << seconds << "\n";
}

// Decode a one-byte sequence-table index into its run of 0/1 intervals, writing one
// timestamp per interval; returns the total interval covered by the run.
uint8_t write_sequence(ofstream& ofile, uint8_t seq, uint64_t timestamp) {
    uint8_t interval = 0, len = 1, offset = 1;
    while (seq >= (offset <<= 1)) {
        seq -= offset;
        ++len;
    }
    while (len--) {
        interval += (seq >> len) & 1;
        write_seconds(ofile, timestamp + interval);
    }
    return interval;
}

int main() {
    ifstream ifile ("timestamps.bin", ios::binary);
    if (! ifile.is_open()) return 1;
    ofstream ofile ("output.txt", ios::trunc);
    if (! ofile.is_open()) return 2;
    ofile.precision(6); ofile << std::fixed;

    uint64_t timestamp = read_timestamp(ifile);
    if (timestamp) write_seconds(ofile, timestamp);

    while (! ifile.eof()) {
        uint8_t bytes[4], seq = 0, bytesize = read_interval(ifile, bytes);
        uint32_t interval;

        if (bytesize == 1) {
            timestamp += write_sequence(ofile, bytes[0], timestamp);
        }
        else if (bytesize > 1) {
            seq = (bytes[0] >> 4) & 3;    // XX bits: a run of 0, 1 or 00 precedes this interval
            if (seq) timestamp += write_sequence(ofile, seq - 1, timestamp);
            interval = bytes[0] & 15;     // low nibble holds the top 4 bits of the value
            for (uint8_t i = 1; i < bytesize; i++) {
                interval <<= 8; interval += bytes[i];
            }
            timestamp += interval;
            write_seconds(ofile, timestamp);
        }
    }
    ifile.close();
    ofile.close();
    return 0;
}

Because of a long double output bug in the MinGW/gcc 4.8.1 compiler I'm using, I had to use the following workaround (it shouldn't be necessary with other compilers):

void write_seconds(ofstream& ofile, uint64_t timestamp) {
    long double seconds = (long double) timestamp / 1000000;
    // print the leading "1" separately and the remainder as a double,
    // so the long double value is never streamed directly
    ofile << "1" << (double) (seconds - 1000000000) << "\n";
}

Note to future readers: this method is based on analysis of an example data file; it will not give the same compression rate if your data is different.


If you need lossless compression at microsecond accuracy, then note that 10 bits will let you count to 1024.

If the timing of your events is random, and you actually need the microsecond accuracy that you spec'ed, this means your differential timestamps cannot vary by more than about 1 millisecond without exceeding your 10-bit/event budget.

Based on a quick glance at your data, you probably can't quite make 10 bits/timestamp. However, your differencing is the right first step, and you can do better than 31 bits: I would gather statistics on a sample dataset and choose a binary prefix encoding that reflects that distribution.

You should make sure your code has room to encode large gaps if necessary, so consider basing it on a universal code.
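
For example, here is a minimal sketch of Elias gamma coding, one common universal code (the choice of gamma is mine; the answer only recommends a universal code in general):

#include <cstdint>
#include <string>

// Elias gamma code for n >= 1: (bit_length(n) - 1) zero bits, then n in binary.
// Intervals would need to be shifted by +1 so that a delta of 0 maps to 1.
std::string elias_gamma(uint32_t n) {
    int bits = 0;
    for (uint32_t v = n; v; v >>= 1) ++bits;     // bit length of n
    std::string out(bits - 1, '0');              // unary length prefix
    for (int i = bits - 1; i >= 0; i--)          // value, most significant bit first
        out += ((n >> i) & 1) ? '1' : '0';
    return out;
}
// elias_gamma(1) == "1", elias_gamma(4) == "00100", elias_gamma(13) == "0001101"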

answered Jan 10 '23 by comingstorm