Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Compressing a set of large integers

I have a set of integers, for which I would like to have the most compact representation. I have the following constraints/features:

  • it is set, or in other words, a list of unique integers in which the order does not matter
  • the size of the set L is relatively small (typically 1000 elements)
  • the integers follow a uniform distribution between 0 and N-1, with N relatively large (say 2^32)
  • the access to the elements of the compressed set are random, but it is fine if the decompression procedure is not so fast
  • the compression should be lossless, obviously

I have tried a few things, but I am not satisfied by the results, and I am somehow convinced that a better solution exists:

  • delta encoding (sorting, then encoding differences), or also sorting, then encoding differences between the i-th element and i*N/L. Both give reasonable results, but not great, probably because of the typical sizes of N and L. Huffman coding the deltas does not help because they are typically big.
  • recursive range reduction ( http://ygdes.com/ddj-3r/ddj-3r_compact.html ). This seems smart, but works best on exponentially decreasing integers, which is definitely not the case here.
  • a few discussions here on stackoverflow are similar but not completely equivalent to my problem ( C Library for compressing sequential positive integers, Compress sorted integers )

I would be glad to hear any ideas you might have. Thanks in advance!

UPDATE:

It turns out that delta encoding seems to approximate the optimal solution. This might be different for other other distributions of the elements in the set.

like image 757
doc Avatar asked Aug 28 '12 10:08

doc


People also ask

How do I compress a large number?

The best way to compress a large number is to find its factors and either make it into a square root, or make it into a Base number with an exponent…. You have a stream of bytes to send over the net. But if you compress it, you will be charged less for less data usage.

How is it possible to compress data?

Data compression is a process in which the size of a file is reduced by re-encoding the file data to use fewer bits of storage than the original file. A fundamental component of data compression is that the original file can be transferred or stored, recreated, and then used later (with a process called decompression).

How does deflate compression work?

The format can be implemented readily in a manner not covered by patents. The DEFLATE compressed data format consists of a series of blocks, corresponding to successive blocks of input data. Each block is compressed using a combination of the LZ77 algorithm and Huffman coding .

How does Zip file compression work?

ZIP files work in much the same way as a standard folder on your computer. They contain data and files together in one place. But with zipped files, the contents are compressed, which reduces the amount of data used by your computer. Another way to describe ZIP files is as an archive.


1 Answers

You can get an idea of the best you can do by counting. (I wish stackoverflow allowed TeX equations like math.stackexchange. Anyway ...)

ceiling(log(Combination(2^32,1000)) / (8 * log(2))) = 2934

So if, as you say, the choices are uniformly distributed, the best compression you could hope for on average for that particular case is 2934 bytes. The best ratio is 73.35% of the unencoded representation of 4000 bytes.

Combination(2^32,1000) is simply the total number of possible inputs to the compression algorithm. If those are uniformly distributed, then the optimal coding is one giant integer that identifies each possible input by an index. Each giant integer value uniquely identifies one of the inputs. Imagine looking up the input by index in a giant table. ceiling(log(Combination(2^32,1000)) / log(2)) is how many bits you need for that index integer.

Update:

I found a way to get close to the theoretical best using off-the-shelf compression tools. I sort, apply delta coding, and subtract one from that (since the delta between successive distinct elements is at least one). Then the trick is that I write out all the high bytes, then the next most significant bytes, etc. The high bytes of the deltas minus one tend to be zero, so that groups a lot of zeros together, which the standard compression utilities love. Also the next set of bytes tend to be biased to low values.

For the example (1000 uniform and distinct samples from 0..2^32-1), I get an average of 3110 bytes when running that through gzip -9, and 3098 bytes through xz -9 (xz uses the same compression, LZMA, as 7zip). Those are pretty close to the theoretical best average of 2934. Also gzip has an overhead of 18 bytes, and xz has an overhead of 24 bytes, both for headers and trailers. So a fairer comparison with the theoretical best would be 3092 for gzip -9 and 3074 for xz -9. Around 5% larger than the theoretical best.

Update 2:

I implemented direct encoding of the permutations, and achieved an average of 2974 bytes, which is only a little over 1% more than the theoretical best. I used the GNU multiple precision arithmetic library to encode an index for each permutation in a giant integer. The actual code for the encoding and decoding is shown below. I added comments for the mpz_* functions where it may not be obvious from the name what arithmetic operations they're doing.

/* Recursively code the members in set[] between low and high (low and high
   themselves have already been coded).  First code the middle member 'mid'.
   Then recursively code the members between low and mid, and then between mid
   and high. */
local void combination_encode_between(mpz_t pack, mpz_t base,
                                      const unsigned long *set,
                                      int low, int high)
{
    int mid;

    /* compute the middle position -- if there is nothing between low and high,
       then return immediately (also in that case, verify that set[] is sorted
       in ascending order) */
    mid = (low + high) >> 1;
    if (mid == low) {
        assert(set[low] < set[high]);
        return;
    }

    /* code set[mid] into pack, and update base with the number of possible
       set[mid] values between set[low] and set[high] for the next coded
       member */
        /* pack += base * (set[mid] - set[low] - 1) */
    mpz_addmul_ui(pack, base, set[mid] - set[low] - 1);
        /* base *= set[high] - set[low] - 1 */
    mpz_mul_ui(base, base, set[high] - set[low] - 1);

    /* code the rest between low and high */
    combination_encode_between(pack, base, set, low, mid);
    combination_encode_between(pack, base, set, mid, high);
}

/* Encode the set of integers set[0..num-1], where each element is a unique
   integer in the range 0..max.  No value appears more than once in set[]
   (hence the name "set").  The elements of set[] must be sorted in ascending
   order. */
local void combination_encode(mpz_t pack, const unsigned long *set, int num,
                              unsigned long max)
{
    mpz_t base;

    /* handle degenerate cases and verify last member <= max -- code set[0]
       into pack as simply itself and set base to the number of possible set[0]
       values for coding the next member */
    if (num < 1) {
            /* pack = 0 */
        mpz_set_ui(pack, 0);
        return;
    }
        /* pack = set[0] */
    mpz_set_ui(pack, set[0]);
    if (num < 2) {
        assert(set[0] <= max);
        return;
    }
    assert(set[num - 1] <= max);
        /* base = max - num + 2 */
    mpz_init_set_ui(base, max - num + 2);

    /* code the last member of the set and update base with the number of
       possible last member values */
        /* pack += base * (set[num - 1] - set[0] - 1) */
    mpz_addmul_ui(pack, base, set[num - 1] - set[0] - 1);
        /* base *= max - set[0] */
    mpz_mul_ui(base, base, max - set[0]);

    /* encode the members between 0 and num - 1 */
    combination_encode_between(pack, base, set, 0, num - 1);
    mpz_clear(base);
}

/* Recursively decode the members in set[] between low and high (low and high
   themselves have already been decoded).  First decode the middle member
   'mid'. Then recursively decode the members between low and mid, and then
   between mid and high. */
local void combination_decode_between(mpz_t unpack, unsigned long *set,
                                      int low, int high)
{
    int mid;
    unsigned long rem;

    /* compute the middle position -- if there is nothing between low and high,
       then return immediately */
    mid = (low + high) >> 1;
    if (mid == low)
        return;

    /* extract set[mid] as the remainder of dividing unpack by the number of
       possible set[mid] values, update unpack with the quotient */
        /* div = set[high] - set[low] - 1, rem = unpack % div, unpack /= div */
    rem = mpz_fdiv_q_ui(unpack, unpack, set[high] - set[low] - 1);
    set[mid] = set[low] + 1 + rem;

    /* decode the rest between low and high */
    combination_decode_between(unpack, set, low, mid);
    combination_decode_between(unpack, set, mid, high);
}

/* Decode from pack the set of integers encoded by combination_encode(),
   putting the result in set[0..num-1].  max must be the same value used when
   encoding. */
local void combination_decode(const mpz_t pack, unsigned long *set, int num,
                              unsigned long max)
{
    mpz_t unpack;
    unsigned long rem;

    /* handle degnerate cases, returning the value of pack as the only element
       for num == 1 */
    if (num < 1)
        return;
    if (num < 2) {
            /* set[0] = (unsigned long)pack */
        set[0] = mpz_get_ui(pack);
        return;
    }

    /* extract set[0] as the remainder after dividing pack by the number of
       possible set[0] values, set unpack to the quotient */
    mpz_init(unpack);
        /* div = max - num + 2, set[0] = pack % div, unpack = pack / div */
    set[0] = mpz_fdiv_q_ui(unpack, pack, max - num + 2);

    /* extract the last member as the remainder after dividing by the number
       of possible values, taking into account the first member -- update
       unpack with the quotient */
        /* rem = unpack % max - set[0], unpack /= max - set[0] */
    rem = mpz_fdiv_q_ui(unpack, unpack, max - set[0]);
    set[num - 1] = set[0] + 1 + rem;

    /* decode the members between 0 and num - 1 */
    combination_decode_between(unpack, set, 0, num - 1);
    mpz_clear(unpack);
}

There are mpz_* functions for writing the number to a file and reading it back, or exporting the number to a specified format in memory, and importing it back.

like image 153
12 revs Avatar answered Sep 18 '22 14:09

12 revs