Compressing a vector of positive integers (int32) that have a specific order

I'm trying to compress long vectors (their size ranges from 1 to 100 million elements). The vectors contain non-negative integers with values ranging from 0 to 1 million or 100 million (depending on the vector size). Hence, I'm using 32-bit integers to hold the large numbers, but that consumes too much storage. The vectors have the following characteristic features:

1. All values are non-negative integers. Their range grows as the vector size grows.
2. Values are generally increasing, but smaller numbers intervene frequently (see the figure below).
3. None of the values before a specific index are larger than that index (indices start at zero). For instance, none of the values that occur before index 6 are larger than 6. However, smaller values may repeat after that index. This holds true for the entire array.
4. I'm usually dealing with very long arrays. Hence, once the array length passes 1 million elements, the upcoming numbers are mostly large numbers mixed with previously seen, recurring numbers. Smaller numbers usually recur more often than larger numbers. New, larger numbers are added to the array as you pass through it.
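
To make the third property concrete, here is a minimal sketch of a check for it. It assumes one particular reading of the property: the value at zero-based index i is never larger than i. The helper name `within_index_bound` is made up for illustration and is not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns true if every value is non-negative and
// no larger than its zero-based index (one reading of the third property).
bool within_index_bound(const std::vector<std::int32_t>& v) {
    for (std::size_t i = 0; i < v.size(); ++i) {
        if (v[i] < 0 || static_cast<std::size_t>(v[i]) > i)
            return false;
    }
    return true;
}
```

If the real data has initial padding before the first 0, the check would need to be applied to the padded vector as stored, since the bound depends on absolute positions.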

Here is a sample of the values in the array: {initial padding..., 0, 1, 2, 3, 4, 5, 6, 4, 7, 4, 8, 9, 1, 10, ... later..., 1110, 11, 1597, 1545, 1392, 326, 1371, 1788, 541,...}

Here is a plot of a part of the vector:

What do I want? Because I'm using 32-bit integers, a lot of memory is wasted, since the smaller numbers that repeat could be represented with fewer than 32 bits. I want to compress this vector as much as possible to save memory (ideally by a factor of 3, because only a reduction of that amount or more will meet our needs). What is the best compression algorithm to achieve that? Or is there a way to take advantage of the array's characteristic features described above to reversibly convert the numbers in that array to 8-bit integers?

Things that I have tried or considered:

Delta encoding: This doesn't work here because the vector is not always increasing.
Huffman coding: Does not seem to help here since the range of unique numbers in the array is quite large, hence, the encoding table will be a large overhead.
Variable-length integer encoding, i.e. using 8-bit integers for smaller numbers, 16-bit integers for larger ones, and so on. This reduced the vector to 0.7 times its original size (not satisfactory, since it doesn't take advantage of the specific characteristics described above).
I'm not sure whether the method described at the following link is applicable to my data: http://ygdes.com/ddj-3r/ddj-3r_compact.html I don't quite understand the method, but it encourages me to try similar things, because I think there is some order in the data that can be exploited. For example, I tried to reassign any number n larger than 255 to n-255 so that I can keep the integers in the 8-bit range, because I know that no number is larger than 255 before that index. However, I'm not able to distinguish the reassigned numbers from the repeated numbers, so this idea doesn't work without some further trick to reverse the reassignments.

Here is the link to the first 24000 elements of the data for those interested: data

Any advice or suggestions are deeply appreciated. Thanks a lot in advance.

Edit1:

Here is a plot of the data after delta encoding. As you can see, it doesn't reduce the range! delta encoded

Edit2:

I was hoping that I could find a pattern in the data that allows me to reversibly change the 32-bit vector to a single 8-bit vector, but this seems very unlikely. I have tried to decompose the 32-bit vector into 4 x 8-bit vectors, hoping that the decomposed vectors lend themselves better to compression. Below are plots for the 4 vectors. Now their ranges are from 0-255. What I did was to recursively divide each element in the vectors by 255 and store the remainder into another vector. To reconstruct the original array all I need to do is: ( ( (vec4*255) + vec3 )*255 + vec2 ) *255 + vec1...

decomposed arrays

As you can see, the last vector is all zeros for the currently shown length of the data. In fact, it should be zeros all the way to the 2^24th element. This would be a 25% reduction if my total vector length were less than 16 million elements, but since I'm dealing with much longer vectors this has a much smaller impact. More importantly, the third vector also seems to have some compressible features, as its values increase by 1 after each 65,535 steps. It does seem that now I can benefit from Huffman coding or variable bit encoding as suggested. Any suggestions that allow me to maximally compress this data are deeply appreciated. Here I attached a bigger sample of the data if anyone is interested:

https://drive.google.com/file/d/10wO3-1j3NkQbaKTcr0nl55bOH9P-G1Uu/view?usp=sharing
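
For readers who want to experiment with the decomposition from Edit2, here is a minimal sketch of a byte-plane split. Note that it uses base 256 (shifts by 8 bits) rather than the division by 255 described above, which is a deliberate substitution so that each plane covers the full 0-255 range. The names `Planes`, `split_planes` and `join_planes` are invented for this example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Four byte planes: p[0] holds the least significant byte of every value.
using Planes = std::array<std::vector<std::uint8_t>, 4>;

// Split each 32-bit value into four bytes (base 256, an assumption that
// differs from the base-255 division used in the post).
Planes split_planes(const std::vector<std::uint32_t>& v) {
    Planes p;
    for (auto& plane : p)
        plane.reserve(v.size());
    for (std::uint32_t x : v)
        for (int b = 0; b < 4; ++b)
            p[b].push_back(static_cast<std::uint8_t>(x >> (8 * b)));
    return p;
}

// Rebuild the original values from the four planes.
std::vector<std::uint32_t> join_planes(const Planes& p) {
    assert(p[0].size() == p[1].size() && p[1].size() == p[2].size() &&
           p[2].size() == p[3].size());
    std::vector<std::uint32_t> v(p[0].size(), 0);
    for (std::size_t i = 0; i < v.size(); ++i)
        for (int b = 0; b < 4; ++b)
            v[i] |= static_cast<std::uint32_t>(p[b][i]) << (8 * b);
    return v;
}
```

With this base, the top plane stays zero for every value below 2^24, which is the effect described for the last vector above.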

Edit3:

I'm really thankful for all the answers given. I've learnt a lot from them. For those of you who are interested in tinkering with a larger set of the data, the following link has 11 million elements of a similar dataset (zipped, 33 MB):

https://drive.google.com/file/d/1Aohfu6II6OdN-CqnDll7DeHPgEDLMPjP/view

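Once you unzip the data, you can use the following C++ snippet to read the data into a vector<int32_t> (it needs <fstream>, <iterator>, <vector>, <algorithm> and <cstdint>):

    const char* path = "path_to/compression_int32.txt";  // forward slash: "\c" is not a valid escape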
    std::vector<int32_t> newVector{};
    std::ifstream ifs(path, std::ios::in | std::ifstream::binary);
    std::istream_iterator<int32_t> iter{ ifs };
    std::istream_iterator<int32_t> end{};
    std::copy(iter, end, std::back_inserter(newVector));

asked Jun 11 '21 20:06

user3134575

1 Answer

It's easy to get better than a factor of two compression on your example data by using property 3, where I have taken property 3 to mean that every value must be less than its index, with the indices starting at 1. Simply use ceiling(log₂(i)) bits to store the number at index i (where i starts at 1). For your first example with 24,977 values, that compresses it to 43% of the size of the vector using 32-bit integers.
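
A minimal sketch of that variable-width packing follows. It is not the code from this answer, just one way the idea could be written: with zero-based index k, the value is assumed to lie in 0..k and is stored in ceiling(log₂(k+1)) bits, least significant bits first. The names `width_for_index`, `pack_variable` and `unpack_variable` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bits needed for a value in 0..k, i.e. ceiling(log2(k + 1)).
static int width_for_index(std::size_t k) {
    int b = 0;
    while (b < 63 && (std::uint64_t{1} << b) < static_cast<std::uint64_t>(k) + 1)
        ++b;
    return b;
}

// Pack v, where v[k] <= k is assumed, using a growing number of bits.
std::vector<std::uint8_t> pack_variable(const std::vector<std::uint32_t>& v) {
    std::vector<std::uint8_t> out;
    std::uint64_t buf = 0;  // pending bits, least significant first
    int bits = 0;           // number of pending bits (0..7 between values)
    for (std::size_t k = 0; k < v.size(); ++k) {
        assert(v[k] <= k);  // the assumed property
        buf |= static_cast<std::uint64_t>(v[k]) << bits;
        bits += width_for_index(k);
        while (bits >= 8) {
            out.push_back(static_cast<std::uint8_t>(buf));
            buf >>= 8;
            bits -= 8;
        }
    }
    if (bits)
        out.push_back(static_cast<std::uint8_t>(buf));
    return out;
}

// Unpack n values; n has to be stored separately by the caller.
std::vector<std::uint32_t> unpack_variable(const std::vector<std::uint8_t>& in,
                                           std::size_t n) {
    std::vector<std::uint32_t> v;
    v.reserve(n);
    std::uint64_t buf = 0;
    int bits = 0;
    std::size_t off = 0;
    for (std::size_t k = 0; k < n; ++k) {
        int w = width_for_index(k);
        while (bits < w) {
            buf |= static_cast<std::uint64_t>(in.at(off++)) << bits;
            bits += 8;
        }
        v.push_back(static_cast<std::uint32_t>(buf & ((std::uint64_t{1} << w) - 1)));
        buf >>= w;
        bits -= w;
    }
    return v;
}
```

The decoder recomputes each width from the index alone, so no per-value length information needs to be stored.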

The number of bits required depends only on the length of the vector, n. The number of bits is:

1 - 2^{ceiling(log₂(n))} + n ceiling(log₂(n))
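
As a quick sanity check of that closed form, the sketch below compares it against a direct sum of ceiling(log₂(i)) over i = 1..n. The function names are made up for this example, and integer arithmetic is used throughout to avoid floating-point rounding.

```cpp
#include <cassert>
#include <cstdint>

// ceiling(log2(i)) for i >= 1, using integer arithmetic only.
int ceil_log2(std::uint64_t i) {
    assert(i >= 1);
    int b = 0;
    while (b < 64 && (std::uint64_t{1} << b) < i)
        ++b;
    return b;
}

// Total bits by summing ceiling(log2(i)) over i = 1..n.
std::uint64_t variable_bits_sum(std::uint64_t n) {
    std::uint64_t total = 0;
    for (std::uint64_t i = 1; i <= n; ++i)
        total += static_cast<std::uint64_t>(ceil_log2(i));
    return total;
}

// Closed form from the text: 1 - 2^c + n*c, with c = ceiling(log2(n)).
std::uint64_t variable_bits_closed(std::uint64_t n) {
    int c = ceil_log2(n);
    assert(c < 64);
    return n * static_cast<std::uint64_t>(c) + 1 - (std::uint64_t{1} << c);
}
```

For n = 24,977 both functions give 341,888 bits, which is about 42.8% of the 32n = 799,264 bits used by the 32-bit representation, in line with the 43% figure above.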

As noted by Falk Hüffner, a simpler approach would be to use a fixed number of bits, ceiling(log₂(n)), for all values. A variable number of bits will always take less space than that, but not much less for large n.

If it is common to have a run of zeros at the start, then compress those with a count. There are only a handful of runs of two or three numbers in the remainder, so run-length encoding won't help except for that initial run of zeros.

Another 2% or so (for large sets) could be shaved off using an arithmetic coding approach, considering each value at index k (indices starting at zero) to be a base k+1 digit of a very large integer. That would take ceiling(log₂(n!)) bits.
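
To illustrate the "base k+1 digit" idea on a small scale, here is a sketch that folds up to 20 values into a single 64-bit integer and recovers them again. It is only an illustration under the stated assumption that the value at zero-based index k lies in 0..k. The 20-value limit comes from 20! being the largest factorial that fits in 64 bits, and the function names are invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Encode values with v[k] in 0..k as one mixed-radix integer, where the
// digit at index k has base k + 1 and therefore weight k!.
std::uint64_t encode_mixed_radix(const std::vector<std::uint32_t>& v) {
    assert(v.size() <= 20);             // 20! < 2^64 < 21!
    std::uint64_t code = 0, weight = 1; // weight == k! at index k
    for (std::size_t k = 0; k < v.size(); ++k) {
        assert(v[k] <= k);              // the assumed property
        code += weight * v[k];
        weight *= k + 1;
    }
    return code;
}

// Recover n values by peeling off digits of base 1, 2, 3, ...
std::vector<std::uint32_t> decode_mixed_radix(std::uint64_t code, std::size_t n) {
    std::vector<std::uint32_t> v;
    v.reserve(n);
    for (std::size_t k = 0; k < n; ++k) {
        v.push_back(static_cast<std::uint32_t>(code % (k + 1)));
        code /= k + 1;
    }
    return v;
}
```

Because n values encode to a number below n!, no bit patterns are wasted inside the integer. The only loss is the rounding up to a whole number of bits at the end.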

Here is a plot of the compression ratios of the arithmetic coding, variable bits per sample coding, and fixed bits per sample coding, all ratioed to a representation with 32 bits for every sample (the sequence length is on a log scale):

arithmetic better than variable better than fixed

The arithmetic approach requires multiplication and division on integers the length of the compressed data, which is monumentally slow for large vectors. The code below limits the size of the integers to 64 bits, at some cost to the compression ratio, in exchange for it being very fast. This code will give compression ratios about 0.2% to 0.7% more than arithmetic in the plot above, well below variable bits. The data vector must have the property that each value is non-negative and that each value is less than its position (positions starting at one). The compression effectiveness depends only on that property, plus a small reduction if there is an initial run of zeros. There appears to be a bit more redundancy in the provided examples that this compression approach does not exploit.

#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Append val, as a variable-length integer, to comp. val must be non-negative.
template <typename T>
void write_varint(T val, std::vector<uint8_t>& comp) {
    while (val > 0x7f) {
        comp.push_back(val & 0x7f);
        val >>= 7;
    }
    comp.push_back(val | 0x80);
}

// Return the variable-length integer at offset off in comp, updating off to
// point after the integer.
template <typename T>
T read_varint(std::vector<uint8_t> const& comp, size_t& off) {
    T val = 0, next;
    int shift = 0;
    for (;;) {
        next = comp.at(off++);
        if (next > 0x7f)
            break;
        val |= next << shift;
        shift += 7;
    }
    val |= (next & 0x7f) << shift;
    return val;
}

// Given the starting index i >= 1, find the optimal number of values to code
// into 64 bits or less, or up through index n-1, whichever comes first.
// Optimal is defined as the least amount of entropy lost by representing the
// group in an integral number of bits, divided by the number of bits. Return
// the optimal number of values in num, and the number of bits needed to hold
// an integer representing that group in len.
static void group_ar64(size_t i, size_t n, size_t& num, int& len) {
    // Analyze all of the permitted groups, starting at index i.
    double min = 1.;
    uint64_t k = 1;                 // integer range is 0..k-1
    auto j = i + 1;
    do {
        k *= j;
        auto e = std::log2(k);      // entropy of k possible integers
        int b = std::ceil(e);       // number of bits to hold 0..k-1
        auto loss = (b - e) / b;    // unused entropy per bit
        if (loss < min) {
            num = j - i;            // best number of values so far
            len = b;                // bit length for that number
            if (loss == 0.)
                break;              // not going to get any better
            min = loss;
        }
    } while (j < n && k <= (uint64_t)-1 / ++j);
}

// Compress the data arithmetically coded as an incrementing base integer, but
// with a 64-bit limit on each integer. This puts values into groups that each
// fit in 64 bits, with the least amount of wasted entropy. Also compress the
// initial run of zeros into a count.
template <typename T>
std::vector<uint8_t> compress_ar64(std::vector<T> const& data) {
    // Resulting compressed data vector.
    std::vector<uint8_t> comp;

    // Start with number of values to make the stream self-terminating.
    write_varint(data.size(), comp);
    if (data.size() == 0)
        return comp;

    // Run-length code the initial run of zeros. Write the number of contiguous
    // zeros after the first one.
    size_t i = 1;
    while (i < data.size() && data[i] == 0)
        i++;
    write_varint(i - 1, comp);

    // Compress the data into variable-base integers starting at index i, where
    // each integer fits into 64 bits.
    unsigned buf = 0;       // output bit buffer
    int bits = 0;           // number of bits in buf (0..7)
    while (i < data.size()) {
        // Find the optimal number of values to code, starting at index i.
        size_t num;  int len;
        group_ar64(i, data.size(), num, len);

        // Code num values.
        uint64_t code = 0;
        size_t k = 1;
        do {
            code += k * data[i++];
            k *= i;
        } while (--num);

        // Write code using len bits, least-significant bits first.
        if (bits) {
            buf |= (unsigned)(code << bits) & 0xff;
            if (bits + len < 8) {
                // The group does not fill the pending byte, so keep it in
                // the buffer instead of emitting a byte early.
                bits += len;
                continue;
            }
            comp.push_back(buf);
            code >>= 8 - bits;
            len -= 8 - bits;
        }
        while (len > 7) {
            comp.push_back(code);
            code >>= 8;
            len -= 8;
        }
        buf = code;
        bits = len;
    }
    if (bits)
        comp.push_back(buf);
    return comp;
}

// Decompress the result of compress_ar64(), returning the original values.
// Start decompression at offset off in comp. When done, off is updated to
// point just after the compressed data.
template <typename T>
std::vector<T> expand_ar64(std::vector<uint8_t> const& comp, size_t& off) {
    // Will contain the uncompressed data to return.
    std::vector<T> data;

    // Get the number of values.
    auto vals = read_varint<size_t>(comp, off);
    if (vals == 0)
        return data;

    // Get the number of zeros after the first one, and write all of them.
    auto run = read_varint<size_t>(comp, off) + 1;
    auto i = run;
    do {
        data.push_back(0);
    } while (--run);

    // Extract the values from the compressed data starting at index i.
    unsigned buf = 0;       // input bit buffer
    int bits = 0;           // number of bits in buf (0..7)
    while (i < vals) {
        // Find the optimal number of values to code, starting at index i. This
        // simply repeats the same calculation that was done when compressing.
        size_t num;  int len;
        group_ar64(i, vals, num, len);

        // Read len bits into code.
        uint64_t code;
        if (len <= bits) {
            // The whole group is already in the bit buffer, so no new byte
            // is read.
            code = buf & ((1u << len) - 1);
            buf >>= len;
            bits -= len;
        } else {
            code = buf;
            while (bits + 8 < len) {
                code |= (uint64_t)comp.at(off++) << bits;
                bits += 8;
            }
            len -= bits;                    // bits to pull from last byte (1..8)
            uint64_t last = comp.at(off++); // last byte
            code |= (last & ((1u << len) - 1)) << bits;
            buf = last >> len;              // save remaining bits in buffer
            bits = 8 - len;
        }

        // Extract num values from code.
        do {
            i++;
            data.push_back(code % i);
            code /= i;
        } while (--num);
    }

    // Return the uncompressed data.
    return data;
}

answered Oct 24 '22 08:10

Mark Adler

Tags:

algorithm

integer

bit-manipulation

compression
