Compressing a vector of positive integers (int32) that have a specific order

I'm trying to compress long vectors (their size ranges from 1 to 100 million elements). The vectors contain positive integers with values ranging from 0 up to about 100 million (depending on the vector size). Hence, I'm using 32-bit integers to accommodate the large values, but that consumes too much storage. The vectors have the following characteristic features:

  1. All values are positive integers. Their range grows as the vector size grows.
  2. Values are increasing but smaller numbers do intervene frequently (see the figure below).
  3. None of the values before a specific index are larger than that index (Index starts at zero). For instance, none of the values that occur before the index of 6 are larger than 6. However, smaller values may repeat after that index. This holds true for the entire array.
  4. I'm usually dealing with very long arrays. Once the array length passes about 1 million elements, the upcoming values are mostly large numbers mixed with previously seen, recurring numbers. Smaller numbers recur more often than larger ones, and new, larger numbers keep appearing as you move through the array.

Here is a sample of the values in the array: {initial padding..., 0, 1, 2, 3, 4, 5, 6, 4, 7, 4, 8, 9, 1, 10, ... later..., 1110, 11, 1597, 1545, 1392, 326, 1371, 1788, 541,...}

Here is a plot of a part of the vector:

What do I want?: Because I'm using 32-bit integers, a lot of memory is wasted, since smaller numbers that could be represented with fewer than 32 bits also repeat frequently. I want to compress this vector as much as possible to save memory (ideally by a factor of 3, since only a reduction of that amount or more will meet our needs!). What is the best compression algorithm to achieve that? Or is there a way to take advantage of the array's characteristic features described above to reversibly convert the numbers in the array to 8-bit integers?

Things that I have tried or considered:

  1. Delta encoding: This doesn't work here because the vector is not monotonically increasing, so the deltas are often large (and negative) and don't shrink the range.
  2. Huffman coding: This does not seem to help, since the number of distinct values in the array is quite large; hence the encoding table would be a large overhead.
  3. Variable-length integer encoding, i.e. using 8-bit integers for smaller numbers and 16-bit for larger ones, etc. This reduced the vector to about 0.7× its original size (not satisfactory, since it doesn't take advantage of the specific characteristics described above).
  4. I'm not quite sure whether the method described in the following link is applicable to my data: http://ygdes.com/ddj-3r/ddj-3r_compact.html I don't fully understand the method, but it encourages me to try similar things, because I think there is some order in the data that can be exploited. For example, I tried to reassign any number n larger than 255 to n−255, so that I could keep the integers in the 8-bit realm (I know that no number is larger than 255 before that index). However, I can't distinguish the reassigned numbers from the repeated small numbers, so this idea doesn't work without some additional tricks to reverse the reassignments.
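For reference, item 3 above can be sketched with a common LEB128-style scheme (7 payload bits per byte plus a continuation bit); this is just one possible variable-length encoding, not necessarily the exact one I used:

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Encode v as a variable-length integer: 7 payload bits per byte,
// high bit set on every byte except the last. Small values take 1 byte.
std::vector<uint8_t> encode_varint(uint32_t v) {
    std::vector<uint8_t> out;
    while (v > 0x7f) {
        out.push_back((v & 0x7f) | 0x80);
        v >>= 7;
    }
    out.push_back(v);
    return out;
}

// Decode the bytes produced by encode_varint().
uint32_t decode_varint(const std::vector<uint8_t>& in) {
    uint32_t v = 0;
    int shift = 0;
    for (uint8_t b : in) {
        v |= (uint32_t)(b & 0x7f) << shift;
        shift += 7;
        if (!(b & 0x80))
            break;
    }
    return v;
}
```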

Here is the link to the first 24,000 elements of the data for those interested: data

Any advice or suggestions are deeply appreciated. Thanks a lot in advance.

Edit1:

Here is a plot of the data after delta encoding. As you can see, it doesn't reduce the range! delta encoded

Edit2:

I was hoping to find a pattern in the data that would let me reversibly convert the 32-bit vector into a single 8-bit vector, but this seems very unlikely. I have tried to decompose the 32-bit vector into 4 × 8-bit vectors, hoping that the decomposed vectors lend themselves better to compression. Below are plots of the 4 vectors; their ranges are now 0-255. What I did was recursively divide each element by 255 and store the remainder in another vector. To reconstruct the original array, all I need to do is: ( ( (vec4*255) + vec3 )*255 + vec2 ) *255 + vec1...
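The decomposition just described can be sketched like this (repeated division by 255, keeping the remainders as digits; this assumes every value fits in four base-255 digits, i.e. is below 255^4 ≈ 4.2 billion, which holds for 32-bit inputs):

```cpp
#include <array>
#include <cstdint>

// Split n into four base-255 digits, least significant first:
// repeatedly divide by 255 and keep the remainder.
std::array<uint8_t, 4> decompose(uint32_t n) {
    std::array<uint8_t, 4> d{};
    for (int i = 0; i < 4; i++) {
        d[i] = n % 255;
        n /= 255;
    }
    return d;
}

// Reconstruct: ((d[3]*255 + d[2])*255 + d[1])*255 + d[0].
uint32_t reconstruct(const std::array<uint8_t, 4>& d) {
    return (((uint32_t)d[3] * 255 + d[2]) * 255 + d[1]) * 255 + d[0];
}
```

Note that the digits produced this way are actually in 0-254, since they are remainders modulo 255.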

decomposed arrays

As you can see, the last vector is all zeros for the portion of the data shown; in fact, it should be zeros all the way up to the 2^24th element. This would be a 25% reduction if my total vector length were under 16 million elements, but since I'm dealing with much longer vectors the impact is much smaller. More importantly, the third vector also seems to have some compressible features, as its values increase by 1 every 65,535 steps. It does seem that now I can benefit from Huffman coding or variable-bit encoding, as suggested. Any suggestions that allow me to maximally compress this data are deeply appreciated. Here is a bigger sample of the data if anyone is interested:

https://drive.google.com/file/d/10wO3-1j3NkQbaKTcr0nl55bOH9P-G1Uu/view?usp=sharing

Edit3:

I'm really thankful for all the given answers. I've learnt a lot from them. For those of you interested in tinkering with a larger set of the data, the following link has 11 million elements of a similar dataset (zipped, 33 MB):

https://drive.google.com/file/d/1Aohfu6II6OdN-CqnDll7DeHPgEDLMPjP/view

Once you unzip the data, you can use the following C++ snippet to read the data into a vector<int32_t>

    #include <vector>
    #include <fstream>
    #include <iterator>
    #include <algorithm>
    #include <cstdint>

    const char* path = "path_to\\compression_int32.txt"; // note the escaped backslash
    std::vector<int32_t> newVector{};
    std::ifstream ifs(path, std::ios::in | std::ifstream::binary);
    std::istream_iterator<int32_t> iter{ ifs };
    std::istream_iterator<int32_t> end{};
    std::copy(iter, end, std::back_inserter(newVector));
Asked Jun 11 '21 by user3134575


1 Answer

It's easy to get better than a factor of two compression on your example data by using property 3, which I take to mean that every value must be less than its index, with indices starting at 1. Simply use ceiling(log2(i)) bits to store the number at index i (where i starts at 1). For your first example with 24,977 values, that compresses it to 43% of the size of the vector using 32-bit integers.

The number of bits required depends only on the length of the vector, n. The number of bits is:

1 - 2^ceiling(log2(n)) + n ceiling(log2(n))

As noted by Falk Hüffner, a simpler approach would be to use a fixed ceiling(log2(n)) bits for all values. The variable-bit total is always less than that, though not by much for large n.
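A quick sketch to check the bit counts (the direct sum of ceiling(log2(i)) over all indices versus the closed form above):

```cpp
#include <cstdint>

// Total bits when the value at index i (1-based) is stored in ceil(log2(i))
// bits: computed directly by summing over all indices.
uint64_t bits_direct(uint64_t n) {
    uint64_t total = 0;
    for (uint64_t i = 1; i <= n; i++) {
        uint64_t b = 0;
        while ((1ULL << b) < i)   // b = ceil(log2(i))
            b++;
        total += b;
    }
    return total;
}

// The same total via the closed form 1 - 2^ceil(log2(n)) + n*ceil(log2(n)).
uint64_t bits_closed(uint64_t n) {
    uint64_t c = 0;
    while ((1ULL << c) < n)       // c = ceil(log2(n))
        c++;
    return 1 - (1ULL << c) + n * c;
}
```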

If it is common to have a run of zeros at the start, then compress those with a count. There are only a handful of runs of two or three numbers in the remainder, so run-length encoding won't help except for that initial run of zeros.

Another 2% or so (for large sets) could be shaved off using an arithmetic-coding approach, treating each value at index k (indices starting at zero) as a base-(k+1) digit of one very large integer. That would take ceiling(log2(n!)) bits.
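The mixed-radix idea can be sketched for small groups (a group of up to 20 values fits in 64 bits, since 21! exceeds 2^64); this is an illustration of the coding, not the full arbitrary-precision version:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Pack values v[0..n-1], where 0 <= v[i] <= i, into one integer by treating
// v[i] as a base-(i+1) digit. The place value of digit i is i! (factorial).
uint64_t pack(const std::vector<uint32_t>& v) {
    uint64_t code = 0, place = 1;
    for (size_t i = 0; i < v.size(); i++) {
        code += place * v[i];
        place *= i + 1;
    }
    return code;
}

// Recover the digits by repeated division, smallest base first.
std::vector<uint32_t> unpack(uint64_t code, size_t n) {
    std::vector<uint32_t> v(n);
    for (size_t i = 0; i < n; i++) {
        v[i] = code % (i + 1);
        code /= (i + 1);
    }
    return v;
}
```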

Here is a plot of the compression ratios of the arithmetic coding, variable bits per sample coding, and fixed bits per sample coding, all ratioed to a representation with 32 bits for every sample (the sequence length is on a log scale):

[Plot: compression ratio — arithmetic better than variable bits, which is better than fixed bits]

The arithmetic approach requires multiplication and division on integers the length of the compressed data, which is monumentally slow for large vectors. The code below limits the size of the integers to 64 bits, at some cost to the compression ratio, in exchange for it being very fast. This code will give compression ratios about 0.2% to 0.7% more than arithmetic in the plot above, well below variable bits. The data vector must have the property that each value is non-negative and that each value is less than its position (positions starting at one). The compression effectiveness depends only on that property, plus a small reduction if there is an initial run of zeros. There appears to be a bit more redundancy in the provided examples that this compression approach does not exploit.

#include <vector>
#include <cmath>
#include <cstdint>
#include <cstddef>

// Append val, as a variable-length integer, to comp. val must be non-negative.
template <typename T>
void write_varint(T val, std::vector<uint8_t>& comp) {
    while (val > 0x7f) {
        comp.push_back(val & 0x7f);
        val >>= 7;
    }
    comp.push_back(val | 0x80);
}

// Return the variable-length integer at offset off in comp, updating off to
// point after the integer.
template <typename T>
T read_varint(std::vector<uint8_t> const& comp, size_t& off) {
    T val = 0, next;
    int shift = 0;
    for (;;) {
        next = comp.at(off++);
        if (next > 0x7f)
            break;
        val |= next << shift;
        shift += 7;
    }
    val |= (next & 0x7f) << shift;
    return val;
}

// Given the starting index i >= 1, find the optimal number of values to code
// into 64 bits or less, or up through index n-1, whichever comes first.
// Optimal is defined as the least amount of entropy lost by representing the
// group in an integral number of bits, divided by the number of bits. Return
// the optimal number of values in num, and the number of bits needed to hold
// an integer representing that group in len.
static void group_ar64(size_t i, size_t n, size_t& num, int& len) {
    // Analyze all of the permitted groups, starting at index i.
    double min = 1.;
    uint64_t k = 1;                 // integer range is 0..k-1
    auto j = i + 1;
    do {
        k *= j;
        auto e = log2(k);           // entropy of k possible integers
        int b = ceil(e);            // number of bits to hold 0..k-1
        auto loss = (b - e) / b;    // unused entropy per bit
        if (loss < min) {
            num = j - i;            // best number of values so far
            len = b;                // bit length for that number
            if (loss == 0.)
                break;              // not going to get any better
            min = loss;
        }
    } while (j < n && k <= (uint64_t)-1 / ++j);
}

// Compress the data arithmetically coded as an incrementing base integer, but
// with a 64-bit limit on each integer. This puts values into groups that each
// fit in 64 bits, with the least amount of wasted entropy. Also compress the
// initial run of zeros into a count.
template <typename T>
std::vector<uint8_t> compress_ar64(std::vector<T> const& data) {
    // Resulting compressed data vector.
    std::vector<uint8_t> comp;

    // Start with number of values to make the stream self-terminating.
    write_varint(data.size(), comp);
    if (data.size() == 0)
        return comp;

    // Run-length code the initial run of zeros. Write the number of contiguous
    // zeros after the first one.
    size_t i = 1;
    while (i < data.size() && data[i] == 0)
        i++;
    write_varint(i - 1, comp);

    // Compress the data into variable-base integers starting at index i, where
    // each integer fits into 64 bits.
    unsigned buf = 0;       // output bit buffer
    int bits = 0;           // number of bits in buf (0..7)
    while (i < data.size()) {
        // Find the optimal number of values to code, starting at index i.
        size_t num;  int len;
        group_ar64(i, data.size(), num, len);

        // Code num values.
        uint64_t code = 0;
        size_t k = 1;
        do {
            code += k * data[i++];
            k *= i;
        } while (--num);

        // Write code using len bits.
        if (bits) {
            comp.push_back(buf | (code << bits));
            code >>= 8 - bits;
            len -= 8 - bits;
        }
        while (len > 7) {
            comp.push_back(code);
            code >>= 8;
            len -= 8;
        }
        buf = code;
        bits = len;
    }
    if (bits)
        comp.push_back(buf);
    return comp;
}

// Decompress the result of compress_ar64(), returning the original values.
// Start decompression at offset off in comp. When done, off is updated to
// point just after the compressed data.
template <typename T>
std::vector<T> expand_ar64(std::vector<uint8_t> const& comp, size_t& off) {
    // Will contain the uncompressed data to return.
    std::vector<T> data;

    // Get the number of values.
    auto vals = read_varint<size_t>(comp, off);
    if (vals == 0)
        return data;

    // Get the number of zeros after the first one, and write all of them.
    auto run = read_varint<size_t>(comp, off) + 1;
    auto i = run;
    do {
        data.push_back(0);
    } while (--run);

    // Extract the values from the compressed data starting at index i.
    unsigned buf = 0;       // input bit buffer
    int bits = 0;           // number of bits in buf (0..7)
    while (i < vals) {
        // Find the optimal number of values to code, starting at index i. This
        // simply repeats the same calculation that was done when compressing.
        size_t num;  int len;
        group_ar64(i, vals, num, len);

        // Read len bits into code.
        uint64_t code = buf;
        while (bits + 8 < len) {
            code |= (uint64_t)comp.at(off++) << bits;
            bits += 8;
        }
        len -= bits;                    // bits to pull from last byte (1..8)
        uint64_t last = comp.at(off++); // last byte
        code |= (last & ((1 << len) - 1)) << bits;
        buf = last >> len;              // save remaining bits in buffer
        bits = 8 - len;

        // Extract num values from code.
        do {
            i++;
            data.push_back(code % i);
            code /= i;
        } while (--num);
    }

    // Return the uncompressed data.
    return data;
}
Answered Oct 24 '22 by Mark Adler