Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Compress array of numbers

I have a large array (~400.000.000 entries) with integers of {0, 1, ..., 8}. So I need 4 bits per entry. Around 200 MB.

At the moment I use a byte-array and save 2 numbers in each entry.

I wonder, if there is a good method, to compress this array. I did a quick research and found algorithms like Huffmann or LZW. But these algorithms are all for compressing the data, send the compressed data to someone and decompress them.

I just want to have a table, with less memory space, so I can load it into the RAM. The 200MB table easily fits, but I'm thinking on even bigger tables.

Important is, that I still be able to determine the values on certain positions.

Any tips?


Further information: I just did a little experimenting, and it turns out, that on average 2.14 consecutive numbers have the same value. There are 1 zero, 154 ones, 10373 twos, 385990 threes, 8146188 fours, 85008968 fives, 265638366 sixes, 70791576 sevens and 80 eights. So more than half of the numbers are 6s.

I only need a fast getValue(idx) funktion, setValue(idx, value) is not important.

like image 968
Jakube Avatar asked Oct 15 '25 17:10

Jakube


1 Answers

It depends on how your data look like. Are there repeating entries, or do they change only slowly, or what?

If so, you can try to compress chunks of your data and decompress when needed. The bigger the chunks, the more memory you can save and the worse the speed. IMHO no good deal. You could also save the data compressed and decompress in memory.

Otherwise, i.e., in case of no regularities, you'll need at least log(9) / log(2) = 3.17 bits per entry and there's nothing what could improve it.

You can come pretty close to this value by packing 5 numbers into a short. As 9**5 = 59049 < 65536 = 2**16, it fits nearly perfectly. You'll need 3.2 bits per number, no big win. Packing of five number is given via this formula

a + 9 * (b + 9 * (c + 9 * (d + 9 * e)))

and unpacking is trivial via a precomputed table.

UPDATE after question update

Further information: I just did a little experimenting, and it turns out, that on average 2.14 consecutive numbers have the same value. There are 1 zero, 154 ones, 10373 twos, 385990 threes, 8146188 fours, 85008968 fives, 265638366 sixes, 70791576 sevens and 80 eights. So more than half of the numbers are 6s.

The fact that there are on the average about 2.14 consecutive numbers are the same could lead to some compression, but in this case it says us nothing. There are nearly only fives and sixes, so encountering two equal consecutive numbers seems to be implied.

Given this facts, you can forget my above optimization. There are practically only 8 values there as you can treat the single zero separately. So you need just 3 bits per value and a single index for the zero.

You can even create a HashMap for all values below four or above seven, store there 1+154+10373+385990+80 entries and use only 2 bits per value. But this is still far from ideal.

Assuming no regularities, you'd need 1.44 bit per value as this is the entropy. You could go over all 5-tuples, compute their histogram, and use 1 byte for encoding of the 255 most frequent tuples. All the remaining tuples would map to the 256th value, telling you that you have to look in a HashMap for the rare tuple value.

Some evaluation

I was curious if it can work. The packing of 5 numbers into one byte needs 85996340 bytes. There are nearly 5 million tuples which don't fit, so my idea was to use a hash map for them. Assuming rehashing rather than chaining it makes sense to keep it maybe 50% full, so we need 10 million entries. Assuming TIntShortHashMap (mapping indexes to tuples) each entry takes 6 bytes, leading to 60 MB. Too bad.

Packing only 4 numbers into one byte consumes 107495425 bytes and leaves 159531 tuples which don't fit. This looks better, however, I'm sure the denser packing could be improved a lot.

The results as produced by this little program:

*** Packing 5 numbers in a byte. ***
Normal packed size: 85996340.
Number of tuples in need of special handling: 4813535.

*** Packing 4 numbers in a byte. ***
Normal packed size: 107495425.
Number of tuples in need of special handling: 159531.
like image 183
maaartinus Avatar answered Oct 17 '25 08:10

maaartinus