Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Compress sorted integers

I'm building a index which is just several sets of ordered 32 bit integers stored continuously in a binary file. The problem is that this file grows pretty large. I've been thinking of adding some compressions scheme but that's a bit out of my expertise. So I'm wondering, what compression algorithm would work best in this case? Also, decompression has to be fast since this index will be used to make make look ups.

like image 815
Daniel Avatar asked Feb 07 '09 13:02

Daniel


People also ask

Can numbers be compressed?

The best way to compress a large number is to write a short program that generates it - assuming that the number is one whose properties are well known to you. If you're asking how to compress any large number, that's simply not possible. Every sequence of data is not compressible.

How does deflate compression work?

The DEFLATE compressed data format consists of a series of blocks, corresponding to successive blocks of input data. Each block is compressed using a combination of the LZ77 algorithm and Huffman coding .

How do compression algorithms work?

Compression algorithms reduce the number of bytes required to represent data and the amount of memory required to store images. Compression allows a larger number of images to be stored on a given medium and increases the amount of data that can be sent over the internet.


1 Answers

If you are storing integers which are close together (eg: 1, 3 ,4, 5, 9, 10 etc... ) rather than some random 32 bit integers (982346..., 3487623412.., etc) you can do one thing:

Find the differences between the adjacent numbers which would be like 2,1,1,4,1... etc.(in our example) and then Huffman encode this numbers.

I don't think Huffman encoding will work if you directly apply them to the original list of numbers you have.

But if you have a sorted list of near-by numbers, the odds are good that you will get a very good compression ratio by doing Huffman encoding of the number differences, may be better ratio than using the LZW algorithm used in the Zip libraries.

Anyway thanks for posting this interesting question.

like image 131
Niyaz Avatar answered Sep 18 '22 06:09

Niyaz