 

Simple API for random access into a compressed data file

Please recommend a technology suitable for the following task.

I have a rather big (500 MB) data chunk, which is basically a matrix of numbers. The data entropy is low (it should compress well), and storage is expensive where it sits.

What I am looking for is to compress it with a good compression algorithm (like, say, gzip) with markers that would enable very occasional random access. Random access as in "read a byte from location [64-bit address] in the original (uncompressed) stream". This is a little different from the classic deflate libraries like zlib, which only let you decompress the stream sequentially. What I would like is random access with a latency of, say, at most 1 MB of decompression work per byte read.

Of course, I hope to use an existing library rather than reinvent the NIH wheel.

Asked Jul 01 '10 by Pavel Radzivilovsky


4 Answers

If you're working in Java, I just published a library for that: http://code.google.com/p/jzran.

Answered by jkff


Byte Pair Encoding allows random access to data.

You won't get as good compression with it, but you're trading adaptive (variable) Huffman trees for a single fixed tree, which is what makes decoding from an arbitrary point possible.

However, you'll still need some kind of index in order to find a particular "byte". Since you're okay with about 1 MB of latency, you can create one index entry for every 1 MB of uncompressed data (sketched below). Hopefully you can figure out a way to keep that index small enough to still benefit from the compression.

One of the benefits of this method is random access editing too. You can update, delete, and insert data in relatively small chunks.

If it's accessed rarely, you could compress the index with gzip and decode it when needed.
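As a rough sketch of that per-megabyte index idea (independent of BPE or any particular library; every name here is made up for illustration), you could keep one entry per 1 MB of uncompressed data and binary-search it for the chunk covering a requested offset:

#include <stddef.h>
#include <stdint.h>

/* One entry per 1 MB of uncompressed data: where that chunk starts in the
 * original stream and where its compressed form starts in the file.
 * Entries are sorted by uncompressed_off, and the first one starts at 0. */
typedef struct {
    uint64_t uncompressed_off;
    uint64_t compressed_off;
} index_entry;

/* Return the entry whose chunk contains `target` (the largest
 * uncompressed_off that is <= target).  Assumes n >= 1. */
static const index_entry *
find_chunk(const index_entry *idx, size_t n, uint64_t target)
{
    size_t lo = 0, hi = n;              /* answer lies in [lo, hi)          */
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (idx[mid].uncompressed_off <= target)
            lo = mid;                   /* mid still starts at or before it */
        else
            hi = mid;                   /* mid starts past it               */
    }
    return &idx[lo];
}

Given the entry, you decompress that single chunk starting at compressed_off and skip target - uncompressed_off bytes, so a lookup never costs more than roughly one chunk's worth of decompression.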

Answered by Marcus Adams


If you want to minimize the work involved, I'd just break the data into 1 MB (or whatever) chunks, then put the pieces into a PKZIP archive. You'd then need a tiny bit of front-end code to take a file offset, and divide by 1M to get the right file to decompress (and, obviously, use the remainder to get to the right offset in that file).
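A minimal sketch of that front-end arithmetic (the 1 MB chunk size and the member naming scheme are just assumptions for the example):

#include <stdint.h>
#include <stdio.h>

#define CHUNK_SIZE (1024u * 1024u)   /* 1 MB of original data per archive member */

int main(void)
{
    uint64_t offset = 123456789;            /* byte wanted in the original matrix */
    uint64_t chunk  = offset / CHUNK_SIZE;  /* which member file to decompress    */
    uint64_t within = offset % CHUNK_SIZE;  /* position inside that member        */

    char member[32];
    snprintf(member, sizeof member, "chunk_%06llu.bin", (unsigned long long)chunk);

    printf("decompress %s, then read byte %llu of it\n",
           member, (unsigned long long)within);
    return 0;
}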

Edit: Yes, there is existing code to handle this. Recent versions of Info-ZIP's unzip (6.0 is current) include api.c. Among other things, it includes UzpUnzipToMemory: you pass it the name of a ZIP file and the name of one of the files in that archive that you want to retrieve, and you get back a buffer holding the contents of that file. For updating, you'll need the api.c from zip 3.0, using ZpInit and ZpArchive (though these aren't quite as simple to use as the unzip side).

Alternatively, you can just run a copy of zip/unzip in the background to do the work. This isn't quite as neat, but undoubtedly a bit simpler to implement (as well as allowing you to switch formats pretty easily if you choose).
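For the background-process route, a small sketch along these lines could work, using unzip -p to stream a single member to stdout (the archive and member names are placeholders):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Ask unzip to stream one 1 MB member to stdout and read it through a pipe.
     * "matrix.zip" and the member name are placeholders for this sketch. */
    FILE *p = popen("unzip -p matrix.zip chunk_000117.bin", "r");
    if (!p) { perror("popen"); return 1; }

    /* Skip to the byte we want inside this member, then read it. */
    long within = 481397;               /* offset inside the member */
    char byte = 0;
    for (long i = 0; i <= within; ++i)
        if (fread(&byte, 1, 1, p) != 1) { pclose(p); return 1; }

    printf("byte at that offset: 0x%02x\n", (unsigned char)byte);
    pclose(p);
    return 0;
}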

Answered by Jerry Coffin


Take a look at my project, csio. I think it is exactly what you are looking for: a stdio-like interface, with a multithreaded compressor included.

It is a library, written in C, which provides a CFILE structure and the functions cfopen, cfseek, cftello, and others. You can use it with regular (uncompressed) files and with files compressed by the dzip utility, which is included in the project and written in C++. dzip produces a valid gzip archive that can be handled by standard utilities as well as by csio, and it can compress in many threads (see the -j option), so very big files compress quickly.

Typical usage:

dzip -j4 myfile

...

CFILE* file = cfopen("myfile.dz", "r");  /* open the dzip-compressed file       */
off_t some_offset = 673820;              /* offset in the uncompressed stream   */
cfseek(file, some_offset);               /* seek without decompressing the rest */
char buf[100];
cfread(buf, 100, 1, file);               /* read one 100-byte record from there */
cfclose(file);

It is MIT licensed, so you can use it in your projects without restrictions. For more information, visit the project page on GitHub: https://github.com/hoxnox/csio

Answered by hoxnox