
Random access gzip stream

I'd like to be able to do random access into a gzipped file. I can afford to do some preprocessing on it (say, build some kind of index), provided that the result of the preprocessing is much smaller than the file itself.

Any advice?

My thoughts were:

  • Hack on an existing gzip implementation and serialize its decompressor state every, say, 1 MB of compressed data. Then, to do random access, deserialize the decompressor state and read from the nearest megabyte boundary. This seems hard, especially since I'm working with Java and I couldn't find a pure-Java gzip implementation :(
  • Re-compress the file in chunks of 1 MB and do the same as above. This has the disadvantage of doubling the required disk space.
  • Write a simple parser for the gzip format that doesn't do any decompression and only detects and indexes block boundaries (if there even are any blocks: I haven't read the gzip format description yet).
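
To make the second idea concrete, here is a minimal sketch in Python (chosen for brevity; the same approach works with java.util.zip). It compresses each chunk as an independent gzip member and records the compressed offset where each member starts. The names `recompress_in_chunks` and `read_chunked` are made up for illustration, and the data is kept in memory to keep the example short; a real version would seek within a file instead.

```python
import gzip
import io

CHUNK_SIZE = 1 << 20  # uncompressed bytes per gzip member (assumed value)

def recompress_in_chunks(data, chunk_size=CHUNK_SIZE):
    """Compress data as independent gzip members.

    Returns (blob, index), where index[i] is the compressed offset at which
    member i starts and the final entry marks the end of the blob.
    """
    out = io.BytesIO()
    index = []
    for start in range(0, len(data), chunk_size):
        index.append(out.tell())
        out.write(gzip.compress(data[start:start + chunk_size]))
    index.append(out.tell())  # sentinel: end of the last member
    return out.getvalue(), index

def read_chunked(blob, index, offset, length, chunk_size=CHUNK_SIZE):
    """Return uncompressed bytes [offset, offset + length), decompressing
    only the members that overlap the requested range."""
    end = offset + length
    chunk = offset // chunk_size
    result = bytearray()
    while chunk < len(index) - 1 and chunk * chunk_size < end:
        plain = gzip.decompress(blob[index[chunk]:index[chunk + 1]])
        lo = max(offset - chunk * chunk_size, 0)
        hi = min(end - chunk * chunk_size, len(plain))
        result += plain[lo:hi]
        chunk += 1
    return bytes(result)
```

Because concatenated gzip members still form a valid gzip file, the re-compressed output remains readable by ordinary gzip tools; the index costs only one integer per chunk.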
jkff asked Mar 26 '10 21:03



1 Answer

FWIW: I've developed a command-line tool, based on zlib's zran.c source code, that creates indexes for gzip files and uses them for random access: https://github.com/circulosmeos/gztool

It can even create an index for a still-growing gzip file (for example, a log that rsyslog writes directly in gzip format), which in practice reduces the index creation time to nearly zero. See the -S (Supervise) option.
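
For readers who want a feel for the access-point idea that zran.c is built on, here is a rough in-memory imitation in Python. It is not gztool's or zran.c's code: instead of saving a bit offset and a 32 KB window to disk, it keeps copies of a `zlib` decompressor object at regular compressed offsets, so the "index" lives only in memory. It also assumes a single-member gzip file. The function names and the `span` parameter are hypothetical.

```python
import bisect
import zlib

def build_checkpoints(blob, span=1 << 20):
    """Decompress once, snapshotting the decompressor every `span` compressed bytes.

    Each checkpoint is (uncompressed_offset, compressed_offset, state).
    """
    d = zlib.decompressobj(31)  # wbits=31: expect a gzip header and trailer
    checkpoints = [(0, 0, d.copy())]
    produced = 0
    for pos in range(0, len(blob), span):
        produced += len(d.decompress(blob[pos:pos + span]))
        if d.eof:
            break  # end of stream reached; no further checkpoint is useful
        checkpoints.append((produced, min(pos + span, len(blob)), d.copy()))
    return checkpoints

def read_from_checkpoint(blob, checkpoints, offset, length, span=1 << 16):
    """Return uncompressed bytes [offset, offset + length), resuming from
    the nearest checkpoint at or before `offset`."""
    starts = [c[0] for c in checkpoints]
    i = bisect.bisect_right(starts, offset) - 1
    u_off, c_off, state = checkpoints[i]
    d = state.copy()  # leave the stored checkpoint untouched
    skip = offset - u_off
    out = bytearray()
    pos = c_off
    while len(out) < skip + length and pos < len(blob) and not d.eof:
        out += d.decompress(blob[pos:pos + span])
        pos += span
    return bytes(out[skip:skip + length])
```

The cost of a read is bounded by the distance to the previous checkpoint rather than by the position in the file, which is the same trade-off a persistent index makes.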

circulosmeos answered Oct 20 '22 15:10
