
Reading a gz file and keeping track of position in file

Tags: java, io

So, here is the situation:

I have to read large .gz archives (several GB) and "index" them so that I can later retrieve specific pieces using random access. In other words, I want to read the archive line by line and record the location in the file of each line, so that I can jump directly to those locations on request. (PS: the content is UTF-8, so we cannot assume 1 byte == 1 char.)

So basically, all I need is a BufferedReader that keeps track of its location in the file. However, this doesn't seem to exist.

Is there anything available or do I have to roll my own?

A few additional comments:

  • I cannot use BufferedReader directly, since the underlying file position reflects what has been buffered so far: it advances in multiples of the internal buffer size rather than pointing at the current line.
  • I cannot use InputStreamReader directly for performance reasons. Unbuffered reads would be way too slow, and it also lacks convenience methods for reading lines.
  • I cannot use RandomAccessFile since 1. the file is gzipped, and 2. RandomAccessFile uses "modified" UTF-8.

I guess the best approach would be a kind of buffered reader that keeps track of both the file location and the buffer offset, but this sounds quite cumbersome. Maybe I missed something: perhaps something already exists that reads files line by line and keeps track of the location (even if gzipped).
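To make the idea concrete, here is a minimal sketch of such a reader. The class name OffsetLineReader and its methods are made up for illustration and are not part of the JDK or any library. It assumes lines end in \n or \r\n. It splits on raw bytes before decoding, which is safe for UTF-8 because the byte 0x0A never occurs inside a multi-byte sequence.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: reads UTF-8 lines and reports each line's byte offset. */
class OffsetLineReader implements AutoCloseable {
    private final InputStream in;
    private final ByteArrayOutputStream line = new ByteArrayOutputStream();
    private long position = 0;   // bytes consumed from the stream so far
    private long lineStart = 0;  // offset of the first byte of the last line read

    OffsetLineReader(InputStream source) {
        // Buffer here so that single-byte read() calls stay cheap.
        this.in = new BufferedInputStream(source, 1 << 16);
    }

    /** Returns the next line without its terminator, or null at end of stream. */
    String readLine() throws IOException {
        line.reset();
        lineStart = position;
        int b;
        while ((b = in.read()) != -1) {
            position++;
            if (b == '\n') {
                return decode();
            }
            line.write(b);
        }
        // Last line without a trailing newline, or end of stream.
        return line.size() > 0 ? decode() : null;
    }

    private String decode() {
        byte[] bytes = line.toByteArray();
        int len = bytes.length;
        if (len > 0 && bytes[len - 1] == '\r') {
            len--; // drop the CR of a CRLF terminator
        }
        return new String(bytes, 0, len, StandardCharsets.UTF_8);
    }

    /** Byte offset at which the most recently returned line starts. */
    long lineStart() {
        return lineStart;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}
```

It could be wrapped around new GZIPInputStream(new FileInputStream(path)), calling lineStart() after each readLine(). Note that the offsets would then refer to the uncompressed stream, so on their own they do not allow jumping into the compressed file without decompressing from the start.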

Thanks for any tips,

Arnaud

dagnelies asked Oct 11 '22 14:10


1 Answer

I think jzran could be pretty much what you're looking for:

It's a Java library based on the zran.c sample from zlib.

You can preprocess a large gzip archive, producing an "index" that can be used for random read access.

You can trade index size against access speed: a denser index makes random reads faster but takes more space.

schnaader answered Oct 15 '22 11:10