Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Indexing / random access to 7zip .7z archives

Tools exist to provide random access to gzip and bzip2 archives:

  • gzip zran.c from the ghostscript source code
  • bzip2 seek-bzip by James Taylor

I'm looking for any similar solution for 7zip

(The goal is to utilize the sometimes gigantic Wikipedia dump files offline without keeping decompressed copies around)

like image 416
hippietrail Avatar asked Dec 21 '22 20:12

hippietrail


1 Answers

I tought it's better to summarize GZIP, BZIP2 and LZMA internals to make clear something:

  1. GZIP is actually a format which uses Deflate algorithm. Due to static huffman codes (deflate documents also mentions about dynamic huffman, but actually they are static too) deflate should be encoded as block-wise (sliding window is another term in here). zran.c seems to find those blocks boundaries and tries to decode up to 2 consecutive blocks at most which could contain a few KiB uncompressed data for collecting enough data to decompress (to fill entire 32 KiB window). So, random access is quite possible even without index table.

  2. BZIP2 is actually a BWT class compression algorithm. And due to BWT nature, it's no wonder that it's block-wise. It's blocks limited up to 900 KiB for each individual blocks. Also, blocks boundaries are well defined to easy recovering process (has huge distinct markers). So, you can even use multiple threads at once to decompress all data. In another words, random access is quite possible even without any table (it's already supported by format).

  3. LZMA supports up to 1 GiB dictionary and it's not block-wise encoded. It uses a range coder to encode probabilities rather than huffman coder. Even considering 64 MiB window size (very common value), due to range coder nature we can't simply decode at a given random point until fill-up entire window. Also, LZMA's state machine can be bothersome too. So, it's implementation is quite hard or even impossible.

Maybe LZMA2 or PPM methods can be used for such usages (7-zip supports them as well within 7-zip format). PPM flushes it's model when it's statistics are full and LZMA2 intentionally flushes some state at some interval to enable multi-threaded decompression. Their random access implementation can be possible.

like image 91
Osman Turan Avatar answered Jan 07 '23 10:01

Osman Turan