
Decompression and extraction of files from streaming archive on the fly

I'm writing a browser plugin, similar to Flash and Java in that it starts downloading a file (.jar or .swf) as soon as it is displayed. Java waits (I believe) until the entire jar file is loaded, but Flash does not. I want the same ability, but with a compressed archive file: I would like to access files in the archive as soon as the bytes necessary for their decompression have been downloaded.

For example, while downloading the archive into a memory buffer, I want to be able to decompress the first file (also to a memory buffer) as soon as enough bytes have arrived to decompress it.

Are there any formats/libraries that support this?

EDIT: If possible, I'd prefer a single file format instead of separate ones for compression and archiving, like gz/bzip2 and tar.

asked Jul 21 '09 by GhassanPL

2 Answers

There are two issues here:

  1. How to write the code.

  2. What format to use.

On the file format: you can't use the .ZIP format, because .ZIP puts its table of contents at the end of the file. That means you'd have to download the entire file before you can know what's in it. Zip has per-file headers you can scan for, but those headers are not the official list of what's in the file.

Zip explicitly puts the table of contents at the end because it allows files to be appended quickly.

Assume you have a zip file that contains files 'a', 'b', and 'c', and you want to update 'c'. It's perfectly valid in zip to read the table of contents, append the new 'c', and write a new table of contents pointing to the new 'c'; the old 'c' is still in the file. If you scan for headers you'll end up seeing the old 'c', since it's still there.

This appending behavior was an explicit design goal of zip. It comes from the 1980s, when a zip archive could span multiple floppy discs. If you needed to add a file, it would suck to have to read all N discs just to rewrite the entire archive. Instead, the format lets you append updated files at the end, which means only the last disc is needed: read the old TOC, append the new files, write a new TOC.
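To see concretely why this defeats a streaming consumer, here's a minimal sketch (the function name is mine, not from any library) of how a zip reader locates the authoritative table of contents: it scans backwards from the end of the file for the end-of-central-directory record, which by definition can't succeed until the whole file has arrived.

#include <cstdint>
#include <cstring>
#include <vector>

// The real zip table of contents is found via the end-of-central-
// directory (EOCD) record, signature 0x06054b50, which sits near
// the END of the archive (possibly followed by a comment of up to
// 65535 bytes). A partially downloaded buffer never contains it.
static const uint32_t kEocdSignature = 0x06054b50;

bool find_zip_toc(const std::vector<uint8_t>& file, size_t* eocd_offset)
{
    if (file.size() < 22)  // minimum EOCD record size
        return false;
    // Scan backwards over the window where the EOCD could start.
    size_t stop = file.size() >= 22 + 65535 ? file.size() - 22 - 65535 : 0;
    for (size_t i = file.size() - 22; ; --i) {
        uint32_t sig;
        std::memcpy(&sig, &file[i], 4);  // assumes a little-endian host
        if (sig == kEocdSignature) {
            *eocd_offset = i;
            return true;
        }
        if (i == stop)
            return false;
    }
}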

Gzipped tar files don't have this problem. Tar files are stored as header, file, header, file, and the compression is applied on top of that, so it's possible to decompress the stream as it's downloaded and use the files as they become available. You can create gzipped tar files easily on Windows using WinRAR (commercial) or 7-Zip (free), and on Linux, OS X, and Cygwin with the tar command.
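Because each tar entry is a 512-byte header block followed by the entry's data (padded to a 512-byte boundary), a streaming reader can act on each file as soon as its bytes arrive. A minimal sketch, using the field offsets from the POSIX ustar layout (the struct and function names are mine):

#include <cstdlib>
#include <cstring>
#include <string>

// Parses the two fields a streaming reader needs from a tar
// header block: the entry name (NUL-padded string at offset 0,
// max 100 bytes) and the size (octal ASCII at offset 124,
// 12 bytes). Returns false on the all-zero end-of-archive block.
struct TarEntry {
    std::string name;
    size_t size;
};

bool parse_tar_header(const unsigned char block[512], TarEntry* out)
{
    bool all_zero = true;
    for (int i = 0; i < 512; ++i)
        if (block[i] != 0) { all_zero = false; break; }
    if (all_zero)
        return false;

    size_t name_len = 0;
    while (name_len < 100 && block[name_len] != 0)
        ++name_len;
    out->name.assign(reinterpret_cast<const char*>(block), name_len);

    char size_field[13];
    std::memcpy(size_field, block + 124, 12);
    size_field[12] = '\0';
    out->size = static_cast<size_t>(std::strtoull(size_field, nullptr, 8));
    return true;
}

Once the size is known, you can hand off the next size bytes (then skip the padding to the next 512-byte boundary) without ever seeing the rest of the archive, and the gzip layer underneath delivers those bytes incrementally.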

On the code to write:

O3D does this, and it is open source, so you can look at the code: http://o3d.googlecode.com

The decompression code is in o3d/import/cross/...

It targets NPAPI using some glue code, which can be found in o3d/plugin/cross

answered Sep 19 '22 by gman

Check out the zlib filters in Boost.Iostreams. They make using zlib a snap.

Here's the sample from the Boost docs that decompresses a file and writes it to the console:

#include <fstream>
#include <iostream>
#include <boost/iostreams/filtering_streambuf.hpp>
#include <boost/iostreams/copy.hpp>
#include <boost/iostreams/filter/zlib.hpp>

int main()
{
    using namespace std;
    using namespace boost::iostreams;  // filtering_streambuf, input, zlib_decompressor

    // Open the compressed file in binary mode.
    ifstream file("hello.z", ios_base::in | ios_base::binary);

    // Chain a zlib decompressor in front of the file stream.
    filtering_streambuf<input> in;
    in.push(zlib_decompressor());
    in.push(file);

    // Pump the decompressed bytes to stdout.
    boost::iostreams::copy(in, cout);
}
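
Since the question is about decompressing while the download is still in progress, note that the sample above reads from a completed file. For a buffer that grows as bytes arrive, plain zlib's inflate() can be fed chunks incrementally; here's a minimal sketch (the helper name and chunking scheme are mine, and error handling is trimmed):

#include <vector>
#include <zlib.h>

// Feed one downloaded chunk to zlib and append the decompressed
// bytes to 'out'. 'strm' must be zero-initialized and set up once
// with inflateInit() before the first call. Returns true while
// more input is expected; false once the stream is complete (or
// on error, which real code should distinguish and report).
bool inflate_chunk(z_stream* strm,
                   const unsigned char* data, size_t len,
                   std::vector<unsigned char>* out)
{
    strm->next_in = const_cast<unsigned char*>(data);
    strm->avail_in = static_cast<uInt>(len);
    unsigned char buf[16384];
    int ret;
    do {
        strm->next_out = buf;
        strm->avail_out = sizeof(buf);
        ret = inflate(strm, Z_NO_FLUSH);  // consume whatever we have so far
        if (ret != Z_OK && ret != Z_STREAM_END && ret != Z_BUF_ERROR)
            return false;                 // Z_BUF_ERROR just means "need more input"
        out->insert(out->end(), buf, buf + (sizeof(buf) - strm->avail_out));
        if (ret == Z_STREAM_END)
            break;
    } while (strm->avail_out == 0);       // output buffer full: keep draining
    return ret != Z_STREAM_END;
}

Call inflate_chunk() from your download callback each time new bytes land; combined with header-by-header tar parsing, each file in a .tar.gz becomes usable as soon as its compressed bytes have been received.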
answered Sep 21 '22 by Matt Price