Can you pre-compress data files to be inserted into a zip file at a later time to improve performance?

As part of our installer build, we have to zip thousands of large data files into about ten or twenty 'packages', each containing a few hundred (or even thousands of) files that must be kept together with the other files in their package. (They are versioned together, if you will.)

Then during the actual install, the user selects which packages they want included on their system. This also lets them download updates to the packages from our site as one large, versioned file rather than asking them to download thousands of individual ones which could also lead to them being out of sync with others in the same package.

Since these are data files, some of them change regularly during the design and coding stages, meaning we then have to re-compress all files in that particular zip package, even if only one file has changed. This makes the packaging step of our installer build take well over an hour each time, with most of that going to re-compressing things that we haven't touched.

We've looked into leaving the zip packages alone, then replacing specific files inside them, but inserting and removing large files from the middle of a zip doesn't give us that much of a performance boost. (A little, but not enough that it's worth it.)

I'm wondering if it's possible to pre-process files down into a cached raw 'compressed state' that matches how they would be written to the zip package, but only the data itself, not the zip header info, etc.

My thinking is that if that is possible, during our build step we would first look for any data file that doesn't have a compressed cache associated with it; for each such file, we would compress it and write the result to the cache.

Next we would simply append all of the caches together in a file stream, adding any appropriate zip header needed for the files.

This would mean we are still recreating the entire zip during each build, but we are only recompressing data that has changed. The rest would just be written as-is which is very fast since it is a straight write-to-disk. And if a data file changes, its cache is destroyed, so next build-pass it would be recreated.

However, I'm not sure such a thing is possible. Is it, and if so, is there any documentation to show how one would go about attempting this?

Mark A. Donohoe asked Oct 18 '13


1 Answer

Yes, that's possible. The most straightforward approach would be to zip each file individually into its own associated zip archive with one entry. When any file is modified, you replace its associated zip file to keep all of those up to date. Then you can write a simple program to take a set of those single entry zip files and merge them into a single zip file. You will need to refer to the documentation in the PKZip appnote. Take a look at that.

Now that you've read the appnote, here's what to do with the local header, data, and central header from each individual zip file:

1. Write each local header and its data, as-is, sequentially to the new zip file, saving the central header and the offset of the local header in the new file for each entry.

2. At the end of the new file, note the current offset, then write a new central directory using the central headers you saved, updating the local-header offsets appropriately.

3. End with a new end-of-central-directory record containing the offset of the start of the central directory.

Update:

I decided this was a useful enough thing to write. You can get it here.

Mark Adler answered Sep 27 '22