Add Files Into Existing Zip - performance issue

Tags: performance, c#, wcf, dotnetzip

I have a WCF web service that saves files to a folder (about 200,000 small files). After that, I need to move them to another server.

The solution I found was to zip them and then move them.

When I adopted this solution, I tested it with 20,000 files: zipping 20,000 files took only about 2 minutes, and moving the zip is really fast. But in production, zipping 200,000 files takes more than 2 hours.

Here is my code to zip the folder :

using (ZipFile zipFile = new ZipFile())
{
    zipFile.UseZip64WhenSaving = Zip64Option.Always;
    zipFile.CompressionLevel = CompressionLevel.None;
    zipFile.AddDirectory(this.SourceDirectory.FullName, string.Empty);

    zipFile.Save(DestinationCurrentFileInfo.FullName);
}

I want to modify the WCF webservice, so that instead of saving to a folder, it saves to the zip.

I use the following code to test:

var listAes = Directory.EnumerateFiles(myFolder, "*.*", SearchOption.AllDirectories).Where(s => s.EndsWith(".aes")).Select(f => new FileInfo(f));

foreach (var additionFile in listAes)
{
    using (var zip = ZipFile.Read(nameOfExistingZip))
    {
        zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
        zip.AddFile(additionFile.FullName);

        zip.Save();
    }

    file.WriteLine("Delay for adding a file  : " + sw.Elapsed.TotalMilliseconds);
    sw.Restart();
}

The first file added to the zip takes only 5 ms, but the 10,000th file takes 800 ms.

Is there a way to optimize this? Or do you have other suggestions?

EDIT

The example shown above is only a test. In the WCF web service, I'll have different requests sending files that I need to add to the zip file. As WCF is stateless, I will have a new instance of my class with each call, so how can I keep the zip file open to add more files?

asked May 13 '15 18:05

Anas

1 Answer

I've looked at your code and immediately spotted problems. The problem with a lot of software developers nowadays is that they don't understand how stuff works, which makes it impossible to reason about it. In this particular case you don't seem to know how ZIP files work; I would therefore suggest you first read up on how they work and try to break down what happens under the hood.

Reasoning

Now that we're all on the same page on how they work, let's start the reasoning by breaking down how this works using your source code; we'll continue from there on forward:

var listAes = Directory.EnumerateFiles(myFolder, "*.*", SearchOption.AllDirectories).Where(s => s.EndsWith(".aes")).Select(f => new FileInfo(f));

foreach (var additionFile in listAes)
{
    // (1)
    using (var zip = ZipFile.Read(nameOfExistingZip))
    {
        zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
        // (2)
        zip.AddFile(additionFile.FullName);

        // (3)
        zip.Save();
    }

    file.WriteLine("Delay for adding a file  : " + sw.Elapsed.TotalMilliseconds);
    sw.Restart();
}

(1) Opens a ZIP file. You're doing this for every file you attempt to add.
(2) Adds a single file to the ZIP file.
(3) Saves the complete ZIP file.

On my computer this takes about an hour.

Now, not all of the file format details are relevant. We're looking for stuff that will get increasingly worse in your program.

Skimming over the file format specification, you'll notice that compression is based on Deflate which doesn't require information on the other files that are compressed. Moving on, we'll notice how the 'file table' is stored in the ZIP file:

[Figure: ZIP file structure]

You'll notice here that there's a 'central directory' which stores the files in the ZIP file. It's basically stored as a 'list'. So, using this information, we can reason about what the trivial way to update it is when implementing steps (1-3) in this order:

(1) Open the ZIP file and read the central directory.
(2) Append the data for the (new) compressed file, and store the pointer along with the filename in the new central directory.
(3) Rewrite the central directory.

Think about it for a moment: for file #1 you need 1 write operation; for file #2, you need to read (1 item), append (in memory) and write (2 items); for file #3, you need to read (2 items), append (in memory) and write (3 items). And so on. Adding N files this way writes 1 + 2 + ... + N = N(N+1)/2 directory entries in total, so the work grows quadratically with the number of files. This basically means that your performance will go down the drain as you add more files. You've already observed this, now you know why.
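To put a number on that, here is a minimal sketch of the arithmetic. It only counts central directory entries written; it does not touch any ZIP library, and the class and method names are made up for illustration.

```csharp
static class CentralDirectoryCost
{
    // Entries written when files are added one at a time and the whole
    // central directory is rewritten after every addition:
    // 1 + 2 + ... + fileCount = fileCount * (fileCount + 1) / 2.
    public static long EntriesWrittenOneByOne(long fileCount)
    {
        return fileCount * (fileCount + 1) / 2;
    }

    // Entries written when everything is added first and saved once:
    // each entry lands in the central directory exactly once.
    public static long EntriesWrittenSingleSave(long fileCount)
    {
        return fileCount;
    }
}
```

For 200,000 files, EntriesWrittenOneByOne returns 20,000,100,000, against 200,000 for a single save. This ignores the cost of reading the directory back and of copying file data, so treat it as a lower bound on the extra work.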

A possible solution

Adding all files at once is the easiest fix, but that might not work in your use case. Another solution is to implement a merge that basically merges 2 files together every time. This is more convenient if you don't have all files available when you start the compression process.

Basically the algorithm then becomes:

(1) Add a few files (say, 16; you can toy with this number). Store them in, say, 'file16.zip'.
(2) Add more files. When you hit 16 files again, merge the two files of 16 items into a single file of 32 items.
(3) Merge files until you cannot merge anymore. Basically, every time you have two files of N items, you create a new file of 2*N items.
(4) Go to (2).
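The merge step itself could look roughly like the sketch below. Note that it uses System.IO.Compression from the .NET base library rather than DotNetZip, simply because that keeps the example self-contained. ZipMerge, Merge and CopyEntries are hypothetical names, and the sketch assumes the two archives never contain the same entry name.

```csharp
using System.IO;
using System.IO.Compression;

static class ZipMerge
{
    // Copies every entry of both source archives into one new archive that
    // is written to 'output'. Entries are stored without compression,
    // mirroring the CompressionLevel.None setting used in the question.
    public static void Merge(Stream first, Stream second, Stream output)
    {
        using (var target = new ZipArchive(output, ZipArchiveMode.Create, leaveOpen: true))
        {
            CopyEntries(first, target);
            CopyEntries(second, target);
        }
    }

    private static void CopyEntries(Stream source, ZipArchive target)
    {
        using (var archive = new ZipArchive(source, ZipArchiveMode.Read, leaveOpen: true))
        {
            foreach (ZipArchiveEntry entry in archive.Entries)
            {
                // Assumption: entry names are unique across both archives.
                ZipArchiveEntry copy = target.CreateEntry(entry.FullName, CompressionLevel.NoCompression);
                using (Stream from = entry.Open())
                using (Stream to = copy.Open())
                {
                    from.CopyTo(to);
                }
            }
        }
    }
}
```

In practice you would pass FileStream instances for the two input files and the new output file, then delete the inputs once the merged archive has been written successfully. Each merge writes the central directory of the result exactly once, which is the whole point of the scheme.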

Again, we can reason about it. The first 16 files aren't a problem, we've already established that.

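We can also reason about what will happen in our program. Because we're merging 2 files into 1 file, we don't have to do as many reads and writes. In fact, if you reason about it, you'll see that you get a file of 32 entries after 1 merge, 64 after 3 merges, 128 after 7 merges, 256 after 15 merges... hey, wait, we know this sequence: a file of 16 * 2^N entries takes 2^N - 1 merges. For 200,000 files that comes to about 12,500 merges in total, which by itself doesn't sound like much of an improvement. What matters is how often each entry gets rewritten: an entry takes part in at most 13 merges (log2 of 12,500, rounded down). That's fewer than 3 million entry copies, which is much better than the roughly 20 billion central directory entry writes that we started with.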
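An estimate like this is easy to check with a small simulation. The sketch below only models entry counts, not real archives; MergeSimulation and Run are hypothetical names, and it assumes the file count is a multiple of the batch size (any leftover files are ignored).

```csharp
using System.Collections.Generic;

static class MergeSimulation
{
    // Every 'batchSize' files form a new archive. Whenever the two most
    // recent archives hold the same number of entries, they are merged.
    // Returns the number of merges and the total number of entries that
    // those merges had to copy.
    public static (long Merges, long EntriesCopied) Run(int fileCount, int batchSize)
    {
        var archives = new Stack<long>(); // entry counts, newest on top
        long merges = 0;
        long copied = 0;

        for (int added = batchSize; added <= fileCount; added += batchSize)
        {
            long current = batchSize;
            while (archives.Count > 0 && archives.Peek() == current)
            {
                current += archives.Pop(); // merge two equally sized archives
                merges++;
                copied += current;         // every entry of the result is copied
            }
            archives.Push(current);
        }

        return (merges, copied);
    }
}
```

MergeSimulation.Run(200000, 16) reports 12,494 merges and 2,512,000 copied entries. The number of merges stays in the same ballpark as the number of batches, but the number of entries copied drops by several orders of magnitude compared to rewriting the central directory after every single file.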

Hacking in the ZIP file

Yet another solution that might come to mind is to overallocate the central directory, creating slack space for future entries. However, this probably requires you to hack into the ZIP code and create your own ZIP file writer. The idea is that you basically overallocate the central directory to 200K entries before you get started, so that you can simply append in place.

Again, we can reason about it: adding a file now means appending its data and updating some headers. It won't be as fast as adding all files at once, because you'll need random disk IO, but it'll probably work fast enough.

I haven't worked this out, but it doesn't seem overly complicated to me.

The easiest solution is the most practical

What we haven't discussed so far is the easiest possible solution: simply add all files at once. Again, we can reason about it: every file and the central directory are written exactly once.

Implementation is quite easy, because now we don't have to do any fancy things; we can simply use the ZIP handler (I use DotNetZip, i.e. Ionic.Zip) as-is:

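using System;
using System.Diagnostics;
using System.IO;
using System.Text;
using Ionic.Zip;

static class Program
{
    static void Main()
    {
        // Start from scratch so every run measures a fresh archive.
        try { File.Delete(@"c:\tmp\test.zip"); }
        catch { }

        var sw = Stopwatch.StartNew();

        using (var zip = new ZipFile(@"c:\tmp\test.zip"))
        {
            zip.UseZip64WhenSaving = Zip64Option.Always;
            // Store entries uncompressed; set once, before any entry is added.
            zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
            byte[] contents = Encoding.UTF8.GetBytes("Hello world!");

            for (int i = 0; i < 200000; ++i)
            {
                // Nothing is written to disk yet; the entry is only queued.
                zip.AddEntry("foo" + i + ".txt", contents);
            }

            // One save: all entries and the central directory are written once.
            zip.Save();
        }

        Console.WriteLine("Elapsed: {0:0.00}s", sw.Elapsed.TotalSeconds);
        Console.ReadLine();
    }
}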

Whoop; that finishes in 4.5 seconds. Much better.

answered Sep 21 '22 00:09

atlaste
