
Parallel.ForEach throws exception when extracting a zip file

I am reading the contents of a zip file and trying to extract them.

  var allZipEntries = ZipFile.Open(zipFileFullPath, ZipArchiveMode.Read).Entries;

If I extract the entries with a foreach loop, this works fine. The drawback is that it is equivalent to the zip.extract method, so I gain nothing when I intend to extract all the files.

        foreach (var currentEntry in allZipEntries)
        {
            if (currentEntry.FullName.Equals(currentEntry.Name))
            {
                currentEntry.ExtractToFile($"{tempPath}\\{currentEntry.Name}");
            }
            else
            {
                var subDirectoryPath = Path.Combine(tempPath, Path.GetDirectoryName(currentEntry.FullName));
                Directory.CreateDirectory(subDirectoryPath);
                currentEntry.ExtractToFile($"{subDirectoryPath}\\{currentEntry.Name}");
            }

        }

To take advantage of the TPL, I tried using Parallel.ForEach, but that throws the following exception:

An exception of type 'System.IO.InvalidDataException' occurred in System.IO.Compression.dll but was not handled in user code

Additional information: A local file header is corrupt.

        Parallel.ForEach(allZipEntries, currentEntry =>
        {
            if (currentEntry.FullName.Equals(currentEntry.Name))
            {
                currentEntry.ExtractToFile($"{tempPath}\\{currentEntry.Name}");
            }
            else
            {
                var subDirectoryPath = Path.Combine(tempPath, Path.GetDirectoryName(currentEntry.FullName));
                Directory.CreateDirectory(subDirectoryPath);
                currentEntry.ExtractToFile($"{subDirectoryPath}\\{currentEntry.Name}");
            }

        });

To avoid this I could use a lock, but that defeats the whole purpose.

        Parallel.ForEach(allZipEntries, currentEntry =>
        {
            lock (thisLock)
            {
                if (currentEntry.FullName.Equals(currentEntry.Name))
                {
                    currentEntry.ExtractToFile($"{tempPath}\\{currentEntry.Name}");
                }
                else
                {
                    var subDirectoryPath = Path.Combine(tempPath, Path.GetDirectoryName(currentEntry.FullName));
                    Directory.CreateDirectory(subDirectoryPath);
                    currentEntry.ExtractToFile($"{subDirectoryPath}\\{currentEntry.Name}");
                }
            }

        });

Is there any other or better way to extract the files?

Simsons asked Oct 16 '25 11:10


2 Answers

ZipFile was explicitly documented as not guaranteed to be thread-safe for instance members. This is no longer mentioned on the documentation page; the statement comes from a snapshot from Nov 2016.

What you're trying to do cannot be done with this library. There may be some other libraries out there which do support multiple threads per zip file, but I wouldn't expect it.

You can use multi-threading to unzip multiple zip files at the same time, but not multiple entries in the same zip file.
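
A minimal sketch of the multiple-archives case, assuming each zip file should land in its own subfolder named after the archive. The class name MultiArchiveExtractor, the method ExtractAll, and the folder layout are hypothetical choices for illustration; only ZipFile.ExtractToDirectory and Parallel.ForEach come from the framework. On .NET Framework this needs a reference to System.IO.Compression.FileSystem.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Threading.Tasks;

public static class MultiArchiveExtractor
{
    // Hypothetical helper: extracts every archive in zipPaths into
    // outputRoot\<archive name without extension>\.
    public static void ExtractAll(IEnumerable<string> zipPaths, string outputRoot)
    {
        Parallel.ForEach(zipPaths, zipPath =>
        {
            string target = Path.Combine(outputRoot, Path.GetFileNameWithoutExtension(zipPath));
            Directory.CreateDirectory(target);

            // ExtractToDirectory opens and disposes its own archive for this
            // file, so no ZipArchive instance is shared between threads.
            // It throws if a file to be extracted already exists in target.
            ZipFile.ExtractToDirectory(zipPath, target);
        });
    }
}
```

Calling MultiArchiveExtractor.ExtractAll(Directory.GetFiles(folder, "*.zip"), outputRoot) would process a folder of archives. Whether this is faster than a plain loop depends on the disk and the archives, so it is worth benchmarking.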

Rob answered Oct 19 '25 00:10


Writing/reading in parallel is not a good idea, as the hard drive controller will only run the requests one by one. With multiple threads you just add overhead and queue the requests up for no gain.

Try reading the file into memory first. This will avoid your exception; however, if you benchmark it you may find it's actually slower due to the overhead of the extra threads.
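
One way to read "into memory first" is sketched here, under the assumption that every worker gets its own ZipArchive over a read-only MemoryStream wrapping the same byte array. Sharing a single ZipArchive between threads would, as far as I can tell, still not be safe even when it is backed by memory. The class name InMemoryParallelExtractor, the method Extract, and the workers parameter are hypothetical, and entry paths are not validated.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Threading.Tasks;

public static class InMemoryParallelExtractor
{
    // Hypothetical helper: reads the zip once, then lets each worker extract
    // every workers-th entry through a private ZipArchive instance.
    public static void Extract(string zipPath, string outputDir, int workers)
    {
        if (workers < 1) throw new ArgumentOutOfRangeException(nameof(workers));

        byte[] zipBytes = File.ReadAllBytes(zipPath);
        string root = Path.GetFullPath(outputDir);
        Directory.CreateDirectory(root);

        Parallel.For(0, workers, worker =>
        {
            // The byte array is only read, so it can be shared; the stream
            // and the archive are private to this worker.
            using (var stream = new MemoryStream(zipBytes, writable: false))
            using (var archive = new ZipArchive(stream, ZipArchiveMode.Read))
            {
                for (int i = worker; i < archive.Entries.Count; i += workers)
                {
                    ZipArchiveEntry entry = archive.Entries[i];
                    if (string.IsNullOrEmpty(entry.Name)) continue; // directory entry

                    string destination = Path.Combine(root, entry.FullName);
                    Directory.CreateDirectory(Path.GetDirectoryName(destination));
                    entry.ExtractToFile(destination, overwrite: true);
                }
            }
        });
    }
}
```

The whole archive is held in memory, so this only suits archives that comfortably fit in RAM, and it should be timed against the plain foreach loop before being adopted.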

If the file is very large and decompression takes a long time, running the decompression in parallel may improve speed, but the IO reads/writes will not get faster. Most decompression libraries are already multithreaded anyway, so you will only see a performance gain if this one is not.

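Edit: A dodgy way to make the library thread safe is below. It opens the archive once per thread and gives each thread its own slice of the entries. It runs slower than or on par with the sequential version, depending on the zip archive, which supports the point that this is not something that benefits from parallelism.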

Array.ForEach(Directory.GetFiles(@"c:\temp\output\"), File.Delete);

Stopwatch timer = new Stopwatch();
timer.Start();
int numberOfThreads = 8;
var clonedZipEntries = new List<ReadOnlyCollection<ZipArchiveEntry>>();

for (int i = 0; i < numberOfThreads; i++)
{
    clonedZipEntries.Add(ZipFile.Open(@"c:\temp\temp.zip", ZipArchiveMode.Read).Entries);
}
int totalZipEntries = clonedZipEntries[0].Count;
int numberOfEntriesPerThread = totalZipEntries / numberOfThreads;

Func<object,int> action = (object thread) =>
{
    int threadNumber = (int)thread;
    int startIndex = numberOfEntriesPerThread * threadNumber;
    int endIndex = startIndex + numberOfEntriesPerThread;
    if (endIndex > totalZipEntries) endIndex = totalZipEntries;

    for (int i = startIndex; i < endIndex; i++)
    {
        Console.WriteLine($"Extracting {clonedZipEntries[threadNumber][i].Name} via thread {threadNumber}");
        clonedZipEntries[threadNumber][i].ExtractToFile($@"C:\temp\output\{clonedZipEntries[threadNumber][i].Name}");
    }

    //Check for any remainders due to non evenly divisible size
    if (threadNumber == numberOfThreads - 1 && endIndex < totalZipEntries)
    {
        for (int i = endIndex; i < totalZipEntries; i++)
        {
            Console.WriteLine($"Extracting {clonedZipEntries[threadNumber][i].Name} via thread {threadNumber}");
            clonedZipEntries[threadNumber][i].ExtractToFile($@"C:\temp\output\{clonedZipEntries[threadNumber][i].Name}");
        }
    }
    return 0;
};


//Construct the tasks
var tasks = new List<Task<int>>();
for (int threadNumber = 0; threadNumber < numberOfThreads; threadNumber++) tasks.Add(Task<int>.Factory.StartNew(action, threadNumber));

Task.WaitAll(tasks.ToArray());
timer.Stop();

var threaderTimer = timer.ElapsedMilliseconds;



Array.ForEach(Directory.GetFiles(@"c:\temp\output\"), File.Delete);

timer.Reset();
timer.Start();
var entries = ZipFile.Open(@"c:\temp\temp.zip", ZipArchiveMode.Read).Entries;
foreach (var entry in entries)
{
    Console.WriteLine($"Extracting {entry.Name} via thread 1");
    entry.ExtractToFile($@"C:\temp\output\{entry.Name}");
}
timer.Stop();

Console.WriteLine($"Threaded version took: {threaderTimer} ms");
Console.WriteLine($"Non-Threaded version took: {timer.ElapsedMilliseconds} ms");


Console.ReadLine();
rollsch answered Oct 19 '25 00:10


