
Opening zipfile of unsupported compression-type silently returns empty filestream, instead of throwing exception

I seem to be banging my head against a newbie error, and I am not a newbie. I have a 1.2 GB known-good zipfile 'train.zip' containing a 3.5 GB file 'train.csv'. I open the zipfile and the file itself without any exceptions (no LargeZipFile), but the resulting filestream appears to be empty. (UNIX 'unzip -c ...' confirms the archive is good.) The file objects returned by Python's ZipFile.open() are not seekable or tellable, so I can't check the stream position.

Python distribution is 2.7.3 EPD-free 7.3-1 (32-bit), which should be OK for large zips. OS is MacOS 10.6.6.

import csv
import os
import zipfile as zf

zip_pathname = os.path.join('/my/data/path/.../', 'train.zip')
#with zf.ZipFile(zip_pathname).open('train.csv') as z:
z = zf.ZipFile(zip_pathname, 'r', zf.ZIP_DEFLATED, allowZip64=True) # I tried all permutations
z.debug = 1
z.testzip() # zipfile integrity is ok

z1 = z.open('train.csv', 'r') # our file keeps coming up empty?

# Check the info to confirm z1 is indeed a valid 3.5Gb file...
z1i = z.getinfo('train.csv')
for att in ('filename', 'file_size', 'compress_size', 'compress_type', 'date_time',  'CRC', 'comment'):
    print '%s:\t' % att, getattr(z1i,att)
# ... and it looks ok. compress_type = 9 ok?
#filename:  train.csv
#file_size: 3729150126
#compress_size: 1284613649
#compress_type: 9
#date_time: (2012, 8, 20, 15, 30, 4)
#CRC:   1679210291

# All attempts to read z1 come up empty?!
# z1.readline() gives ''
# z1.readlines() gives []
# z1.read() takes ~60sec but also returns '' ?

# code I would want to run (inside a function) is:
reader = csv.reader(z1)
header = reader.next()
# return reader
smci asked Oct 09 '12 23:10


4 Answers

The cause is the combination of:

  • this file's compression type is 9: Deflate64/Enhanced Deflate (PKWARE's proprietary format, as opposed to the more common type 8, Deflate; see section 4.4.5, "compression method", of the ZIP specification)
  • a zipfile bug: it did not throw an exception for unsupported compression types, and instead silently returned a bad file object. Aargh. How bogus. UPDATE: I filed bug 14313 and it was fixed back in 2012, so zipfile now raises NotImplementedError when the compression type is unknown.

A command-line workaround is to unzip, then rezip, to get a plain type 8 (Deflate) archive.
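
The rezip half of that workaround can also be done from Python once the archive has been extracted with a tool that understands Deflate64 (for example `unzip` or 7-Zip). A minimal sketch, assuming Python 3 and that the contents have already been extracted to a directory; the function name `rezip_as_deflate` and its parameters are hypothetical:

```python
import os
import zipfile

def rezip_as_deflate(extracted_dir, out_zip_path):
    """Re-archive an already-extracted directory using plain Deflate (type 8).

    out_zip_path should live outside extracted_dir, otherwise the new
    archive would try to include itself.
    """
    with zipfile.ZipFile(out_zip_path, "w",
                         compression=zipfile.ZIP_DEFLATED,
                         allowZip64=True) as zf:
        for root, _dirs, files in os.walk(extracted_dir):
            for name in files:
                full_path = os.path.join(root, name)
                # Store member names relative to the extraction directory.
                arcname = os.path.relpath(full_path, extracted_dir)
                zf.write(full_path, arcname)
    return out_zip_path
```

The resulting archive uses method 8 for every member, so the standard zipfile module should be able to open it.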

zipfile will throw an exception in 2.7 and 3.2+. I guess zipfile will never be able to actually handle type 9, for legal reasons. The Python docs make no mention whatsoever that zipfile cannot handle other compression types :(
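
Rather than relying on the exception, one option is to inspect each member's compress_type before opening it. A minimal sketch, assuming Python 3; the helper names are hypothetical, and the supported set simply mirrors the four method constants the standard zipfile module defines (bzip2 and LZMA additionally require those modules to be present in the interpreter build):

```python
import zipfile

# Compression methods the standard zipfile module knows how to read.
SUPPORTED_METHODS = {
    zipfile.ZIP_STORED,    # 0
    zipfile.ZIP_DEFLATED,  # 8
    zipfile.ZIP_BZIP2,     # 12
    zipfile.ZIP_LZMA,      # 14
}

def is_supported(compress_type):
    """Return True if zipfile has a decompressor for this method number."""
    return compress_type in SUPPORTED_METHODS

def unsupported_members(zip_path):
    """List (filename, compress_type) for members zipfile cannot decompress."""
    with zipfile.ZipFile(zip_path) as zf:
        return [(info.filename, info.compress_type)
                for info in zf.infolist()
                if not is_supported(info.compress_type)]
```

For the archive in the question, this should report train.csv with method 9.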

smci answered Nov 04 '22 22:11


Compression type 9 is Deflate64/Enhanced Deflate, which Python's zipfile module doesn't support (essentially because zipfile delegates decompression to zlib, and zlib doesn't support Deflate64).

And if smaller files work fine, I suspect this zipfile was created by Windows Explorer: for larger files, Windows Explorer can decide to use Deflate64.

(Note that Zip64 is different to Deflate64. Zip64 is supported by Python's zipfile module, and just makes a few changes to how some metadata is stored in the zipfile, but still uses regular Deflate for the compressed data.)
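
To illustrate the distinction, here is a small sketch (Python 3.6+, standard library only) that writes a member with Zip64 forced on and then confirms that its compression method is still ordinary Deflate. The archive is built in memory and the member name is made up for the example:

```python
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    # force_zip64=True asks zipfile to use the Zip64 format for this
    # member's header even though the member is tiny.
    with zf.open("example.csv", "w", force_zip64=True) as member:
        member.write(b"id,value\n1,2\n")

with zipfile.ZipFile(buf) as zf:
    info = zf.getinfo("example.csv")
    method = info.compress_type   # still the regular Deflate method
    data = zf.read("example.csv")
```

Here `method` is compared against zipfile.ZIP_DEFLATED (8), not 9: Zip64 changes the metadata layout, not the compression algorithm.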

However, stream-unzip now supports Deflate64. Modifying its example to read from the local disk, and to read a CSV file as in your example:

import csv
from io import IOBase, TextIOWrapper
import os

from stream_unzip import stream_unzip

def get_zipped_chunks(zip_pathname):
    with open(zip_pathname, 'rb') as f:
        while True:
            chunk = f.read(65536)
            if not chunk:
                break
            yield chunk

def get_unzipped_chunks(zipped_chunks, filename):
    for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks):
        if file_name != filename:
            for chunk in unzipped_chunks:
                pass
            continue
        yield from unzipped_chunks

def to_str_lines(iterable):
    # Based on the answer at https://stackoverflow.com/a/70639580/1319998
    chunk = b''
    offset = 0
    it = iter(iterable)

    def up_to_iter(size):
        nonlocal chunk, offset

        while size:
            if offset == len(chunk):
                try:
                    chunk = next(it)
                except StopIteration:
                    break
                else:
                    offset = 0
            to_yield = min(size, len(chunk) - offset)
            offset = offset + to_yield
            size -= to_yield
            yield chunk[offset - to_yield:offset]

    class FileLikeObj(IOBase):
        def readable(self):
            return True
        def read(self, size=-1):
            return b''.join(up_to_iter(float('inf') if size is None or size < 0 else size))

    yield from TextIOWrapper(FileLikeObj(), encoding='utf-8', newline='')

zipped_chunks = get_zipped_chunks(os.path.join('/my/data/path/.../', 'train.zip'))
unzipped_chunks = get_unzipped_chunks(zipped_chunks, b'train.csv')
str_lines = to_str_lines(unzipped_chunks)
csv_reader = csv.reader(str_lines)

for row in csv_reader:
    print(row)
Michal Charemza answered Nov 05 '22 00:11


My solution for handling compression types that aren't supported by Python's ZipFile was to rely on a call to 7zip when ZipFile.extractall fails.

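from zipfile import ZipFile
import subprocess

def Unzip(zipFile, destinationDirectory):
    try:
        with ZipFile(zipFile, 'r') as zipObj:
            # Extract all the contents of the zip file into the destination directory
            zipObj.extractall(destinationDirectory)
    except Exception:
        # e.g. NotImplementedError for an unsupported compression type
        print("An exception occurred extracting with Python ZipFile library.")
        print("Attempting to extract using 7zip")
        # 'e' extracts without directory paths; use 'x' to keep the archive's folder structure.
        # run() waits for 7z to finish, and check=True raises if 7z reports a failure.
        subprocess.run(["7z", "e", zipFile, f"-o{destinationDirectory}", "-y"], check=True)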
Brett Allen answered Nov 04 '22 23:11


If the problem is because of lack of support for the Deflate64 algorithm in the Python standard library, there is now a package available named "zipfile-deflate64".

It is still listed as being in an "alpha" phase. I just started using it yesterday, 2022-07-18, and it did the job for me.

It is very easy to use: once it is imported, you can use the zipfile library as you normally would, with added support for Deflate64.

The "zipfile-deflate64" package is available on PyPI, and the project is hosted on GitHub.

Here is an example of how to use it. The API is the same as the built-in zipfile package:

import zipfile_deflate64 as zipfile

tag_hist_path = "path\\to\\your\\zipfile.zip"
parentZip = zipfile.ZipFile(tag_hist_path, mode="r", compression=zipfile.ZIP_DEFLATED64)
fileNames = [f.filename for f in parentZip.filelist]
memberArchive = parentZip.open(fileNames[0], mode="r")
b = memberArchive.read() #reading all bytes at once, assuming file isn't too big
txt = b.decode("utf-8") #decode bytes to text string
memberArchive.close()
parentZip.close()

And here is a more concise and cleaner way to work with such an archive, per @smci's recommendation, so that you don't have to manage the stream resources yourself (i.e., close them) if an error occurs:

tag_hist_path = "path\\to\\your\\zipfile.zip"
with zipfile.ZipFile(tag_hist_path, mode="r", compression=zipfile.ZIP_DEFLATED64) as parentZip:
    for member in parentZip.filelist:
        with parentZip.open(member, mode="r") as memberArchive:
            # Do something with each opened member, e.g. read its bytes
            b = memberArchive.read()
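
For a member as large as the 3.5 GB CSV in the question, reading all bytes at once may not be practical. Below is a minimal sketch of streaming the rows instead, assuming Python 3 and a UTF-8 encoded file; the function name `iter_csv_rows` is hypothetical, and the fallback import is only there so the sketch still runs (without Deflate64 support) when the package is not installed:

```python
import csv
import io

try:
    # Assumption: the third-party zipfile-deflate64 package is installed;
    # importing it adds Deflate64 support behind the usual zipfile API.
    import zipfile_deflate64 as zipfile
except ImportError:
    import zipfile  # standard library fallback: no Deflate64 support

def iter_csv_rows(zip_path, member_name, encoding="utf-8"):
    """Yield CSV rows from one archive member without reading it all at once."""
    with zipfile.ZipFile(zip_path, mode="r") as archive:
        with archive.open(member_name, mode="r") as raw:
            # TextIOWrapper decodes the binary stream lazily; newline=""
            # leaves line-ending handling to the csv module.
            text = io.TextIOWrapper(raw, encoding=encoding, newline="")
            for row in csv.reader(text):
                yield row
```

Because this is a generator, rows are decompressed and decoded only as the caller iterates, for example with `for row in iter_csv_rows(tag_hist_path, "train.csv"): ...`.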
BioData41 answered Nov 05 '22 00:11