I seem to be banging my head against a newbie error, and I am not a newbie.
I have a 1.2 GB known-good zipfile 'train.zip' containing a 3.5 GB file 'train.csv'.
I can open the zipfile and the file inside it without any exceptions (no LargeZipFile), but the resulting file stream appears to be empty. (UNIX 'unzip -c ...' confirms the archive is good.)
The file objects returned by Python's ZipFile.open() are not seekable or tellable, so I can't check that way.
Python distribution is 2.7.3 EPD-free 7.3-1 (32-bit), which should be OK for large zips. OS is Mac OS X 10.6.6.
import csv
import os
import zipfile as zf

zip_pathname = os.path.join('/my/data/path/.../', 'train.zip')

#with zf.ZipFile(zip_pathname).open('train.csv') as z:
z = zf.ZipFile(zip_pathname, 'r', zf.ZIP_DEFLATED, allowZip64=True) # I tried all permutations
z.debug = 1
z.testzip() # zipfile integrity is ok

z1 = z.open('train.csv', 'r') # our file keeps coming up empty?

# Check the info to confirm z1 is indeed a valid 3.5 GB file...
z1i = z.getinfo('train.csv')
for att in ('filename', 'file_size', 'compress_size', 'compress_type', 'date_time', 'CRC', 'comment'):
    print '%s:\t' % att, getattr(z1i, att)
# ... and it looks ok. compress_type = 9 ok?
#filename: train.csv
#file_size: 3729150126
#compress_size: 1284613649
#compress_type: 9
#date_time: (2012, 8, 20, 15, 30, 4)
#CRC: 1679210291
# All attempts to read z1 come up empty?!
# z1.readline() gives ''
# z1.readlines() gives []
# z1.read() takes ~60sec but also returns '' ?
# code I would want to run is:
reader = csv.reader(z1)
header = reader.next()
return reader
The cause is the combination of: the member was compressed with type 9 (Deflate64/Enhanced Deflate), which zipfile does not support, and this version of zipfile failing silently (empty reads) instead of raising an exception.
A command-line workaround is to unzip, then rezip, to get a plain type 8: Deflated.
zipfile will throw an exception in newer 2.7 and in 3.2+. I guess zipfile will never be able to actually handle type 9, for legal reasons. The Python doc makes no mention whatsoever that zipfile cannot handle other compression types :(
Compression type 9 is Deflate64/Enhanced Deflate, which Python's zipfile module doesn't support (essentially since zlib doesn't support Deflate64, which zipfile delegates to).
And if smaller files work fine, I suspect this zipfile was created by Windows Explorer: for larger files Windows Explorer may decide to use Deflate64.
(Note that Zip64 is different to Deflate64. Zip64 is supported by Python's zipfile module, and just makes a few changes to how some metadata is stored in the zipfile, but still uses regular Deflate for the compressed data.)
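As a quick diagnostic before trying to read a member, you can inspect its compression method with only the standard library (a minimal sketch using the file names from the question):

import zipfile

with zipfile.ZipFile('train.zip') as z:
    info = z.getinfo('train.csv')
    # 8 = Deflate (supported by zipfile); 9 = Deflate64 (not supported)
    print(info.compress_type)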
However, stream-unzip now supports Deflate64 (installable with pip install stream-unzip). Modifying its example to read from the local disk, and to read a CSV file as in your example:
import csv
from io import IOBase, TextIOWrapper
import os

from stream_unzip import stream_unzip

def get_zipped_chunks(zip_pathname):
    # Yield the raw bytes of the zip file in 64 KiB chunks
    with open(zip_pathname, 'rb') as f:
        while True:
            chunk = f.read(65536)
            if not chunk:
                break
            yield chunk

def get_unzipped_chunks(zipped_chunks, filename):
    for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks):
        if file_name != filename:
            # Each member's chunks must be exhausted, even for files we skip
            for chunk in unzipped_chunks:
                pass
            continue
        yield from unzipped_chunks

def to_str_lines(iterable):
    # Based on the answer at https://stackoverflow.com/a/70639580/1319998
    chunk = b''
    offset = 0
    it = iter(iterable)

    def up_to_iter(size):
        nonlocal chunk, offset

        while size:
            if offset == len(chunk):
                try:
                    chunk = next(it)
                except StopIteration:
                    break
                else:
                    offset = 0
            to_yield = min(size, len(chunk) - offset)
            offset = offset + to_yield
            size -= to_yield
            yield chunk[offset - to_yield:offset]

    class FileLikeObj(IOBase):
        def readable(self):
            return True

        def read(self, size=-1):
            return b''.join(up_to_iter(float('inf') if size is None or size < 0 else size))

    yield from TextIOWrapper(FileLikeObj(), encoding='utf-8', newline='')

zipped_chunks = get_zipped_chunks(os.path.join('/my/data/path/.../', 'train.zip'))
unzipped_chunks = get_unzipped_chunks(zipped_chunks, b'train.csv')
str_lines = to_str_lines(unzipped_chunks)
csv_reader = csv.reader(str_lines)

for row in csv_reader:
    print(row)
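If, as in the question, you want the header row separately, csv_reader above is a regular csv.reader, so you can consume the first row before iterating the rest:

header = next(csv_reader)  # first row of train.csv; later iteration yields the data rows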
My solution for handling compression types that aren't supported by Python's ZipFile was to rely on a call to 7zip when ZipFile.extractall fails.
from zipfile import ZipFile
import subprocess

def Unzip(zipFile, destinationDirectory):
    try:
        with ZipFile(zipFile, 'r') as zipObj:
            # Extract all the contents of the zip file into the destination directory
            zipObj.extractall(destinationDirectory)
    except Exception:
        print("An exception occurred extracting with Python ZipFile library.")
        print("Attempting to extract using 7zip")
        # Block until 7z finishes, and raise if it fails
        subprocess.run(["7z", "e", f"{zipFile}", f"-o{destinationDirectory}", "-y"], check=True)
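Note that this assumes the 7z executable is available on your PATH (e.g. from the p7zip package on macOS/Linux). A hypothetical call for the archive in the question (the destination directory here is just an assumption):

Unzip('/my/data/path/.../train.zip', '/my/data/path/extracted')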
If the problem is because of lack of support for the Deflate64 algorithm in the Python standard library, there is now a package available named "zipfile-deflate64".
It is still listed as being in an "alpha" phase. I just started using it yesterday, 2022-07-18, and it did the job for me.
It is very easy to use: importing it lets you use the zipfile library like you normally would, with added support for Deflate64 (install it with pip install zipfile-deflate64).
Link to the "zipfile-deflate64" package on PyPI: https://pypi.org/project/zipfile-deflate64/
Link to the "zipfile-deflate64" project on GitHub: https://github.com/brianhelba/zipfile-deflate64
Here is an example of how to use it. The API is the same as the built-in zipfile package:
import zipfile_deflate64 as zipfile

tag_hist_path = "path\\to\\your\\zipfile.zip"

parentZip = zipfile.ZipFile(tag_hist_path, mode="r", compression=zipfile.ZIP_DEFLATED64)
fileNames = [f.filename for f in parentZip.filelist]

memberArchive = parentZip.open(fileNames[0], mode="r")
b = memberArchive.read()   # reading all bytes at once, assuming the file isn't too big
txt = b.decode("utf-8")    # decode bytes to a text string

memberArchive.close()
parentZip.close()
And here is a more concise and cleaner way to work with such an archive, per @smci's recommendation, so you don't have to manage the stream resources yourself (i.e., close them) in case of an error:
tag_hist_path = "path\\to\\your\\zipfile.zip"

with zipfile.ZipFile(tag_hist_path, mode="r", compression=zipfile.ZIP_DEFLATED64) as parentZip:
    for member in parentZip.filelist:
        with parentZip.open(member, mode="r") as memberArchive:
            # Do something with each opened member file
            ...
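Since the file in the question is a 3.5 GB CSV, reading it with a single read() call may exhaust memory. Here is a sketch that streams it row by row instead (assuming zipfile-deflate64 is installed; the file names are the ones from the question):

import csv
import io
import zipfile_deflate64 as zipfile

with zipfile.ZipFile('train.zip', mode='r') as parentZip:
    with parentZip.open('train.csv', mode='r') as memberArchive:
        # Wrap the binary member stream in TextIOWrapper so csv.reader gets
        # text, without loading the whole file into memory
        reader = csv.reader(io.TextIOWrapper(memberArchive, encoding='utf-8', newline=''))
        header = next(reader)
        for row in reader:
            pass  # process each row here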