ZipFile.testzip() returning different results on Python 2 and Python 3

Using the zipfile module to unzip a large data file in Python works correctly on Python 2 but produces the following error on Python 3.6.0:

BadZipFile: Bad CRC-32 for file 'myfile.csv'

I traced this to error handling code checking the CRC values.

Using ZipFile.testzip() on Python 2 returns nothing (all files are fine). Running it on Python 3 returns 'myfile.csv' indicating a problem with that file.

Code to reproduce on both Python 2 and Python 3 (involves a 300 MB download, sorry):

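import sys
import zipfile

# urlretrieve lives in urllib.request on Python 3 and in urllib on Python 2;
# a bare "import urllib" does not make urllib.request available on Python 3
if sys.version_info >= (3, 0, 0):
    from urllib.request import urlretrieve
else:
    from urllib import urlretrieve

url = "https://de.iplantcollaborative.org/anon-files//iplant/home/shared/commons_repo/curated/Vertnet_Amphibia_Sep2016/VertNet_Amphibia_Sept2016.zip"

urlretrieve(url, "vertnet_latest_amphibians.zip")

archive = zipfile.ZipFile("vertnet_latest_amphibians.zip")
# testzip() returns None if every file checks out, else the first bad file name
print(archive.testzip())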

Does anyone understand why this difference exists, and whether there's a way to get Python 3 to properly extract the file using:

archive.extract("vertnet_latest_amphibians.csv")
Ethan White asked Jan 05 '17 19:01



1 Answer

The CRC value is OK. The CRC of 'vertnet_latest_amphibians.csv' recorded in the zip is 0x87203305. After extraction, this is indeed the CRC of the file.

However, the recorded uncompressed size is incorrect. The zip file records a compressed size of 309,723,024 bytes and an uncompressed size of 292,198,614 bytes (that's smaller!). In reality, the uncompressed file is 4,587,165,910 bytes (4.3 GiB). That is past the 4 GiB limit at which 32-bit size fields wrap around, and the recorded value is exactly 2**32 bytes short of the real one.
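The wraparound can be checked with plain arithmetic. This is only a sanity check on the sizes quoted above; the variable names are made up for illustration.

```python
# Sizes quoted above, in bytes.
actual_size = 4587165910    # real uncompressed size of the CSV
recorded_size = 292198614   # uncompressed size stored in the zip

# A 32-bit size field keeps only the low 32 bits of the value.
wrapped = actual_size % 2**32

print(wrapped == recorded_size)
print(actual_size - recorded_size == 2**32)
```

Both comparisons hold, which is why adding 2**32 back to the recorded size restores the correct value.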

You can fix it like this (this worked in Python 3.5.2, at least):

import zipfile

archive = zipfile.ZipFile("vertnet_latest_amphibians.zip")
# Add back the 2**32 bytes lost when the uncompressed size wrapped around
archive.getinfo("vertnet_latest_amphibians.csv").file_size += 2**32
archive.testzip()  # now passes
archive.extract("vertnet_latest_amphibians.csv")  # now works
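The same failure and fix can be reproduced without the 300 MB download. The sketch below builds a small archive in memory and then understates the recorded uncompressed size by hand. It assumes (as appears to be the case in Python 3) that reading stops once `file_size` bytes have been produced and the CRC is checked at that point. The entry name `demo.csv` and the payload are invented for the example.

```python
import io
import zipfile

# Build a small, valid archive in memory (hypothetical entry name and data).
payload = b"some,csv,row\n" * 1000
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("demo.csv", payload)

archive = zipfile.ZipFile(buf)
info = archive.getinfo("demo.csv")

ok_before = archive.testzip()   # None: sizes and CRC agree

# Simulate a recorded size that is too small, as in the VertNet archive.
info.file_size -= 100
bad = archive.testzip()         # name of the first entry failing its CRC check

# Correct the recorded size in memory, analogous to adding 2**32 above.
info.file_size += 100
ok_after = archive.testzip()    # None again

print(ok_before, bad, ok_after)
```

Note that this only changes the in-memory ZipInfo; the archive on disk still carries the wrong size.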
Nick Matteo answered Sep 28 '22 01:09
