I need to calculate the CRC32, MD5 and SHA1 of the contents of zip files without decompressing them.
So far I have found out how to calculate these for the zip file itself, e.g.:
CRC32:
import zlib

zip_name = "test.zip"

def Crc32Hasher(file_path):
    buf_size = 65536
    crc32 = 0
    with open(file_path, 'rb') as f:
        while True:
            data = f.read(buf_size)
            if not data:
                break
            crc32 = zlib.crc32(data, crc32)
    return format(crc32 & 0xFFFFFFFF, '08x')

print(Crc32Hasher(zip_name))
SHA1: (MD5 similarly)
import hashlib

zip_name = "test.zip"

def Sha1Hasher(file_path):
    buf_size = 65536
    sha1 = hashlib.sha1()
    with open(file_path, 'rb') as f:
        while True:
            data = f.read(buf_size)
            if not data:
                break
            sha1.update(data)
    return sha1.hexdigest()

print(Sha1Hasher(zip_name))
For the content of the zip file, I can read the CRC32 directly from the zip without having to calculate it, as follows:
Read CRC32 of zip content:
import zipfile

zip_name = "test.zip"

if zip_name.lower().endswith('.zip'):
    z = zipfile.ZipFile(zip_name, "r")
    for info in z.infolist():
        print(info.filename,
              format(info.CRC & 0xFFFFFFFF, '08x'))
But I couldn't figure out how to calculate the SHA1 (or MD5) of the content of zip files without decompressing them first. Is that somehow possible?
It is not possible. You can get the CRC32 because it was precalculated when the archive was created and is stored in the archive itself (it is used for integrity checks). Any other checksum or hash has to be calculated from scratch, which requires at least streaming the archive content, i.e. decompressing it. The data can be decompressed in memory chunk by chunk, though; nothing has to be extracted to disk.
UPD: Possible implementations
libarchive: extra dependencies, supports many archive formats
import hashlib
import libarchive.public as libarchive

fname = "test.zip"  # path to the archive
with libarchive.file_reader(fname) as archive:
    for entry in archive:
        md5 = hashlib.md5()
        for block in entry.get_blocks():
            md5.update(block)
        print(str(entry), md5.hexdigest())
Native zipfile: no dependencies, zip only
import hashlib
import zipfile

fname = "test.zip"  # path to the archive
blocksize = 1024**2  # read in 1 MiB chunks
with zipfile.ZipFile(fname) as archive:
    for name in archive.namelist():
        md5 = hashlib.md5()
        # open() returns a file-like object that decompresses on the fly
        with archive.open(name) as entry:
            while True:
                block = entry.read(blocksize)
                if not block:
                    break
                md5.update(block)
        print(name, md5.hexdigest())
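Since the question asks for CRC32, MD5 and SHA1 together, here is a rough sketch of how all three could be computed in a single streaming pass per member using only the standard library. The function name hash_zip_members and the layout of the returned dictionary are made up for illustration; the stored CRC32 from the zip directory is returned alongside the computed one so the two can be compared.

```python
import hashlib
import zipfile
import zlib


def hash_zip_members(zip_path, blocksize=1024 * 1024):
    """Hypothetical helper: return {member name: dict of hex digests}."""
    results = {}
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no data to hash
            crc32 = 0
            md5 = hashlib.md5()
            sha1 = hashlib.sha1()
            # archive.open() decompresses on the fly; nothing is written to disk
            with archive.open(info) as entry:
                for block in iter(lambda: entry.read(blocksize), b""):
                    crc32 = zlib.crc32(block, crc32)
                    md5.update(block)
                    sha1.update(block)
            results[info.filename] = {
                "crc32": format(crc32 & 0xFFFFFFFF, "08x"),
                # value precalculated at archive creation, read from the zip directory
                "stored_crc32": format(info.CRC & 0xFFFFFFFF, "08x"),
                "md5": md5.hexdigest(),
                "sha1": sha1.hexdigest(),
            }
    return results
```

Calling hash_zip_members("test.zip") and looping over the result's items would print one set of digests per file. Each member is still decompressed once, but only one block is held in memory at a time.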