
Calculate CRC32, MD5 and SHA1 of zip content without decompression in Python

I need to calculate the CRC32, MD5 and SHA1 of the content of zip files without decompressing them.

So far I have found out how to calculate these for the zip file itself, e.g.:

CRC32:

import zlib


zip_name = "test.zip"


def Crc32Hasher(file_path):

    buf_size = 65536
    crc32 = 0

    with open(file_path, 'rb') as f:
        while True:
            data = f.read(buf_size)
            if not data:
                break
            crc32 = zlib.crc32(data, crc32)

    return format(crc32 & 0xFFFFFFFF, '08x')


print(Crc32Hasher(zip_name))

SHA1: (MD5 similarly)

import hashlib


zip_name = "test.zip"


def Sha1Hasher(file_path):

    buf_size = 65536
    sha1 = hashlib.sha1()

    with open(file_path, 'rb') as f:
        while True:
            data = f.read(buf_size)
            if not data:
                break
            sha1.update(data)

    return sha1.hexdigest()


print(Sha1Hasher(zip_name))

For the content of the zip file, I can read the CRC32 directly from the zip without having to calculate it, as follows:

Read CRC32 of zip content:

import zipfile

zip_name = "test.zip"

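# Only open the file if it looks like a zip archive
if zip_name.lower().endswith('.zip'):
    with zipfile.ZipFile(zip_name, "r") as z:
        for info in z.infolist():
            # info.CRC is the CRC32 stored in the archive for this member
            print(info.filename,
                  format(info.CRC & 0xFFFFFFFF, '08x'))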

But I couldn't figure out how to calculate the SHA1 (or MD5) of the content of zip files without decompressing them first. Is that somehow possible?

paradadf asked May 22 '17 03:05



1 Answer

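It is not possible. You can read the CRC32 because it was precalculated and stored when the archive was created (it is used for integrity checks). Any other checksum or hash has to be calculated from scratch, which requires at least streaming the archive content through a decompressor, i.e. unpacking it. The unpacked data does not have to be written to disk, though: it can be hashed chunk by chunk in memory.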
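To illustrate why the CRC32 comes for free, here is a minimal sketch that builds a small archive in memory and compares the stored value with one computed from the raw bytes. The member name `sample.txt` and the payload are made up for the example.

```python
import io
import zipfile
import zlib

# Hypothetical sample data, built in memory so the sketch is self-contained.
payload = b"hello zip" * 1000

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("sample.txt", payload)

with zipfile.ZipFile(buffer) as archive:
    info = archive.getinfo("sample.txt")
    # Read from the archive metadata; no decompression happens here.
    stored_crc = info.CRC
    # Computed from the original, uncompressed bytes.
    computed_crc = zlib.crc32(payload) & 0xFFFFFFFF

print(format(stored_crc, "08x"), format(computed_crc, "08x"))
```

Both values should match, because the stored CRC32 is calculated over the uncompressed data. No MD5 or SHA1 is stored next to it, so those have to be computed from the decompressed stream.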

UPD: Possible implementations

libarchive: extra dependencies, supports many archive formats

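import hashlib
import libarchive.public as libarchive

# fname is the path to the archive
with libarchive.file_reader(fname) as archive:
    for entry in archive:
        md5 = hashlib.md5()
        # hash each entry block by block, without extracting to disk
        for block in entry.get_blocks():
            md5.update(block)
        print(str(entry), md5.hexdigest())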

Native zipfile: no dependencies, zip only

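import hashlib
import zipfile

blocksize = 1024**2  # 1 MiB chunks
# fname is the path to the zip archive
with zipfile.ZipFile(fname) as archive:
    for name in archive.namelist():
        md5 = hashlib.md5()
        # stream the decompressed entry in chunks; nothing is written to disk
        with archive.open(name) as entry:
            while True:
                block = entry.read(blocksize)
                if not block:
                    break
                md5.update(block)
        print(name, md5.hexdigest())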
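
Since the question asks for CRC32, MD5 and SHA1 together, the native zipfile approach can be extended to compute all of them in a single pass over each entry. This is only a sketch: the function name `hash_zip_members` and its parameters are made up for illustration, and it assumes Python 3.6+ (for `ZipInfo.is_dir()`).

```python
import hashlib
import zipfile
import zlib


def hash_zip_members(zip_path, algorithms=("md5", "sha1"), chunk_size=1024 * 1024):
    """Return {member_name: {algorithm: hexdigest, "crc32": hexdigest}}.

    Hypothetical helper: each member is decompressed as a stream and
    hashed chunk by chunk, so nothing is extracted to disk.
    """
    results = {}
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no data
            hashers = {name: hashlib.new(name) for name in algorithms}
            crc = 0
            with archive.open(info) as member:
                # Read until an empty bytes object signals end of stream.
                for chunk in iter(lambda: member.read(chunk_size), b""):
                    crc = zlib.crc32(chunk, crc)
                    for hasher in hashers.values():
                        hasher.update(chunk)
            digests = {name: h.hexdigest() for name, h in hashers.items()}
            digests["crc32"] = format(crc & 0xFFFFFFFF, "08x")
            results[info.filename] = digests
    return results
```

Calling `hash_zip_members("test.zip")` returns one dictionary per file in the archive. The computed `crc32` should equal the value stored in `info.CRC`, which gives a quick sanity check that the streamed data is intact.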
Marat answered Sep 25 '22 01:09
