How to count files inside zip in AWS S3 without downloading it?

Case: There is a large zip file in an S3 bucket which contains a large number of images. Is there a way, without downloading the whole file, to read the metadata or something else that tells me how many files are inside the zip?

When the file is local, in Python I can just open it with zipfile.ZipFile() and call the namelist() method, which returns a list of all the files inside, and I can count that. However, I'm not sure how to do this when the file resides in S3 without downloading it. It would also be ideal if this worked from a Lambda function.
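For reference, the local approach described above looks roughly like this (the path is hypothetical):

import zipfile

# Open a local archive and count its entries.
with zipfile.ZipFile("images.zip") as zf:
    print(len(zf.namelist()))  # number of entries in the zip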

asked Jan 22 '17 by alfredox


2 Answers

The ZIP format stores its table of contents (the central directory) at the end of the archive, so you can list the entries with two small ranged GET requests instead of downloading the whole file: one for the End of Central Directory (EOCD) record, and one for the central directory it points to. I think this will solve your problem:

import io
import zipfile

def fetch(bucket_name, key_name, start, length, client_s3):
    """
    Range-fetches bytes [start, start + length) of an S3 object.
    """
    end = start + length - 1
    s3_object = client_s3.get_object(Bucket=bucket_name, Key=key_name, Range="bytes=%d-%d" % (start, end))
    return s3_object['Body'].read()


def parse_int(data):
    """
    Parses 2 or 4 little-endian bytes into their corresponding integer value.
    """
    val = data[0] + (data[1] << 8)
    if len(data) > 3:
        val += (data[2] << 16) + (data[3] << 24)
    return val


def list_files_in_s3_zipped_object(bucket_name, key_name, client_s3):
    """
    Lists the files in a zipped S3 object without downloading it, and
    returns the number of files inside the zip file.
    See: https://stackoverflow.com/questions/41789176/how-to-count-files-inside-zip-in-aws-s3-without-downloading-it
    Based on: https://stackoverflow.com/questions/51351000/read-zip-files-from-s3-without-downloading-the-entire-file

    bucket_name: name of the bucket
    key_name:    path to the zipfile inside the bucket
    client_s3:   an object created using boto3.client("s3")
    """
    response = client_s3.head_object(Bucket=bucket_name, Key=key_name)
    size = response['ContentLength']

    # The End of Central Directory (EOCD) record occupies the last 22 bytes
    # of the archive (assuming it has no trailing comment).
    eocd = fetch(bucket_name, key_name, size - 22, 22, client_s3)

    # start offset and size of the central directory
    cd_start = parse_int(eocd[16:20])
    cd_size = parse_int(eocd[12:16])

    # fetch the central directory, append the EOCD, and open it as a zipfile
    cd = fetch(bucket_name, key_name, cd_start, cd_size, client_s3)
    zip_file = zipfile.ZipFile(io.BytesIO(cd + eocd))

    print("there are %s files in the zipfile" % len(zip_file.filelist))

    for entry in zip_file.filelist:
        print("filename: %s (%s bytes uncompressed)" % (entry.filename, entry.file_size))
    return len(zip_file.filelist)

if __name__ == "__main__":
    import boto3
    import sys

    client_s3 = boto3.client("s3")
    bucket_name = sys.argv[1]
    key_name = sys.argv[2]
    list_files_in_s3_zipped_object(bucket_name, key_name, client_s3)
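Note that the snippet above assumes the EOCD record occupies exactly the last 22 bytes, which only holds when the archive has no trailing comment. A zip comment (up to 65535 bytes) pushes the EOCD further back; in that case you have to scan the tail of the file for the EOCD signature. A minimal sketch of such a search, reusing the fetch() helper above (find_eocd is a hypothetical helper, not part of the original answer):

def find_eocd(bucket_name, key_name, size, client_s3, max_comment=65535):
    """
    Scans the tail of the archive backwards for the EOCD signature,
    so archives with a trailing comment are handled too.
    """
    # The EOCD is 22 bytes plus an optional comment of up to 65535 bytes.
    tail_length = min(size, 22 + max_comment)
    tail = fetch(bucket_name, key_name, size - tail_length, tail_length, client_s3)
    pos = tail.rfind(b"PK\x05\x06")  # EOCD signature, 0x06054b50 little-endian
    if pos == -1:
        raise ValueError("EOCD record not found - not a valid zip file?")
    return tail[pos:pos + 22]  # the fixed 22-byte part, as parse_int expects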
answered Sep 21 '22 by Daniel777


I improved the solution given above - it now also handles archives larger than 4 GiB:

import boto3
import io
import struct
import zipfile

s3 = boto3.client('s3')

EOCD_RECORD_SIZE = 22            # fixed-size EOCD record (assumes no trailing comment)
ZIP64_EOCD_RECORD_SIZE = 56      # fixed part of the zip64 EOCD record
ZIP64_EOCD_LOCATOR_SIZE = 20

MAX_STANDARD_ZIP_SIZE = 4_294_967_295  # 2**32 - 1, the largest offset a standard zip can address

def lambda_handler(event, context):
    bucket = event['bucket']
    key = event['key']
    zip_file = get_zip_file(bucket, key)
    print_zip_content(zip_file)

def get_zip_file(bucket, key):
    file_size = get_file_size(bucket, key)
    eocd_record = fetch(bucket, key, file_size - EOCD_RECORD_SIZE, EOCD_RECORD_SIZE)
    if file_size <= MAX_STANDARD_ZIP_SIZE:
        cd_start, cd_size = get_central_directory_metadata_from_eocd(eocd_record)
        central_directory = fetch(bucket, key, cd_start, cd_size)
        return zipfile.ZipFile(io.BytesIO(central_directory + eocd_record))
    else:
        zip64_eocd_record = fetch(bucket, key,
                                  file_size - (EOCD_RECORD_SIZE + ZIP64_EOCD_LOCATOR_SIZE + ZIP64_EOCD_RECORD_SIZE),
                                  ZIP64_EOCD_RECORD_SIZE)
        zip64_eocd_locator = fetch(bucket, key,
                                   file_size - (EOCD_RECORD_SIZE + ZIP64_EOCD_LOCATOR_SIZE),
                                   ZIP64_EOCD_LOCATOR_SIZE)
        cd_start, cd_size = get_central_directory_metadata_from_eocd64(zip64_eocd_record)
        central_directory = fetch(bucket, key, cd_start, cd_size)
        return zipfile.ZipFile(io.BytesIO(central_directory + zip64_eocd_record + zip64_eocd_locator + eocd_record))


def get_file_size(bucket, key):
    head_response = s3.head_object(Bucket=bucket, Key=key)
    return head_response['ContentLength']

def fetch(bucket, key, start, length):
    end = start + length - 1
    response = s3.get_object(Bucket=bucket, Key=key, Range="bytes=%d-%d" % (start, end))
    return response['Body'].read()

def get_central_directory_metadata_from_eocd(eocd):
    cd_size = parse_little_endian_to_int(eocd[12:16])
    cd_start = parse_little_endian_to_int(eocd[16:20])
    return cd_start, cd_size

def get_central_directory_metadata_from_eocd64(eocd64):
    cd_size = parse_little_endian_to_int(eocd64[40:48])
    cd_start = parse_little_endian_to_int(eocd64[48:56])
    return cd_start, cd_size

def parse_little_endian_to_int(little_endian_bytes):
    # Offsets and sizes in the zip headers are unsigned, so use "I"/"Q"
    # rather than the signed "i"/"q" (a central directory starting beyond
    # 2 GiB would otherwise come back negative).
    format_character = "I" if len(little_endian_bytes) == 4 else "Q"
    return struct.unpack("<" + format_character, little_endian_bytes)[0]

def print_zip_content(zip_file):
    files = [zi.filename for zi in zip_file.filelist]
    print(f"{len(files)} files: {files}")
answered Sep 21 '22 by kwiecien