Case: A large zip file in an S3 bucket contains a large number of images. Is there a way, without downloading the whole file, to read its metadata (or something similar) to find out how many files are inside the zip file?
When the file is local, in Python I can just open it with zipfile.ZipFile() and call the namelist() method, which returns a list of all the files inside, and count that. However, I am not sure how to do this when the file resides in S3 without having to download it. It would be best if this were also possible with Lambda.
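For reference, the local approach described in the question can be sketched roughly as follows. The helper name `count_zip_entries` and the in-memory archive are illustrative assumptions, used only to keep the example self-contained.

```python
import io
import zipfile


def count_zip_entries(zip_source):
    """Return the number of entries in a zip, given a path or a file-like object."""
    with zipfile.ZipFile(zip_source) as zf:
        # namelist() returns one name per archive member (directory entries included).
        return len(zf.namelist())


# Build a small archive in memory so the example needs no file on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("a.png", b"first")
    zf.writestr("b.png", b"second")
    zf.writestr("c.png", b"third")

entry_count = count_zip_entries(buffer)
print(entry_count)
```

Note that this reads only the central directory at the end of the archive, which is why the same idea can be applied to S3 with ranged requests.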
Count the number of objects in an S3 bucket with the AWS Console: Open the AWS S3 console and click on your bucket's name. In the Objects tab, click the top row checkbox to select all files and folders, or select the folders you want to count the files for. Click on the Actions button and select Calculate total size.
For reference, with the Zip64 extension to the Zip file format, Zip files of up to 16 exabytes (2 to the 64th power bytes, which is over 16 billion gigabytes) are possible. Likewise, over 4 billion files and folders can be included in a Zip file.
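As a rough illustration of how a reader can tell whether the Zip64 records must be consulted, the sketch below checks the classic end-of-central-directory (EOCD) record for the sentinel values the Zip format reserves for "value stored in Zip64 fields". The function name `eocd_needs_zip64` is a hypothetical helper, and the sketch assumes it receives the 22-byte fixed part of the EOCD.

```python
import struct

# Sentinel values from the Zip format: a classic EOCD field holding one of
# these means the real value is stored in the Zip64 records instead.
ZIP_ENTRY_COUNT_SENTINEL = 0xFFFF
ZIP_SIZE_OFFSET_SENTINEL = 0xFFFFFFFF

# The size arithmetic quoted above: 2**64 bytes expressed in GiB.
zip64_max_bytes = 2 ** 64
zip64_max_gib = zip64_max_bytes // 2 ** 30  # 17_179_869_184


def eocd_needs_zip64(eocd):
    """Return True if the 22-byte EOCD record defers to the Zip64 records."""
    (signature, _disk, _cd_disk, _entries_on_disk,
     total_entries, cd_size, cd_offset, _comment_len) = struct.unpack("<4s4H2LH", eocd[:22])
    if signature != b"PK\x05\x06":
        raise ValueError("not an end-of-central-directory record")
    return (total_entries == ZIP_ENTRY_COUNT_SENTINEL
            or cd_size == ZIP_SIZE_OFFSET_SENTINEL
            or cd_offset == ZIP_SIZE_OFFSET_SENTINEL)
```

Checking the sentinels is more direct than guessing from the file size, since an archive smaller than 4 GiB can still use Zip64 when it holds more than 65,535 entries.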
Here are the steps that I carried out: Upload a zip file (in my case, a zipped application folder) to an S3 bucket (the source bucket). Uploading the file triggers a Lambda function, which extracts all the files and folders inside the zip file and uploads them to a new S3 bucket (the target bucket).
I think this will solve your problem:
import io
import zipfile


def fetch(bucket_name, key_name, start, length, client_s3):
    """
    range-fetches a S3 key
    """
    end = start + length - 1
    s3_object = client_s3.get_object(Bucket=bucket_name, Key=key_name, Range="bytes=%d-%d" % (start, end))
    return s3_object['Body'].read()


def parse_int(data):
    """
    parses 2 or 4 little-endian bytes into their corresponding integer value
    """
    val = (data[0]) + ((data[1]) << 8)
    if len(data) > 3:
        val += ((data[2]) << 16) + ((data[3]) << 24)
    return val


def list_files_in_s3_zipped_object(bucket_name, key_name, client_s3):
    """
    List files in s3 zipped object, without downloading it. Returns the number of files inside the zip file.
    See : https://stackoverflow.com/questions/41789176/how-to-count-files-inside-zip-in-aws-s3-without-downloading-it
    Based on : https://stackoverflow.com/questions/51351000/read-zip-files-from-s3-without-downloading-the-entire-file
    bucket_name: name of the bucket
    key_name: path to zipfile inside bucket
    client_s3: an object created using boto3.client("s3")
    """
    response = client_s3.head_object(Bucket=bucket_name, Key=key_name)
    size = response['ContentLength']
    # the end-of-central-directory (EOCD) record is the last 22 bytes,
    # assuming the archive has no trailing comment and is not Zip64
    eocd = fetch(bucket_name, key_name, size - 22, 22, client_s3)
    # start offset and size of the central directory
    cd_start = parse_int(eocd[16:20])
    cd_size = parse_int(eocd[12:16])
    # fetch central directory, append EOCD, and open as zipfile!
    cd = fetch(bucket_name, key_name, cd_start, cd_size, client_s3)
    zip_file = zipfile.ZipFile(io.BytesIO(cd + eocd))
    print("there are %s files in the zipfile" % len(zip_file.filelist))
    for entry in zip_file.filelist:
        print("filename: %s (%s bytes uncompressed)" % (entry.filename, entry.file_size))
    return len(zip_file.filelist)


if __name__ == "__main__":
    import boto3
    import sys

    client_s3 = boto3.client("s3")
    bucket_name = sys.argv[1]
    key_name = sys.argv[2]
    list_files_in_s3_zipped_object(bucket_name, key_name, client_s3)
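If only the count is needed, the EOCD record already carries it: bytes 10 to 12 hold the total number of central directory entries, so the central directory itself does not have to be fetched. The sketch below is one possible way to read that field, and it also tolerates a trailing archive comment by searching for the EOCD signature in the tail of the file. The function name `count_entries_from_tail` is hypothetical, and the sketch assumes `tail` holds the last bytes of the object (at most 22 + 65,535), obtained with a ranged request like the one above.

```python
import io
import struct
import zipfile

EOCD_SIGNATURE = b"PK\x05\x06"
EOCD_FIXED_SIZE = 22
MAX_COMMENT_SIZE = 0xFFFF  # the comment length is a 16-bit field


def count_entries_from_tail(tail):
    """Read the total entry count from the EOCD record found in `tail`."""
    # Search backwards, because the EOCD is the last record in the archive.
    # Caveat: a comment that itself contains the signature would confuse this.
    pos = tail.rfind(EOCD_SIGNATURE)
    if pos < 0 or len(tail) - pos < EOCD_FIXED_SIZE:
        raise ValueError("no end-of-central-directory record found")
    total_entries = struct.unpack_from("<H", tail, pos + 10)[0]
    if total_entries == 0xFFFF:
        # Sentinel: the real count is stored in the Zip64 EOCD record.
        raise ValueError("Zip64 archive: read the count from the Zip64 EOCD record")
    return total_entries


# Offline demonstration with an in-memory archive instead of S3.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    for i in range(5):
        zf.writestr(f"img_{i}.png", b"x")
    zf.comment = b"trailing comment"
data = buffer.getvalue()
tail = data[-(EOCD_FIXED_SIZE + MAX_COMMENT_SIZE):]
count = count_entries_from_tail(tail)
print(count)
```

This trades the per-file listing for a single small request, which may be preferable in Lambda when the central directory of a very large archive is itself many megabytes.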
I improved the solution already given; it now also handles files larger than 4 GiB:
import boto3
import io
import struct
import zipfile

s3 = boto3.client('s3')

EOCD_RECORD_SIZE = 22
ZIP64_EOCD_RECORD_SIZE = 56
ZIP64_EOCD_LOCATOR_SIZE = 20

# largest value a 32-bit size/offset field can hold
MAX_STANDARD_ZIP_SIZE = 4_294_967_295


def lambda_handler(event, context):
    # Lambda invokes the handler with both the event and a context object
    bucket = event['bucket']
    key = event['key']
    zip_file = get_zip_file(bucket, key)
    print_zip_content(zip_file)


def get_zip_file(bucket, key):
    file_size = get_file_size(bucket, key)
    # assumes the archive has no trailing comment, so the EOCD is the last 22 bytes
    eocd_record = fetch(bucket, key, file_size - EOCD_RECORD_SIZE, EOCD_RECORD_SIZE)
    if file_size <= MAX_STANDARD_ZIP_SIZE:
        cd_start, cd_size = get_central_directory_metadata_from_eocd(eocd_record)
        central_directory = fetch(bucket, key, cd_start, cd_size)
        return zipfile.ZipFile(io.BytesIO(central_directory + eocd_record))
    else:
        # Zip64 layout at the end of the file: Zip64 EOCD record, Zip64 EOCD locator, EOCD
        zip64_eocd_record = fetch(bucket, key,
                                  file_size - (EOCD_RECORD_SIZE + ZIP64_EOCD_LOCATOR_SIZE + ZIP64_EOCD_RECORD_SIZE),
                                  ZIP64_EOCD_RECORD_SIZE)
        zip64_eocd_locator = fetch(bucket, key,
                                   file_size - (EOCD_RECORD_SIZE + ZIP64_EOCD_LOCATOR_SIZE),
                                   ZIP64_EOCD_LOCATOR_SIZE)
        cd_start, cd_size = get_central_directory_metadata_from_eocd64(zip64_eocd_record)
        central_directory = fetch(bucket, key, cd_start, cd_size)
        return zipfile.ZipFile(io.BytesIO(central_directory + zip64_eocd_record + zip64_eocd_locator + eocd_record))


def get_file_size(bucket, key):
    head_response = s3.head_object(Bucket=bucket, Key=key)
    return head_response['ContentLength']


def fetch(bucket, key, start, length):
    # HTTP byte ranges are inclusive on both ends
    end = start + length - 1
    response = s3.get_object(Bucket=bucket, Key=key, Range="bytes=%d-%d" % (start, end))
    return response['Body'].read()


def get_central_directory_metadata_from_eocd(eocd):
    cd_size = parse_little_endian_to_int(eocd[12:16])
    cd_start = parse_little_endian_to_int(eocd[16:20])
    return cd_start, cd_size


def get_central_directory_metadata_from_eocd64(eocd64):
    cd_size = parse_little_endian_to_int(eocd64[40:48])
    cd_start = parse_little_endian_to_int(eocd64[48:56])
    return cd_start, cd_size


def parse_little_endian_to_int(little_endian_bytes):
    # sizes and offsets are unsigned: "I" for 4 bytes, "Q" for 8 bytes
    format_character = "I" if len(little_endian_bytes) == 4 else "Q"
    return struct.unpack("<" + format_character, little_endian_bytes)[0]


def print_zip_content(zip_file):
    files = [zi.filename for zi in zip_file.filelist]
    print(f"{len(files)} files: {files}")
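The code above chooses the Zip64 path based on the file size. As a possible alternative, the sketch below looks for the Zip64 EOCD locator signature in the 20 bytes that precede the EOCD record, which would also catch archives under 4 GiB that use Zip64 because they contain more than 65,535 entries. The function name `has_zip64_locator` is hypothetical, and the sketch assumes `tail` holds the last 42 bytes of an archive with no trailing comment.

```python
import io
import struct
import zipfile

EOCD_SIZE = 22
ZIP64_LOCATOR_SIZE = 20
ZIP64_LOCATOR_SIGNATURE = b"PK\x06\x07"


def has_zip64_locator(tail):
    """Return True if a Zip64 EOCD locator sits directly before the EOCD record."""
    needed = EOCD_SIZE + ZIP64_LOCATOR_SIZE
    if len(tail) < needed:
        return False
    # The locator, when present, occupies the 20 bytes just before the EOCD.
    locator = tail[-needed:-EOCD_SIZE]
    return locator.startswith(ZIP64_LOCATOR_SIGNATURE)


# Offline demonstration with a small, non-Zip64 archive built in memory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("a.png", b"pixels")
plain_tail = buffer.getvalue()[-(EOCD_SIZE + ZIP64_LOCATOR_SIZE):]
is_zip64 = has_zip64_locator(plain_tail)
print(is_zip64)
```

With this check, a single 42-byte ranged request could replace the size comparison before deciding which records to fetch.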