Read ZIP files from S3 without downloading the entire file

Tags:

We have ZIP files that are 5-10GB in size. The typical ZIP file has 5-10 internal files, each 1-5 GB in size uncompressed.

I have a nice set of Python tools for reading these files. Basically, I can open a filename and if there is a ZIP file, the tools search in the ZIP file and then open the compressed file. It's all rather transparent.

I want to store these files in Amazon S3 as compressed files. I can fetch ranges of S3 files, so it should be possible to fetch the ZIP central directory (it's the end of the file, so I can just read the last 64KiB), find the component I want, download that, and stream directly to the calling process.

So my question is, how do I do that through the standard Python ZipFile API? It isn't documented how to replace the filesystem transport with an arbitrary object that supports POSIX semantics. Is this possible without rewriting the module?

887

asked Jul 15 '18 18:07

vy32

1 Answers

Here's an approach which does not need to fetch the entire file (full version available here).

It does require boto (or boto3), though (unless you can mimic the ranged GETs via AWS CLI; which I guess is quite possible as well).

import sys
import zlib
import zipfile
import io

import boto
from boto.s3.connection import OrdinaryCallingFormat


# range-fetches a S3 key
def fetch(key, start, len):
    end = start + len - 1
    return key.get_contents_as_string(headers={"Range": "bytes=%d-%d" % (start, end)})


# parses 2 or 4 little-endian bits into their corresponding integer value
def parse_int(bytes):
    val = ord(bytes[0]) + (ord(bytes[1]) << 8)
    if len(bytes) > 3:
        val += (ord(bytes[2]) << 16) + (ord(bytes[3]) << 24)
    return val


"""
bucket: name of the bucket
key:    path to zipfile inside bucket
entry:  pathname of zip entry to be retrieved (path/to/subdir/file.name)    
"""

# OrdinaryCallingFormat prevents certificate errors on bucket names with dots
# https://stackoverflow.com/questions/51604689/read-zip-files-from-amazon-s3-using-boto3-and-python#51605244
_bucket = boto.connect_s3(calling_format=OrdinaryCallingFormat()).get_bucket(bucket)
_key = _bucket.get_key(key)

# fetch the last 22 bytes (end-of-central-directory record; assuming the comment field is empty)
size = _key.size
eocd = fetch(_key, size - 22, 22)

# start offset and size of the central directory
cd_start = parse_int(eocd[16:20])
cd_size = parse_int(eocd[12:16])

# fetch central directory, append EOCD, and open as zipfile!
cd = fetch(_key, cd_start, cd_size)
zip = zipfile.ZipFile(io.BytesIO(cd + eocd))


for zi in zip.filelist:
    if zi.filename == entry:
        # local file header starting at file name length + file content
        # (so we can reliably skip file name and extra fields)

        # in our "mock" zipfile, `header_offset`s are negative (probably because the leading content is missing)
        # so we have to add to it the CD start offset (`cd_start`) to get the actual offset

        file_head = fetch(_key, cd_start + zi.header_offset + 26, 4)
        name_len = parse_int(file_head[0:2])
        extra_len = parse_int(file_head[2:4])

        content = fetch(_key, cd_start + zi.header_offset + 30 + name_len + extra_len, zi.compress_size)

        # now `content` has the file entry you were looking for!
        # you should probably decompress it in context before passing it to some other program

        if zi.compress_type == zipfile.ZIP_DEFLATED:
            print zlib.decompressobj(-15).decompress(content)
        else:
            print content
        break

In your case you might need to write the fetched content to a local file (due to large size), unless memory usage is not a concern.

101

answered Nov 02 '22 00:11

Janaka Bandara

Related questions
                            
                                Send email task with correct context
                            
                                Converting pandas.DataFrame to bytes
                            
                                asyncio queue consumer coroutine
                            
                                python dictionary datetime as key, keyError
                            
                                RabbitMQ closes connection when processing long running tasks and timeout settings produce errors
                            
                                Right way to plot live data with django and bokeh
                            
                                Adding group bar charts as subplots in plotly
                            
                                How to disable pytest plugins for single tests
                            
                                Table legend in matplotlib
                            
                                How to keep track of players' rankings?
                            
                                How can I learn how to implement a custom Python asyncio event loop?
                            
                                Why is pandas '==' different than '.eq()'
                            
                                python SyntaxError: positional argument follows keyword argument [duplicate]
                            
                                How can I build a GUI to use inside a jupyter notebook?
                            
                                matplotlib Slow 3D scatter rotation
                            
                                Selenium driver's page source different than browser
                            
                                Access flask.g inside greenlet
                            
                                Comparison operators vs “rich comparison” methods in Python
                            
                                How to remove In[ ] and Out[ ] cell tags in a Jupyterlab notebook?
                            
                                How are PyTorch's tensors implemented?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Read ZIP files from S3 without downloading the entire file

Tags:

python

amazon-s3

zipfile

boto3

boto

vy32

People also ask

1 Answers

Janaka Bandara

Recent Activity

Donate For Us