
Creating large zip files in AWS S3 in chunks

So, this question ends up being about both Python and S3.

Let's say I have an S3 Bucket with these files :

file1 --------- 2GB
file2 --------- 3GB
file3 --------- 1.9GB
file4 --------- 5GB

These files were uploaded using a presigned POST URL for S3.

What I need to do is give the client the ability to download them all in a ZIP (or similar), but I can't do it in memory or on server storage, as this is a serverless setup.

From my understanding, ideally the server needs to:

  1. Start a multipart upload job on S3 (rough boto3 calls for this are sketched after this list);
  2. Probably send a chunk to the multipart job as the header of the zip file;
  3. Download each file in the bucket chunk by chunk, in some sort of stream, so as not to overflow memory;
  4. Use that stream to then create a zip chunk and send it to the multipart job;
  5. Finish the multipart job and the zip file.
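
For reference, the multipart upload calls I mean in steps 1, 4 and 5 are roughly these (just an untested skeleton; the bucket and key names are placeholders):

import boto3

s3 = boto3.client('s3')

# 1. start the multipart upload for the final zip
mpu = s3.create_multipart_upload(Bucket='my-bucket', Key='files.zip')
upload_id = mpu['UploadId']

# 4. send each zip chunk as a part (every part except the last must be >= 5 MB)
resp = s3.upload_part(Bucket='my-bucket', Key='files.zip',
                      PartNumber=1, UploadId=upload_id, Body=b'...zip bytes...')

# 5. finish the multipart job
s3.complete_multipart_upload(
    Bucket='my-bucket', Key='files.zip', UploadId=upload_id,
    MultipartUpload={'Parts': [{'PartNumber': 1, 'ETag': resp['ETag']}]})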

Now, I honestly have no idea how to achieve this, or if it is even possible, but some questions are:

  • How do I download a file in S3 in chunks? Preferably using boto3 or botocore (my rough idea is sketched after this list)
  • How do I create a zip file in chunks while freeing memory?
  • How do I connect all of this in a multipart upload?
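
For the first point, my rough (untested) idea of a chunked download is just iterating the object's streaming body; do_something_with is a stand-in for whatever turns the bytes into a zip chunk:

import boto3

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='my-bucket', Key='file1')

# read the body in chunks instead of loading the whole object into memory
for chunk in obj['Body'].iter_chunks(chunk_size=4096):
    do_something_with(chunk)  # placeholder: e.g. feed it into the zip stream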

Edit: Now that I think about it, maybe I don't even need to put the ZIP file in S3 at all; I can just stream it directly to the client, right? That would be so much better, actually.

Here's some hypothetical code, assuming my edit above:

  # Let's assume Flask
  @app.route('/download_bucket_as_zip')
  def stream_file():
      def stream():
          # Probably needs to yield zip headers/metadata?
          for file in getFilesFromBucket():
              for chunk in file.readChunk(4000):
                  zipchunk = bytesToZipChunk(chunk)
                  yield zipchunk
      return Response(stream(), mimetype='application/zip')
asked Aug 21 '20 by Mojimi


People also ask

How do I store large files on aws S3?

Instead of using the Amazon S3 console, try uploading the file using the AWS Command Line Interface (AWS CLI) or an AWS SDK. Note: If you use the Amazon S3 console, the maximum file size for uploads is 160 GB. To upload a file that is larger than 160 GB, use the AWS CLI, AWS SDK, or Amazon S3 REST API.
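
For example, with the Python SDK a managed transfer that switches to multipart upload above a size threshold looks roughly like this (bucket and file names are placeholders):

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')
# objects larger than the threshold are uploaded as a multipart upload automatically
config = TransferConfig(multipart_threshold=100 * 1024 * 1024)
s3.upload_file('big-file.bin', 'my-bucket', 'big-file.bin', Config=config)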

What is the best way for the application to upload the large files in S3?

When you upload large files to Amazon S3, it's a best practice to leverage multipart uploads. If you're using the AWS Command Line Interface (AWS CLI), then all high-level aws s3 commands automatically perform a multipart upload when the object is large. These high-level commands include aws s3 cp and aws s3 sync.

What is the largest size file you can transfer to S3?

Individual Amazon S3 objects can range in size from a minimum of 0 bytes to a maximum of 5 TB. The largest object that can be uploaded in a single PUT is 5 GB. For objects larger than 100 MB, customers should consider using the Multipart Upload capability.

Can you zip files in S3?

Lambda code to zip files from S3: with this, you can zip files from Amazon S3 and allow your users to download multiple files, consuming less time and data, without any real disk space or memory usage.


2 Answers

Your question is extremely complex, because solving it can send you down lots of rabbit holes.

I believe that Rahul Iyer is on the right track, because IMHO it would be easier to launch a new EC2 instance, compress the files on that instance, and move them back to an S3 bucket that only serves zip files to the client.

If your files were smaller, you could use AWS CloudFront to handle the zipping when a client requests a file.

During my research I noted that other languages, such as .NET and Java, have APIs that handle streaming into zip files. I also looked at zipstream, which has been forked several times, but it's unclear how zipstream can be used to stream a file for zipping.

The code below chunks a file and writes the chunks to a zip file. The input files were close to 12 GB in total and the output file was almost 5 GB.

During testing I didn't see any major issues with memory usage or big spikes.

I added some pseudo S3 code below. I think more testing is required to understand how this code behaves on files in S3.

from io import RawIOBase
from zipfile import ZipFile
from zipfile import ZipInfo
from zipfile import ZIP_DEFLATED

# This module is needed for ZIP_DEFLATED
import zlib


class UnseekableStream(RawIOBase):
    # Write-only, unseekable stream: bytes accumulate in a buffer
    # until they are drained with get().
    def __init__(self):
        self._buffer = b''

    def writable(self):
        return True

    def write(self, b):
        if self.closed:
            raise ValueError('The stream was closed!')
        self._buffer += b
        return len(b)

    def get(self):
        chunk = self._buffer
        self._buffer = b''
        return chunk


def zipfile_generator(path, stream):
    with ZipFile(stream, mode='w') as zip_archive:
        z_info = ZipInfo.from_file(path)
        z_info.compress_type = ZIP_DEFLATED
        with open(path, 'rb') as entry, zip_archive.open(z_info, mode='w') as dest:
            # 16384 is the maximum size of an SSL/TLS record.
            for chunk in iter(lambda: entry.read(16384), b''):
                dest.write(chunk)
                yield stream.get()
    # Drain what the ZipFile wrote while closing (central directory).
    yield stream.get()


stream = UnseekableStream()
# Each of the input files was about 4 GB.
files = ['input.txt', 'input2.txt', 'input3.txt']
with open("test.zip", "wb") as f:
    for item in files:
        for i in zipfile_generator(item, stream):
            f.write(i)
            f.flush()
stream.close()
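
If you would rather stream straight to the client (as in the question's edit) instead of writing test.zip to disk, the same generator can feed a Flask response. This is only a rough sketch along the lines of the question's hypothetical route, reusing UnseekableStream and zipfile_generator from above; pulling the source files out of S3 is a separate concern (see the pseudo S3 code below):

from flask import Flask, Response

app = Flask(__name__)

@app.route('/download_bucket_as_zip')
def download_bucket_as_zip():
    def generate():
        # One stream shared across all files, drained as the zip is built.
        stream = UnseekableStream()
        for path in ['input.txt', 'input2.txt', 'input3.txt']:
            yield from zipfile_generator(path, stream)
        stream.close()

    return Response(generate(), mimetype='application/zip')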

Pseudocode S3/zip code

This code is strictly hypothetical, because it needs testing.

from io import RawIOBase
from zipfile import ZipFile
from zipfile import ZipInfo
from zipfile import ZIP_DEFLATED

import boto3

# This module is needed for ZIP_DEFLATED
import zlib

session = boto3.Session(
    aws_access_key_id='XXXXXXXXXXXXXXXXXXXXXXX',
    aws_secret_access_key='XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX',
    region_name='XXXXXXXXXX')

s3 = session.resource('s3')
bucket = s3.Bucket('bucket name')


class UnseekableStream(RawIOBase):
    def __init__(self):
        self._buffer = b''

    def writable(self):
        return True

    def write(self, b):
        if self.closed:
            raise ValueError('The stream was closed!')
        self._buffer += b
        return len(b)

    def get(self):
        chunk = self._buffer
        self._buffer = b''
        return chunk


def zipfile_generator(name, body, stream):
    # 'body' is the object's StreamingBody, so the S3 object is read in
    # chunks rather than downloaded to disk or loaded into memory.
    with ZipFile(stream, mode='w') as zip_archive:
        z_info = ZipInfo(name)
        z_info.compress_type = ZIP_DEFLATED
        with zip_archive.open(z_info, mode='w') as dest:
            for chunk in iter(lambda: body.read(16384), b''):
                dest.write(chunk)
                yield stream.get()
    yield stream.get()


stream = UnseekableStream()
with open("test.zip", "wb") as f:
    for s3_object in bucket.objects.all():
        body = s3_object.get()['Body']
        for i in zipfile_generator(s3_object.key, body, stream):
            f.write(i)
            f.flush()
stream.close()
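
Again strictly hypothetical: if you want the zip to end up back in S3 rather than on local disk, the generated zip bytes could be buffered into parts of at least 5 MB and pushed into an S3 multipart upload (the archive key 'archive.zip' and the part-buffering scheme are my assumptions, not tested):

MIN_PART_SIZE = 5 * 1024 * 1024  # every part except the last must be >= 5 MB

s3_client = session.client('s3')
mpu = s3_client.create_multipart_upload(Bucket='bucket name', Key='archive.zip')
upload_id = mpu['UploadId']

parts = []
part_number = 1
part_buffer = b''

stream = UnseekableStream()
for s3_object in bucket.objects.all():
    body = s3_object.get()['Body']
    for zip_bytes in zipfile_generator(s3_object.key, body, stream):
        part_buffer += zip_bytes
        if len(part_buffer) >= MIN_PART_SIZE:
            resp = s3_client.upload_part(
                Bucket='bucket name', Key='archive.zip',
                PartNumber=part_number, UploadId=upload_id, Body=part_buffer)
            parts.append({'PartNumber': part_number, 'ETag': resp['ETag']})
            part_number += 1
            part_buffer = b''

# Upload whatever is left as the final (possibly smaller) part.
if part_buffer:
    resp = s3_client.upload_part(
        Bucket='bucket name', Key='archive.zip',
        PartNumber=part_number, UploadId=upload_id, Body=part_buffer)
    parts.append({'PartNumber': part_number, 'ETag': resp['ETag']})

s3_client.complete_multipart_upload(
    Bucket='bucket name', Key='archive.zip',
    UploadId=upload_id, MultipartUpload={'Parts': parts})
stream.close()
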
answered Sep 21 '22 by Life is complex


What I need to do is give the client the ability to download them all in a ZIP (or similar), but I can't do it in memory or on server storage, as this is a serverless setup.

When you say serverless, if what you mean is that you would like to use Lambda to create a zip file in S3, you will run into a few limitations:

  • Lambda has a time limit on how long functions can execute.
  • As Lambda has a memory limit, you may have trouble assembling a large file in a Lambda function.
  • Lambda has a limit on the maximum size of a PUT call.

For the above reasons, I think the following approach is better:

  • When the files are required, create an EC2 instance on the fly. Perhaps your Lambda function can trigger creation of the EC2 instance (a rough boto3 sketch follows this list).
  • Copy all the files into the instance store of the machine, or even into EFS.
  • Compress the files into a zip.
  • Upload the file back to S3, or serve the file directly.
  • Kill the EC2 instance.
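
A rough sketch of the Lambda-triggered EC2 launch with boto3; the AMI ID, bucket names, and user-data script are placeholders, and the instance would need an instance profile with S3 access:

import boto3

ec2 = boto3.client('ec2')

# User data runs on boot: pull the files, zip them, push the archive back, shut down.
user_data = """#!/bin/bash
aws s3 sync s3://my-source-bucket/files /tmp/files
cd /tmp && zip -r archive.zip files
aws s3 cp /tmp/archive.zip s3://my-zip-bucket/
shutdown -h now
"""

ec2.run_instances(
    ImageId='ami-xxxxxxxx',      # placeholder AMI
    InstanceType='m5.large',
    MinCount=1,
    MaxCount=1,
    InstanceInitiatedShutdownBehavior='terminate',  # instance is gone once it shuts down
    UserData=user_data,
)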

In my opinion this would greatly simplify the code you have to write, as any code that runs on your laptop / desktop will probably work on the EC2 instance. You also won't have the time / space limitations of Lambda.

As you can get rid of the EC2 instance once the zip file is uploaded back to S3, you don't have to worry about the cost of the server always running - just spin one up when you need it, and kill it when you're done.

The code for compressing multiple files in a folder could be as simple as:

From: https://code.tutsplus.com/tutorials/compressing-and-extracting-files-in-python--cms-26816

import os
import zipfile

fantasy_zip = zipfile.ZipFile('C:\\Stories\\Fantasy\\archive.zip', 'w')

for folder, subfolders, files in os.walk('C:\\Stories\\Fantasy'):
    for file in files:
        if file.endswith('.pdf'):
            fantasy_zip.write(os.path.join(folder, file),
                              os.path.relpath(os.path.join(folder, file), 'C:\\Stories\\Fantasy'),
                              compress_type=zipfile.ZIP_DEFLATED)

fantasy_zip.close()
answered Sep 24 '22 by Rahul Iyer