
Creating large zip files in AWS S3 in chunks

So, this question ends up being about both Python and S3.

Let's say I have an S3 Bucket with these files :

file1 --------- 2GB
file2 --------- 3GB
file3 --------- 1.9GB
file4 --------- 5GB

These files were uploaded using a presigned POST URL for S3.

What I need to do is give the client the ability to download them all in a ZIP (or similar), but I can't do it in memory or on server storage, as this is a serverless setup.

From my understanding, ideally the server needs to:

  1. Start a multipart upload job on S3 (rough boto3 calls for this are sketched after this list);
  2. Probably send a chunk to the multipart job as the header of the zip file;
  3. Download each file in the bucket chunk by chunk, in some sort of stream, so as not to overflow memory;
  4. Use that stream to then create a zip chunk and send it to the multipart job;
  5. Finish the multipart job and the zip file.
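
For reference, the multipart upload calls I mean in steps 1, 4 and 5 are roughly these (just an untested skeleton; the bucket and key names are placeholders):

import boto3

s3 = boto3.client('s3')

# 1. start the multipart upload for the final zip
mpu = s3.create_multipart_upload(Bucket='my-bucket', Key='files.zip')
upload_id = mpu['UploadId']

# 4. send each zip chunk as a part (every part except the last must be >= 5 MB)
resp = s3.upload_part(Bucket='my-bucket', Key='files.zip',
                      PartNumber=1, UploadId=upload_id, Body=b'...zip bytes...')

# 5. finish the multipart job
s3.complete_multipart_upload(
    Bucket='my-bucket', Key='files.zip', UploadId=upload_id,
    MultipartUpload={'Parts': [{'PartNumber': 1, 'ETag': resp['ETag']}]})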

Now, I honestly have no idea how to achieve this, or if it is even possible, but some questions are:

  • How do I download a file in S3 in chunks? Preferably using boto3 or botocore (my rough idea is sketched after this list)
  • How do I create a zip file in chunks while freeing memory?
  • How do I connect all of this in a multipart upload?
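
For the first point, my rough (untested) idea of a chunked download is just iterating the object's streaming body; do_something_with is a stand-in for whatever turns the bytes into a zip chunk:

import boto3

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='my-bucket', Key='file1')

# read the body in chunks instead of loading the whole object into memory
for chunk in obj['Body'].iter_chunks(chunk_size=4096):
    do_something_with(chunk)  # placeholder: e.g. feed it into the zip stream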

Edit: Now that I think about it, maybe I don't even need to put the ZIP file in S3 at all; I can just stream it directly to the client, right? That would be so much better, actually.

Here's some hypothetical code, assuming my edit above:

  # Let's assume Flask
  @app.route('/download_bucket_as_zip')
  def stream_file():
      def stream():
          # Probably needs to yield zip headers/metadata?
          for file in getFilesFromBucket():
              for chunk in file.readChunk(4000):
                  zipchunk = bytesToZipChunk(chunk)
                  yield zipchunk
      return Response(stream(), mimetype='application/zip')
asked Aug 21 '20 by Mojimi


People also ask

How do I store large files on aws S3?

Instead of using the Amazon S3 console, try uploading the file using the AWS Command Line Interface (AWS CLI) or an AWS SDK. Note: If you use the Amazon S3 console, the maximum file size for uploads is 160 GB. To upload a file that is larger than 160 GB, use the AWS CLI, AWS SDK, or Amazon S3 REST API.
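
For example, with the Python SDK a managed transfer that switches to multipart upload above a size threshold looks roughly like this (bucket and file names are placeholders):

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')
# objects larger than the threshold are uploaded as a multipart upload automatically
config = TransferConfig(multipart_threshold=100 * 1024 * 1024)
s3.upload_file('big-file.bin', 'my-bucket', 'big-file.bin', Config=config)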

What is the best way for the application to upload the large files in S3?

When you upload large files to Amazon S3, it's a best practice to leverage multipart uploads. If you're using the AWS Command Line Interface (AWS CLI), then all high-level aws s3 commands automatically perform a multipart upload when the object is large. These high-level commands include aws s3 cp and aws s3 sync.

What is the largest size file you can transfer to S3?

Individual Amazon S3 objects can range in size from a minimum of 0 bytes to a maximum of 5 TB. The largest object that can be uploaded in a single PUT is 5 GB. For objects larger than 100 MB, customers should consider using the Multipart Upload capability.

Can you zip files in S3?

Lambda code to zip files from S3: with this, you can zip files from Amazon S3 and allow your users to download multiple files, consuming less time and data, without any real disk space or memory usage.


2 Answers

Your question is extremely complex, because solving it can send you down lots of rabbit holes.

I believe that Rahul Iyer is on the right track, because IMHO it would be easier to launch a new EC2 instance, compress the files on that instance, and move them back to an S3 bucket that only serves zip files to the client.

If your files were smaller, you could use AWS CloudFront to handle the zipping when a client requests a file.

During my research I noted that other languages, such as .NET and Java, have APIs that handle streaming into zip files. I also looked at zipstream, which has been forked several times, but it's unclear how zipstream can be used to stream a file for zipping.

The code below chunks a file and writes the chunks to a zip file. The input files were close to 12 GB in total and the output file was almost 5 GB.

During testing I didn't see any major issues with memory usage or big spikes.

I added some pseudo S3 code below. I think more testing is required to understand how this code behaves on files in S3.

from io import RawIOBase
from zipfile import ZipFile
from zipfile import ZipInfo
from zipfile import ZIP_DEFLATED

# This module is needed for ZIP_DEFLATED
import zlib


class UnseekableStream(RawIOBase):
    # Write-only, unseekable stream: bytes accumulate in a buffer
    # until they are drained with get().
    def __init__(self):
        self._buffer = b''

    def writable(self):
        return True

    def write(self, b):
        if self.closed:
            raise ValueError('The stream was closed!')
        self._buffer += b
        return len(b)

    def get(self):
        chunk = self._buffer
        self._buffer = b''
        return chunk


def zipfile_generator(path, stream):
    with ZipFile(stream, mode='w') as zip_archive:
        z_info = ZipInfo.from_file(path)
        z_info.compress_type = ZIP_DEFLATED
        with open(path, 'rb') as entry, zip_archive.open(z_info, mode='w') as dest:
            # 16384 is the maximum size of an SSL/TLS record.
            for chunk in iter(lambda: entry.read(16384), b''):
                dest.write(chunk)
                yield stream.get()
    # Drain what the ZipFile wrote while closing (central directory).
    yield stream.get()


stream = UnseekableStream()
# Each of the input files was about 4 GB.
files = ['input.txt', 'input2.txt', 'input3.txt']
with open("test.zip", "wb") as f:
    for item in files:
        for i in zipfile_generator(item, stream):
            f.write(i)
            f.flush()
stream.close()
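
If you would rather stream straight to the client (as in the question's edit) instead of writing test.zip to disk, the same generator can feed a Flask response. This is only a rough sketch along the lines of the question's hypothetical route, reusing UnseekableStream and zipfile_generator from above; pulling the source files out of S3 is a separate concern (see the pseudo S3 code below):

from flask import Flask, Response

app = Flask(__name__)

@app.route('/download_bucket_as_zip')
def download_bucket_as_zip():
    def generate():
        # One stream shared across all files, drained as the zip is built.
        stream = UnseekableStream()
        for path in ['input.txt', 'input2.txt', 'input3.txt']:
            yield from zipfile_generator(path, stream)
        stream.close()

    return Response(generate(), mimetype='application/zip')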

Pseudocode S3/zip code

This code is strictly hypothetical, because it needs testing.

from io import RawIOBase
from zipfile import ZipFile
from zipfile import ZipInfo
from zipfile import ZIP_DEFLATED

import boto3

# This module is needed for ZIP_DEFLATED
import zlib

session = boto3.Session(
    aws_access_key_id='XXXXXXXXXXXXXXXXXXXXXXX',
    aws_secret_access_key='XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX',
    region_name='XXXXXXXXXX')

s3 = session.resource('s3')
bucket = s3.Bucket('bucket name')


class UnseekableStream(RawIOBase):
    def __init__(self):
        self._buffer = b''

    def writable(self):
        return True

    def write(self, b):
        if self.closed:
            raise ValueError('The stream was closed!')
        self._buffer += b
        return len(b)

    def get(self):
        chunk = self._buffer
        self._buffer = b''
        return chunk


def zipfile_generator(name, body, stream):
    # 'body' is the object's StreamingBody, so the S3 object is read in
    # chunks rather than downloaded to disk or loaded into memory.
    with ZipFile(stream, mode='w') as zip_archive:
        z_info = ZipInfo(name)
        z_info.compress_type = ZIP_DEFLATED
        with zip_archive.open(z_info, mode='w') as dest:
            for chunk in iter(lambda: body.read(16384), b''):
                dest.write(chunk)
                yield stream.get()
    yield stream.get()


stream = UnseekableStream()
with open("test.zip", "wb") as f:
    for s3_object in bucket.objects.all():
        body = s3_object.get()['Body']
        for i in zipfile_generator(s3_object.key, body, stream):
            f.write(i)
            f.flush()
stream.close()
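
Again strictly hypothetical: if you want the zip to end up back in S3 rather than on local disk, the generated zip bytes could be buffered into parts of at least 5 MB and pushed into an S3 multipart upload (the archive key 'archive.zip' and the part-buffering scheme are my assumptions, not tested):

MIN_PART_SIZE = 5 * 1024 * 1024  # every part except the last must be >= 5 MB

s3_client = session.client('s3')
mpu = s3_client.create_multipart_upload(Bucket='bucket name', Key='archive.zip')
upload_id = mpu['UploadId']

parts = []
part_number = 1
part_buffer = b''

stream = UnseekableStream()
for s3_object in bucket.objects.all():
    body = s3_object.get()['Body']
    for zip_bytes in zipfile_generator(s3_object.key, body, stream):
        part_buffer += zip_bytes
        if len(part_buffer) >= MIN_PART_SIZE:
            resp = s3_client.upload_part(
                Bucket='bucket name', Key='archive.zip',
                PartNumber=part_number, UploadId=upload_id, Body=part_buffer)
            parts.append({'PartNumber': part_number, 'ETag': resp['ETag']})
            part_number += 1
            part_buffer = b''

# Upload whatever is left as the final (possibly smaller) part.
if part_buffer:
    resp = s3_client.upload_part(
        Bucket='bucket name', Key='archive.zip',
        PartNumber=part_number, UploadId=upload_id, Body=part_buffer)
    parts.append({'PartNumber': part_number, 'ETag': resp['ETag']})

s3_client.complete_multipart_upload(
    Bucket='bucket name', Key='archive.zip',
    UploadId=upload_id, MultipartUpload={'Parts': parts})
stream.close()
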
answered Sep 21 '22 by Life is complex


What I need to do is give the client the ability to download them all in a ZIP (or similar), but I can't do it in memory or on server storage, as this is a serverless setup.

When you say serverless, if what you mean is that you would like to use Lambda to create a zip file in S3, you will run into a few limitations:

  • Lambda has a time limit on how long functions can execute.
  • As Lambda has a memory limit, you may have trouble assembling a large file in a Lambda function.
  • Lambda has a limit on the maximum size of a PUT call.

For the above reasons, I think the following approach is better:

  • When the files are required, create an EC2 instance on the fly. Perhaps your Lambda function can trigger creation of the EC2 instance (a rough boto3 sketch follows this list).
  • Copy all the files into the instance store of the machine, or even into EFS.
  • Compress the files into a zip.
  • Upload the file back to S3, or serve the file directly.
  • Kill the EC2 instance.
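
A rough sketch of the Lambda-triggered EC2 launch with boto3; the AMI ID, bucket names, and user-data script are placeholders, and the instance would need an instance profile with S3 access:

import boto3

ec2 = boto3.client('ec2')

# User data runs on boot: pull the files, zip them, push the archive back, shut down.
user_data = """#!/bin/bash
aws s3 sync s3://my-source-bucket/files /tmp/files
cd /tmp && zip -r archive.zip files
aws s3 cp /tmp/archive.zip s3://my-zip-bucket/
shutdown -h now
"""

ec2.run_instances(
    ImageId='ami-xxxxxxxx',      # placeholder AMI
    InstanceType='m5.large',
    MinCount=1,
    MaxCount=1,
    InstanceInitiatedShutdownBehavior='terminate',  # instance is gone once it shuts down
    UserData=user_data,
)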

In my opinion this would greatly simplify the code you have to write, as any code that runs on your laptop / desktop will probably work on the EC2 instance. You also won't have the time / space limitations of Lambda.

As you can get rid of the EC2 instance once the zip file is uploaded back to S3, you don't have to worry about the cost of the server always running - just spin one up when you need it, and kill it when you're done.

The code for compressing multiple files in a folder could be as simple as:

From: https://code.tutsplus.com/tutorials/compressing-and-extracting-files-in-python--cms-26816

import os
import zipfile

fantasy_zip = zipfile.ZipFile('C:\\Stories\\Fantasy\\archive.zip', 'w')

for folder, subfolders, files in os.walk('C:\\Stories\\Fantasy'):
    for file in files:
        if file.endswith('.pdf'):
            fantasy_zip.write(os.path.join(folder, file),
                              os.path.relpath(os.path.join(folder, file), 'C:\\Stories\\Fantasy'),
                              compress_type=zipfile.ZIP_DEFLATED)

fantasy_zip.close()
answered Sep 24 '22 by Rahul Iyer