So, this question ends up being both about Python and S3.
Let's say I have an S3 bucket with these files:
file1 --------- 2GB
file2 --------- 3GB
file3 --------- 1.9GB
file4 --------- 5GB
These files were uploaded using a presigned POST URL for S3.
What I need to do is to give the client the ability to download them all in a ZIP (or similar), but I can't do it in memory or on server storage, as this is a serverless setup.
From my understanding, ideally the server needs to:
Now, I honestly have no idea how to achieve this, or whether it is even possible, but some questions are:
Edit: Now that I think about it, maybe I don't even need to put the ZIP file in S3; I can just stream it directly to the client, right? That would be so much better, actually.
Here's some hypothetical code assuming my edit above:
#Let's assume Flask
@app.route('/download_bucket_as_zip')
def stream_file():
    def stream():
        #Probably needs to yield zip headers/metadata?
        for file in getFilesFromBucket():
            for chunk in file.readChunk(4000):
                zipchunk = bytesToZipChunk(chunk)
                yield zipchunk
    return Response(stream(), mimetype='application/zip')
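To make the made-up helpers above a bit more concrete: I assume getFilesFromBucket() and readChunk() would boil down to something like this with boto3 (the bucket name is invented, and the zipping part is exactly the bit I don't know how to do):

import boto3

s3 = boto3.resource('s3')

def getFilesFromBucket():
    # Hypothetical stand-in for "give me every object in the bucket"
    return s3.Bucket('my-bucket').objects.all()

def readChunks(obj_summary, chunk_size=4000):
    # Hypothetical stand-in for file.readChunk(4000): boto3 returns a
    # StreamingBody that can be consumed in pieces without loading the
    # whole object into memory.
    return obj_summary.get()['Body'].iter_chunks(chunk_size)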
Your question is extremely complex, because solving it can send you down lots of rabbit holes.
I believe that Rahul Iyer is on the right track, because IMHO it would be easier to launch a new EC2 instance, compress the files on that instance, and move them back to an S3 bucket that only serves zip files to the client.
If your files were smaller, you could use AWS CloudFront to handle the zipping when a client requests a file.
During my research I did note that other languages, such as .NET and Java, have APIs that handle streaming into zip files. I also looked at zipstream, which has been forked several times. It's unclear how zipstream can be used to stream a file for zipping.
The code below will chunk a file and write the chunks to a zip file. The input files were close to 12 GB and the output file was almost 5 GB.
During testing I didn't see any major issues with memory usage or big spikes.
I did add some pseudo S3 code further down in this answer. I think more testing is required to understand how this code works on files in S3.
from io import RawIOBase
from zipfile import ZipFile
from zipfile import ZipInfo
from zipfile import ZIP_DEFLATED

# This module is needed for ZIP_DEFLATED
import zlib


class UnseekableStream(RawIOBase):
    # Write-only, unseekable in-memory buffer that ZipFile can write into.
    def __init__(self):
        self._buffer = b''

    def writable(self):
        return True

    def write(self, b):
        if self.closed:
            raise ValueError('The stream was closed!')
        self._buffer += b
        return len(b)

    def get(self):
        # Hand back whatever has accumulated so far and reset the buffer.
        chunk = self._buffer
        self._buffer = b''
        return chunk


def zipfile_generator(path, stream):
    with ZipFile(stream, mode='w') as zip_archive:
        z_info = ZipInfo.from_file(path)
        z_info.compress_type = ZIP_DEFLATED
        with open(path, 'rb') as entry, zip_archive.open(z_info, mode='w') as dest:
            for chunk in iter(lambda: entry.read(16384), b''):  # 16384 is the maximum size of an SSL/TLS record
                dest.write(chunk)
                yield stream.get()
    yield stream.get()


stream = UnseekableStream()

# each of the input files was 4 GB
files = ['input.txt', 'input2.txt', 'input3.txt']

with open("test.zip", "wb") as f:
    for item in files:
        for i in zipfile_generator(item, stream):
            f.write(i)
            f.flush()

stream.close()
The code below is strictly hypothetical, because it needs testing.
from io import RawIOBase
from zipfile import ZipFile
from zipfile import ZipInfo
from zipfile import ZIP_DEFLATED

import boto3

# This module is needed for ZIP_DEFLATED
import zlib

session = boto3.Session(
    aws_access_key_id='XXXXXXXXXXXXXXXXXXXXXXX',
    aws_secret_access_key='XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX',
    region_name='XXXXXXXXXX')

s3 = session.resource('s3')
bucket_name = s3.Bucket('bucket name')


class UnseekableStream(RawIOBase):
    def __init__(self):
        self._buffer = b''

    def writable(self):
        return True

    def write(self, b):
        if self.closed:
            raise ValueError('The stream was closed!')
        self._buffer += b
        return len(b)

    def get(self):
        chunk = self._buffer
        self._buffer = b''
        return chunk


def zipfile_generator(key, body, stream):
    # Same idea as above, but the zip entry is built from the S3 key and the
    # bytes come from the object's StreamingBody instead of a local file.
    with ZipFile(stream, mode='w') as zip_archive:
        z_info = ZipInfo(key)
        z_info.compress_type = ZIP_DEFLATED
        with zip_archive.open(z_info, mode='w') as dest:
            for chunk in body.iter_chunks(16384):
                dest.write(chunk)
                yield stream.get()
    yield stream.get()


stream = UnseekableStream()

with open("test.zip", "wb") as f:
    for obj_summary in bucket_name.objects.all():
        body = obj_summary.get()['Body']
        for i in zipfile_generator(obj_summary.key, body, stream):
            f.write(i)
            f.flush()

stream.close()
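If writing a local test.zip is not an option (the question is about a serverless setup with no usable disk), the same building blocks could in principle feed a Flask response directly instead of a file. The sketch below is just as untested; it reuses UnseekableStream and the boto3 bucket_name from above, and the route name is made up:

from flask import Flask, Response
from zipfile import ZipFile, ZipInfo, ZIP_DEFLATED

app = Flask(__name__)

@app.route('/download_bucket_as_zip')
def download_bucket_as_zip():
    def generate():
        stream = UnseekableStream()
        # One ZipFile for the whole response, so the archive ends up with a
        # single central directory covering every object.
        with ZipFile(stream, mode='w') as zip_archive:
            for obj_summary in bucket_name.objects.all():
                z_info = ZipInfo(obj_summary.key)
                z_info.compress_type = ZIP_DEFLATED
                body = obj_summary.get()['Body']
                with zip_archive.open(z_info, mode='w') as dest:
                    for chunk in body.iter_chunks(16384):
                        dest.write(chunk)
                        yield stream.get()
        yield stream.get()
        stream.close()
    return Response(generate(), mimetype='application/zip')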
What I need to do is to give the client the ability to download them all in a ZIP (or similar), but I can't do it in memory or on server storage, as this is a serverless setup.
When you say serverless, if what you mean is that you would like to use Lambda to create a zip file in S3, you will run into a few limitations: Lambda caps how long a function can run and gives it only a limited amount of memory and temporary disk space, which is a poor fit for zipping several multi-gigabyte objects.
For the above reasons, I think a better approach is to spin up an EC2 instance when a zip is needed, download the files onto it, compress them there, upload the resulting zip back to an S3 bucket that serves the zip files, and then get rid of the instance.
In my opinion this would greatly simplify the code you have to write, as any code that runs on your laptop/desktop will probably work on the EC2 instance. You also won't have the time and space limitations of Lambda.
As you can get rid of the EC2 instance once the zip file is uploaded back to S3, you don't have to worry about the cost of the server always running - just spin one up when you need it, and kill it when you're done.
The code for compressing multiple files in a folder could be as simple as:
From: https://code.tutsplus.com/tutorials/compressing-and-extracting-files-in-python--cms-26816
import os
import zipfile

fantasy_zip = zipfile.ZipFile('C:\\Stories\\Fantasy\\archive.zip', 'w')

for folder, subfolders, files in os.walk('C:\\Stories\\Fantasy'):
    for file in files:
        if file.endswith('.pdf'):
            fantasy_zip.write(os.path.join(folder, file),
                              os.path.relpath(os.path.join(folder, file), 'C:\\Stories\\Fantasy'),
                              compress_type=zipfile.ZIP_DEFLATED)

fantasy_zip.close()
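The missing S3 halves around that snippet (pulling the objects down onto the instance first, and pushing the finished zip back afterwards) could look roughly like this with boto3; the bucket names and scratch path below are made up:

import os
import boto3

s3 = boto3.resource('s3')
source_bucket = s3.Bucket('source-bucket')   # made-up bucket names
zip_bucket_name = 'zip-bucket'
scratch = '/mnt/scratch'

# Pull every object onto the instance's disk before compressing.
for obj in source_bucket.objects.all():
    if obj.key.endswith('/'):
        continue                              # skip folder placeholder keys
    local_path = os.path.join(scratch, obj.key)
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    source_bucket.download_file(obj.key, local_path)

# ... compress the scratch directory into archive.zip as shown above ...

# Push the finished archive back; upload_file switches to multipart uploads
# automatically for large files.
s3.meta.client.upload_file(os.path.join(scratch, 'archive.zip'),
                           zip_bucket_name, 'archive.zip')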