 

Stream and zip to S3 from AWS Lambda Node.JS

My goal is to create a large gzipped text file and put it into S3.

The file contents consist of blocks which I read in a loop from another source.

Because of the size of this file I cannot hold all the data in memory, so I need to somehow stream it directly to S3 and zip (gzip) it at the same time.

I understand how to perform this trick with the regular fs module in Node.js, but I am confused about whether it is possible to do the same with S3 from AWS Lambda. I know that s3.putObject can consume a stream object, but it seems to me that such a stream has to be finalized before I perform the putObject operation, which could cause the allowed memory to be exceeded.

Asked Oct 18 '17 by Andremoniy


People also ask

How do I extract a zip file in an Amazon S3 by using Lambda?

If you head to the Properties tab of your S3 bucket, you can set up an Event Notification for all object “create” events (or just PutObject events). As the destination, you can select the Lambda function where you will write your code to unzip and gzip files.
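For illustration, a minimal handler for such a notification might look like the sketch below (names are placeholders and the actual processing step is left out):

    // Minimal sketch of a Lambda handler triggered by an S3 "create" event notification.
    const AWS = require('aws-sdk');
    const s3 = new AWS.S3();

    exports.handler = async (event) => {
      for (const record of event.Records) {
        const bucket = record.s3.bucket.name;
        // Object keys in event notifications are URL-encoded (spaces arrive as '+').
        const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));

        // Read the newly created object as a stream and process it here
        // (unzip, gzip, transform, ...).
        const objectStream = s3.getObject({ Bucket: bucket, Key: key }).createReadStream();
      }
    };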

Can I upload zip file to S3 and unzip?

Here are the steps that I carried out: upload a zip file (in my case it was a zipped application folder) to an S3 bucket (the source bucket). Uploading the file triggers a Lambda function, which extracts all the files and folders inside the ZIP file and uploads them into a new S3 bucket (the target bucket).
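A sketch of that flow, assuming the third-party unzipper package is bundled with the function and using a placeholder target bucket name:

    // Sketch: extract a zip object from the source bucket and upload each entry
    // to a target bucket. 'unzipper' is an assumed third-party dependency.
    const AWS = require('aws-sdk');
    const unzipper = require('unzipper');
    const s3 = new AWS.S3();

    exports.handler = async (event) => {
      const record = event.Records[0];
      const sourceBucket = record.s3.bucket.name;
      const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));

      const uploads = [];
      const zipStream = s3.getObject({ Bucket: sourceBucket, Key: key }).createReadStream();

      // unzipper.Parse() emits one 'entry' per file/folder inside the archive.
      await zipStream
        .pipe(unzipper.Parse())
        .on('entry', (entry) => {
          if (entry.type === 'File') {
            // s3.upload accepts a readable stream as Body, so each entry is
            // streamed straight into the target bucket.
            uploads.push(
              s3.upload({ Bucket: 'target-bucket', Key: entry.path, Body: entry }).promise()
            );
          } else {
            entry.autodrain();
          }
        })
        .promise();

      await Promise.all(uploads);
    };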

Can I upload zip file to AWS S3?

You can upload any file type (images, backups, data, movies, etc.) into an S3 bucket. The maximum size of a file that you can upload by using the Amazon S3 console is 160 GB.


1 Answer

You can stream files into S3 buckets in chunks (each at least 5 MB, except the last) using the multipart upload functions in the Node.js aws-sdk.

This is not only useful for streaming large files into buckets, but also enables you to retry failed chunks (instead of a whole file) and parallelize the upload of individual chunks (with multiple upload Lambdas, which could be useful in a serverless ETL setup, for example). The order in which the chunks arrive is not important as long as you track them and finalize the process once all of them have been uploaded.

To use the multipart upload, you should:

  1. initialize the process using createMultipartUpload and store the returned UploadId (you'll need it for chunk uploads)
  2. implement a Transform stream that would process data coming from the input stream
  3. implement a PassThrough stream which would buffer the data in large enough chunks before using uploadPart to push them to S3 (under the UploadId returned in step 1)
  4. track the returned ETags and PartNumbers from chunk uploads
  5. use the tracked ETags and PartNumbers to assemble/finalize the file on S3 using completeMultipartUpload

Here's the gist of it in a working code example that streams a file from iso.org, pipes it through gzip, and uploads it to an S3 bucket. Don't forget to change the bucket name, and make sure to run the Lambda with 512 MB of memory on Node 6.10. You can use the code directly in the web GUI since there are no external dependencies.
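In outline, a sketch along those lines (the source URL, bucket, and key are placeholders, and it targets a current Node.js runtime with async/await rather than Node 6.10) could look like this:

    // Sketch: stream an HTTP download through gzip and into S3 via multipart upload.
    // Uses only aws-sdk v2 (bundled in the Lambda runtime) and core modules.
    const https = require('https');
    const zlib = require('zlib');
    const { PassThrough } = require('stream');
    const AWS = require('aws-sdk');

    const s3 = new AWS.S3();
    const Bucket = 'your-bucket-name';              // placeholder
    const Key = 'big-file.txt.gz';                  // placeholder
    const SOURCE_URL = 'https://www.iso.org/';      // placeholder source

    // S3 requires every part except the last one to be at least 5 MB.
    const PART_SIZE = 5 * 1024 * 1024;

    exports.handler = async () => {
      // 1. Initialize the multipart upload and keep the returned UploadId.
      const { UploadId } = await s3.createMultipartUpload({ Bucket, Key }).promise();

      const gzip = zlib.createGzip();
      const buffered = new PassThrough();

      const parts = [];       // { ETag, PartNumber } of every uploaded chunk
      let partNumber = 1;
      let chunks = [];
      let bytes = 0;

      const pushPart = async (body) => {
        const currentPart = partNumber++;
        // 3./4. Push the buffered chunk to S3 and remember its ETag/PartNumber.
        const { ETag } = await s3.uploadPart({
          Bucket, Key, UploadId,
          PartNumber: currentPart,
          Body: body,
        }).promise();
        parts.push({ ETag, PartNumber: currentPart });
      };

      const done = new Promise((resolve, reject) => {
        buffered.on('data', (chunk) => {
          chunks.push(chunk);
          bytes += chunk.length;
          if (bytes >= PART_SIZE) {
            // Pause the pipeline while this part is in flight (no proper backpressure).
            buffered.pause();
            const body = Buffer.concat(chunks);
            chunks = [];
            bytes = 0;
            pushPart(body).then(() => buffered.resume()).catch(reject);
          }
        });

        buffered.on('end', () => {
          // Flush whatever is left as the final (possibly small) part, then
          // 5. assemble the object on S3 from the tracked parts.
          const flush = bytes > 0 ? pushPart(Buffer.concat(chunks)) : Promise.resolve();
          flush
            .then(() => s3.completeMultipartUpload({
              Bucket, Key, UploadId,
              MultipartUpload: { Parts: parts },
            }).promise())
            .then(resolve)
            .catch(reject);
        });

        buffered.on('error', reject);
      });

      // 2./3. Gzip the incoming data and buffer it into large enough chunks.
      https.get(SOURCE_URL, (res) => {
        res.pipe(gzip).pipe(buffered);
      }).on('error', (err) => buffered.destroy(err));

      return done;
    };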

NOTE: This is just a proof of concept that I put together for demonstration purposes. There is no retry logic for failed chunk uploads, and error handling is almost non-existent, which can literally cost you (e.g. abortMultipartUpload should be called when the whole process is cancelled, to clean up the uploaded chunks, since they remain stored and invisible on S3 even though the final file was never assembled). The input stream is simply paused instead of queuing upload jobs and making use of stream backpressure mechanisms, etc.
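For the abortMultipartUpload point specifically, a minimal cleanup wrapper (not part of the original example; names are placeholders) could look like this:

    const AWS = require('aws-sdk');
    const s3 = new AWS.S3();

    // Run the part uploads inside a try/catch and abort on failure, so the
    // already-uploaded (and billed) chunks are removed from the bucket.
    async function uploadWithCleanup(Bucket, Key, runUpload) {
      const { UploadId } = await s3.createMultipartUpload({ Bucket, Key }).promise();
      try {
        return await runUpload(UploadId);   // uploadPart / completeMultipartUpload go here
      } catch (err) {
        await s3.abortMultipartUpload({ Bucket, Key, UploadId }).promise();
        throw err;
      }
    }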

Answered Oct 13 '22 by Unglückspilz