 

Stream and zip to S3 from AWS Lambda Node.JS

My goal is to create a large gzipped text file and put it into S3.

The file contents consist of blocks which I read in a loop from another source.

Because of the size of this file I cannot hold all the data in memory, so I need to somehow stream it directly to S3 and zip (gzip) it at the same time.

I understand how to perform this trick with the regular fs module in Node.js, but I am confused about whether it is possible to do the same with S3 from AWS Lambda. I know that s3.putObject can consume a stream object, but it seems to me that such a stream has to be finalized before I perform the putObject operation, which could cause the allowed memory to be exceeded.

Asked Oct 18 '17 by Andremoniy


People also ask

How do I extract a zip file in an Amazon S3 by using Lambda?

If you head to the Properties tab of your S3 bucket, you can set up an Event Notification for all object “create” events (or just PutObject events). As the destination, you can select the Lambda function where you will write your code to unzip and gzip files.
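For illustration, a minimal handler for such a notification might look like the sketch below (names are placeholders and the actual processing step is left out):

    // Minimal sketch of a Lambda handler triggered by an S3 "create" event notification.
    const AWS = require('aws-sdk');
    const s3 = new AWS.S3();

    exports.handler = async (event) => {
      for (const record of event.Records) {
        const bucket = record.s3.bucket.name;
        // Object keys in event notifications are URL-encoded (spaces arrive as '+').
        const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));

        // Read the newly created object as a stream and process it here
        // (unzip, gzip, transform, ...).
        const objectStream = s3.getObject({ Bucket: bucket, Key: key }).createReadStream();
      }
    };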

Can I upload zip file to S3 and unzip?

Here are the steps that I carried out: upload a zip file (in my case it was a zipped application folder) to an S3 bucket (the source bucket). Uploading the file triggers a Lambda function, which extracts all the files and folders inside the ZIP file and uploads them into a new S3 bucket (the target bucket).
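A sketch of that flow, assuming the third-party unzipper package is bundled with the function and using a placeholder target bucket name:

    // Sketch: extract a zip object from the source bucket and upload each entry
    // to a target bucket. 'unzipper' is an assumed third-party dependency.
    const AWS = require('aws-sdk');
    const unzipper = require('unzipper');
    const s3 = new AWS.S3();

    exports.handler = async (event) => {
      const record = event.Records[0];
      const sourceBucket = record.s3.bucket.name;
      const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));

      const uploads = [];
      const zipStream = s3.getObject({ Bucket: sourceBucket, Key: key }).createReadStream();

      // unzipper.Parse() emits one 'entry' per file/folder inside the archive.
      await zipStream
        .pipe(unzipper.Parse())
        .on('entry', (entry) => {
          if (entry.type === 'File') {
            // s3.upload accepts a readable stream as Body, so each entry is
            // streamed straight into the target bucket.
            uploads.push(
              s3.upload({ Bucket: 'target-bucket', Key: entry.path, Body: entry }).promise()
            );
          } else {
            entry.autodrain();
          }
        })
        .promise();

      await Promise.all(uploads);
    };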

Can I upload zip file to AWS S3?

You can upload any file type (images, backups, data, movies, etc.) into an S3 bucket. The maximum size of a file that you can upload by using the Amazon S3 console is 160 GB.


1 Answer

You can stream files into S3 buckets in chunks (each at least 5 MB, except the last) using the multipart upload functions in the Node.js aws-sdk.

This is not only useful for streaming large files into buckets, but also enables you to retry failed chunks (instead of a whole file) and parallelize the upload of individual chunks (with multiple upload Lambdas, which could be useful in a serverless ETL setup, for example). The order in which the chunks arrive is not important as long as you track them and finalize the process once all of them have been uploaded.

To use the multipart upload, you should:

  1. initialize the process using createMultipartUpload and store the returned UploadId (you'll need it for chunk uploads)
  2. implement a Transform stream that would process data coming from the input stream
  3. implement a PassThrough stream which would buffer the data in large enough chunks before using uploadPart to push them to S3 (under the UploadId returned in step 1)
  4. track the returned ETags and PartNumbers from chunk uploads
  5. use the tracked ETags and PartNumbers to assemble/finalize the file on S3 using completeMultipartUpload

Here's the gist of it in a working code example that streams a file from iso.org, pipes it through gzip, and uploads it to an S3 bucket. Don't forget to change the bucket name, and make sure to run the Lambda with 512 MB of memory on Node 6.10. You can use the code directly in the web GUI since there are no external dependencies.
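In outline, a sketch along those lines (the source URL, bucket, and key are placeholders, and it targets a current Node.js runtime with async/await rather than Node 6.10) could look like this:

    // Sketch: stream an HTTP download through gzip and into S3 via multipart upload.
    // Uses only aws-sdk v2 (bundled in the Lambda runtime) and core modules.
    const https = require('https');
    const zlib = require('zlib');
    const { PassThrough } = require('stream');
    const AWS = require('aws-sdk');

    const s3 = new AWS.S3();
    const Bucket = 'your-bucket-name';              // placeholder
    const Key = 'big-file.txt.gz';                  // placeholder
    const SOURCE_URL = 'https://www.iso.org/';      // placeholder source

    // S3 requires every part except the last one to be at least 5 MB.
    const PART_SIZE = 5 * 1024 * 1024;

    exports.handler = async () => {
      // 1. Initialize the multipart upload and keep the returned UploadId.
      const { UploadId } = await s3.createMultipartUpload({ Bucket, Key }).promise();

      const gzip = zlib.createGzip();
      const buffered = new PassThrough();

      const parts = [];       // { ETag, PartNumber } of every uploaded chunk
      let partNumber = 1;
      let chunks = [];
      let bytes = 0;

      const pushPart = async (body) => {
        const currentPart = partNumber++;
        // 3./4. Push the buffered chunk to S3 and remember its ETag/PartNumber.
        const { ETag } = await s3.uploadPart({
          Bucket, Key, UploadId,
          PartNumber: currentPart,
          Body: body,
        }).promise();
        parts.push({ ETag, PartNumber: currentPart });
      };

      const done = new Promise((resolve, reject) => {
        buffered.on('data', (chunk) => {
          chunks.push(chunk);
          bytes += chunk.length;
          if (bytes >= PART_SIZE) {
            // Pause the pipeline while this part is in flight (no proper backpressure).
            buffered.pause();
            const body = Buffer.concat(chunks);
            chunks = [];
            bytes = 0;
            pushPart(body).then(() => buffered.resume()).catch(reject);
          }
        });

        buffered.on('end', () => {
          // Flush whatever is left as the final (possibly small) part, then
          // 5. assemble the object on S3 from the tracked parts.
          const flush = bytes > 0 ? pushPart(Buffer.concat(chunks)) : Promise.resolve();
          flush
            .then(() => s3.completeMultipartUpload({
              Bucket, Key, UploadId,
              MultipartUpload: { Parts: parts },
            }).promise())
            .then(resolve)
            .catch(reject);
        });

        buffered.on('error', reject);
      });

      // 2./3. Gzip the incoming data and buffer it into large enough chunks.
      https.get(SOURCE_URL, (res) => {
        res.pipe(gzip).pipe(buffered);
      }).on('error', (err) => buffered.destroy(err));

      return done;
    };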

NOTE: This is just a proof of concept that I put together for demonstration purposes. There is no retry logic for failed chunk uploads, and error handling is almost non-existent, which can literally cost you (e.g. abortMultipartUpload should be called when the whole process is cancelled, to clean up the uploaded chunks, since they remain stored and invisible on S3 even though the final file was never assembled). The input stream is simply paused instead of queuing upload jobs and making use of stream backpressure mechanisms, etc.
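For the abortMultipartUpload point specifically, a minimal cleanup wrapper (not part of the original example; names are placeholders) could look like this:

    const AWS = require('aws-sdk');
    const s3 = new AWS.S3();

    // Run the part uploads inside a try/catch and abort on failure, so the
    // already-uploaded (and billed) chunks are removed from the bucket.
    async function uploadWithCleanup(Bucket, Key, runUpload) {
      const { UploadId } = await s3.createMultipartUpload({ Bucket, Key }).promise();
      try {
        return await runUpload(UploadId);   // uploadPart / completeMultipartUpload go here
      } catch (err) {
        await s3.abortMultipartUpload({ Bucket, Key, UploadId }).promise();
        throw err;
      }
    }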

Answered Oct 13 '22 by Unglückspilz