My company is uploading large archive files to S3, and now wants them to be unzipped on S3. I wrote a lambda function based on unzip, triggered by arrival of a file to the xxx-zip bucket, which streams the zip file from S3, unzips the stream, and then streams the individual files to the xxx-data bucket.
It works, but it is much slower than I expected: even on a small test archive (zip size about 500 KB, containing around 500 files) it times out with a 60-second timeout set. Does that seem right? Running it locally with node is faster than this. Since the files are being moved inside Amazon's cloud, latency should be short, and since the files are streamed, the actual time taken should be roughly the time it takes to unzip the stream.
Is there an inherent reason why this won't work, or is there something in my code that is causing it to be so slow? This is the first time I've worked with node.js, so I could be doing something badly. Or is there a better way to do this that I couldn't find with Google?
Here is an outline of the code (BufferStream is a class I wrote that wraps the Buffer returned by s3.getObject() into a readStream; a rough sketch of it is included after the handler):
var aws = require('aws-sdk');
var s3 = new aws.S3({apiVersion: '2006-03-01'});
var unzip = require('unzip');
var stream = require('stream');
var util = require('util');
var fs = require('fs');

exports.handler = function(event, context) {
    var zipfile = event.Records[0].s3.object.key;
    s3.getObject({Bucket: SOURCE_BUCKET, Key: zipfile},
        function(err, data) {
            var errors = 0;
            var total = 0;
            var successful = 0;
            var active = 0;
            if (err) {
                console.log('error: ' + err);
            }
            else {
                console.log('Received zip file ' + zipfile);
                new BufferStream(data.Body)
                    .pipe(unzip.Parse()).on('entry', function(entry) {
                        total++;
                        var filename = entry.path;
                        var in_process = ' (' + ++active + ' in process)';
                        console.log('extracting ' + entry.type + ' ' + filename + in_process);
                        s3.upload({Bucket: DEST_BUCKET, Key: filename, Body: entry}, {},
                            function(err, data) {
                                var remaining = ' (' + --active + ' remaining)';
                                if (err) {
                                    // if for any reason the file is not read discard it
                                    errors++;
                                    console.log('Error pushing ' + filename + ' to S3' + remaining + ': ' + err);
                                    entry.autodrain();
                                }
                                else {
                                    successful++;
                                    console.log('successfully wrote ' + filename + ' to S3' + remaining);
                                }
                            });
                    });
                console.log('Completed, ' + total + ' files processed, ' + successful + ' written to S3, ' + errors + ' failed');
                context.done(null, '');
            }
        });
};
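For reference, a minimal wrapper along those lines could look roughly like this (a sketch only, not necessarily the exact class):

var stream = require('stream');
var util = require('util');

// A Readable stream that emits a single Buffer and then ends.
function BufferStream(buffer) {
    stream.Readable.call(this);
    this._buffer = buffer;
    this._sent = false;
}
util.inherits(BufferStream, stream.Readable);

BufferStream.prototype._read = function() {
    if (!this._sent) {
        this._sent = true;
        this.push(this._buffer);   // push the whole buffer as one chunk
    } else {
        this.push(null);           // signal end of stream
    }
};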
I suspect that the unzip module you are using is a pure-JavaScript zip implementation, which is very slow.
I recommend compressing the files with gzip instead and decompressing them with Node's built-in zlib library, which is a compiled C binding and should give much better performance.
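For example, if each upload were a single gzip-compressed object (.gz) rather than a zip archive, the handler could stream it through zlib roughly like this (a sketch; SOURCE_BUCKET and DEST_BUCKET are placeholders as in the question, and since gzip compresses a single stream, this assumes one file per object):

var aws = require('aws-sdk');
var zlib = require('zlib');
var s3 = new aws.S3({apiVersion: '2006-03-01'});

exports.handler = function(event, context) {
    var key = event.Records[0].s3.object.key;

    // Stream the compressed object and pipe it through the native gunzip.
    var decompressed = s3.getObject({Bucket: SOURCE_BUCKET, Key: key})
        .createReadStream()
        .pipe(zlib.createGunzip());

    // Upload the decompressed stream to the destination bucket.
    s3.upload({Bucket: DEST_BUCKET, Key: key.replace(/\.gz$/, ''), Body: decompressed},
        function(err) {
            if (err) {
                context.done(err);
            } else {
                context.done(null, 'decompressed ' + key);
            }
        });
};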
If you choose to stick with zip, you could contact Amazon support and ask to have the 60-second limit on your Lambda function increased.