I'm working at a company that processes very large CSV files. Clients upload the file to Amazon S3 via filepicker. Then multiple server processes can read the file in parallel (i.e. starting from different points) to process it and store it in a database. Optionally the clients may zip the file before uploading.
If I am correct that a zipped file cannot be read in parallel this way, then I want a way to take the ZIP file on S3 and produce an unzipped CSV, also on S3.
I can write code to download, unzip, and multipart upload the file back to S3, but I was hoping for an efficient, easily scalable solution. AWS Lambda would have been ideal for running the code (to avoid provisioning unneeded resources) but execution time is limited to 60 seconds. Plus the use case seems so simple and generic I expect to find an existing solution.
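For reference, the "download, unzip, multipart upload" step could look roughly like the sketch below. It is only an illustration under some assumptions: the archive is already available as a local path or seekable file object, `s3` is an object with the boto3 S3 client multipart methods (for example `boto3.client("s3")`), and the names `unzip_member_to_s3`, `_read_up_to` and `PART_SIZE` are made up for this example. It streams one archive member in fixed-size chunks, so the uncompressed CSV never has to fit in memory.

```python
import zipfile

# Assumed default; S3 requires every part except the last to be at least 5 MiB.
PART_SIZE = 8 * 1024 * 1024


def _read_up_to(stream, size):
    """Read until `size` bytes are collected or the stream is exhausted."""
    buf = bytearray()
    while len(buf) < size:
        chunk = stream.read(size - len(buf))
        if not chunk:
            break
        buf += chunk
    return bytes(buf)


def unzip_member_to_s3(zip_source, member, s3, bucket, key, part_size=PART_SIZE):
    """Stream one member of a ZIP archive into an S3 multipart upload.

    zip_source: path or seekable binary file object holding the archive.
    s3: object exposing the boto3 S3 client multipart methods.
    Returns the number of parts uploaded.
    """
    upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]
    parts = []
    try:
        with zipfile.ZipFile(zip_source) as archive, archive.open(member) as src:
            while True:
                body = _read_up_to(src, part_size)
                if not body:
                    break
                number = len(parts) + 1
                resp = s3.upload_part(
                    Bucket=bucket, Key=key, PartNumber=number,
                    UploadId=upload_id, Body=body,
                )
                parts.append({"ETag": resp["ETag"], "PartNumber": number})
        if not parts:
            # This sketch does not handle empty members.
            raise ValueError(f"{member!r} is empty")
        s3.complete_multipart_upload(
            Bucket=bucket, Key=key, UploadId=upload_id,
            MultipartUpload={"Parts": parts},
        )
    except Exception:
        # Abort so that S3 does not keep the orphaned parts around.
        s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
        raise
    return len(parts)
```

Because `zipfile` needs a seekable source, the archive still has to be downloaded (or otherwise made seekable) first; only the upload side is streamed here.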
On macOS you can ordinarily just double-click a zip file and the system will unzip it. If that does not work, you may need to install the free application The Unarchiver from the Mac App Store and give it a try.
If any single file in your zip file is over 4GB, then a 64-bit unarchiving program is required to open the .zip file, otherwise you will get a loop and be unable to extract the files.
Individual Amazon S3 objects can range in size from a minimum of 0 bytes to a maximum of 5 TB. The largest object that can be uploaded in a single PUT is 5 GB. For objects larger than 100 MB, customers should consider using the Multipart Upload capability.
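To make the multipart numbers concrete, here is a small sketch of picking a part size for a given object size. The limits in it (at most 10,000 parts per upload, parts between 5 MiB and 5 GiB) are my reading of the S3 multipart documentation rather than something stated above, and `choose_part_size`, `part_count` and the 64 MiB default are made-up names and values for illustration.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

# Assumed S3 multipart limits; check the current S3 documentation.
MIN_PART = 5 * MIB
MAX_PART = 5 * GIB
MAX_PARTS = 10_000


def part_count(total_bytes, part_size):
    """Number of parts needed to cover total_bytes (ceiling division)."""
    return -(-total_bytes // part_size)


def choose_part_size(total_bytes, preferred=64 * MIB):
    """Pick a part size that keeps the upload within MAX_PARTS parts."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # Smallest part size that still fits the object into MAX_PARTS parts.
    needed = part_count(total_bytes, MAX_PARTS)
    size = max(MIN_PART, preferred, needed)
    if size > MAX_PART:
        raise ValueError("object too large for a single multipart upload")
    return size
```

With these assumed limits, a 1 GiB file uses the preferred 64 MiB parts (16 of them), while a multi-terabyte file is automatically pushed to larger parts so it stays under 10,000.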
@E.J. Brennan is right. I had a chat with AWS Support, and they told me we cannot use Lambda for this operation. This is the guidance I got from Support:
Whenever a file is dropped in S3, trigger a notification to SQS.
Have EC2 listen to SQS.
Do the unzip on the EC2 instance.
Add another notification to SQS, and the next Lambda function can do the further processing.
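As a rough illustration of the "EC2 listens to SQS" step, the sketch below pulls the bucket and key of each uploaded zip out of an SQS message body. It assumes the queue receives S3 event notifications directly in the standard `Records` JSON layout, where object keys are URL-encoded; `zip_objects_from_message` is a made-up helper name.

```python
import json
from urllib.parse import unquote_plus


def zip_objects_from_message(body):
    """Return (bucket, key) pairs for .zip objects named in an S3 event.

    body: the SQS message body, assumed to be an S3 event notification.
    """
    event = json.loads(body)
    found = []
    # Messages without "Records" (such as S3's test event) yield nothing.
    for record in event.get("Records", []):
        s3_info = record.get("s3")
        if not s3_info:
            continue
        # Keys arrive URL-encoded, with spaces sent as "+".
        key = unquote_plus(s3_info["object"]["key"])
        if key.lower().endswith(".zip"):
            found.append((s3_info["bucket"]["name"], key))
    return found
```

Filtering on the `.zip` suffix also keeps the worker from reacting to the unzipped CSV if it is written back to the same bucket.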
I wasted a lot of time on this issue, so I hope this helps someone.
Solution/workaround:
After a long struggle, I got a solution from my tech lead: we can use AWS Glue for this. It has more memory available and gets the job done.
Your best bet is probably to have an S3 event notification sent to an SQS queue every time a zip file is uploaded to S3, and have one or more EC2 instances polling the queue, waiting for files to unzip.
You may only need one running instance to do this, but you could also have an autoscaling policy that spins up more instances if the SQS queue grows too big for a single instance to do the unzipping fast enough (as defined by you).
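The polling side on each instance could be sketched like this. It assumes `sqs` behaves like a boto3 SQS client (for example `boto3.client("sqs")`) and that `handle` is your own function that unzips the file named in the message; `poll_queue` and `max_polls` are hypothetical names, with `max_polls=None` meaning "run forever".

```python
def poll_queue(sqs, queue_url, handle, max_polls=None):
    """Long-poll an SQS queue and pass each message body to `handle`.

    A message is deleted only after `handle` returns without raising.
    Returns the number of messages handled successfully.
    """
    processed = 0
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        resp = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,  # SQS returns at most 10 per call
            WaitTimeSeconds=20,      # long polling avoids a busy loop
        )
        for msg in resp.get("Messages", []):
            try:
                handle(msg["Body"])
            except Exception:
                # Leave the message on the queue; it becomes visible
                # again once its visibility timeout expires.
                continue
            sqs.delete_message(
                QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"]
            )
            processed += 1
    return processed
```

Since failed messages are left on the queue, the visibility timeout should be longer than the slowest expected unzip, otherwise a second instance may pick up a file that is still being processed.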