Unzip a large ZIP file on Amazon S3 [closed]

I'm working at a company that processes very large CSV files. Clients upload the file to Amazon S3 via filepicker. Then multiple server processes can read the file in parallel (i.e. starting from different points) to process it and store it in a database. Optionally the clients may zip the file before uploading.

  1. Am I correct that the ZIP format does not allow decompression of a single file in parallel? That is, there is no way to have multiple processes read the ZIP file from different offsets (maybe with some overlap between blocks) and stream uncompressed data from there?

If I am correct, then I want a way to take the ZIP file on S3 and produce an unzipped CSV, also on S3.

  1. Does Amazon provide any services that can perform this task simply? I was hoping that Data Pipeline could do the job, but it seems to have limitations. For example "CopyActivity does not support copying multipart Amazon S3 files" (source) seems to suggest that I can't unzip anything larger than 5GB using that. My understanding of Data Pipeline is very limited so I don't know how suitable it is for this task or where I would look.
  2. Is there any SaaS that does the job? Edit: someone answered this question with their own product https://zipit.run/, which I think was a good answer, but it was downvoted so they deleted it.

I can write code to download, unzip, and multipart upload the file back to S3, but I was hoping for an efficient, easily scalable solution. AWS Lambda would have been ideal for running the code (to avoid provisioning unneeded resources) but execution time is limited to 60 seconds. Plus the use case seems so simple and generic I expect to find an existing solution.
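For reference, the "download, unzip, and multipart upload" route is only a few lines with boto3. A minimal sketch, where the bucket and key names are hypothetical and each archive is assumed to contain a single CSV:

```python
import tempfile
import zipfile

import boto3

s3 = boto3.client("s3")

def unzip_on_s3(bucket, zip_key, csv_key):
    # Spool the archive to a temporary file: it may not fit in memory,
    # and zipfile needs a seekable object to read the central directory.
    with tempfile.TemporaryFile() as tmp:
        s3.download_fileobj(bucket, zip_key, tmp)
        tmp.seek(0)
        with zipfile.ZipFile(tmp) as zf:
            member = zf.namelist()[0]  # assumes one CSV per archive
            with zf.open(member) as stream:
                # upload_fileobj streams the decompressed bytes back to S3
                # using multipart upload under the hood, so the 5 GB
                # single-PUT limit does not apply.
                s3.upload_fileobj(stream, bucket, csv_key)
```

Note that wherever this runs needs enough local disk for the compressed archive; only the decompressed stream avoids touching disk.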

asked Sep 21 '15 by Alex Hall


People also ask

How do I open a zip file that is too big?

Ordinarily, you can just double-click a ZIP file and macOS will unzip it. If that fails for a very large archive, you may need to fetch the free The Unarchiver application from the Mac App Store and give it a try.

How do I unzip a file larger than 4GB?

If any single file in your ZIP archive is over 4 GB, then a 64-bit unarchiving program is required to open the .zip file; otherwise you may get stuck in a loop and be unable to extract the files.

What is the largest size file you can transfer to S3?

Individual Amazon S3 objects can range in size from a minimum of 0 bytes to a maximum of 5 TB. The largest object that can be uploaded in a single PUT is 5 GB. For objects larger than 100 MB, customers should consider using the Multipart Upload capability.
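With boto3, that multipart behavior is a one-line configuration. A minimal sketch, with the file and bucket names hypothetical:

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Switch to multipart upload for anything over 100 MB, per the guidance above.
config = TransferConfig(multipart_threshold=100 * 1024 * 1024)

# upload_file splits the object into parts transparently, so files larger
# than the 5 GB single-PUT limit upload without any extra handling.
s3.upload_file("large.csv", "my-bucket", "large.csv", Config=config)
```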


2 Answers

@E.J. Brennan is right. I had a chat with AWS support, and they told me we cannot use Lambda for this operation. Here is the guidance I got from Support:

  • Whenever a file is dropped in S3, trigger a notification to SQS.
  • Have an EC2 instance listen to SQS (see the worker sketch after this list).
  • Do the unzip on EC2.
  • Add another notification to SQS so that the next Lambda function can do the further processing.
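A minimal sketch of such a worker loop with boto3, where the queue URL is hypothetical and `unzip_on_s3` is the helper from the sketch in the question:

```python
import json

import boto3

sqs = boto3.client("sqs")

# Hypothetical queue URL; substitute your own.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/unzip-queue"

def poll_and_unzip():
    while True:
        # Long-poll so the loop does not hammer SQS between uploads.
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            body = json.loads(msg["Body"])
            # S3 test events carry no "Records" key; skip those.
            for record in body.get("Records", []):
                bucket = record["s3"]["bucket"]["name"]
                key = record["s3"]["object"]["key"]
                unzip_on_s3(bucket, key, key.rsplit(".", 1)[0] + ".csv")
            # Delete only after the work succeeds, so failures get retried.
            sqs.delete_message(
                QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"]
            )
```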

Hope it helps someone. I wasted a lot of time solving this issue.

Solution/workaround!

After a long struggle I got a solution from my tech lead: we can use AWS Glue to solve this issue, since a Glue job has more memory available than Lambda. It gets the job done.
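If you go this route, the streaming script from the question's sketch can be registered as a Glue Python shell job. A sketch of wiring that up with boto3, where the job name, IAM role, and script location are all hypothetical:

```python
import boto3

glue = boto3.client("glue")

# The script at this S3 path would contain the same download/unzip/upload
# logic as the earlier sketch.
glue.create_job(
    Name="unzip-large-csv",
    Role="arn:aws:iam::123456789012:role/GlueUnzipRole",
    Command={
        "Name": "pythonshell",
        "ScriptLocation": "s3://my-bucket/scripts/unzip_on_s3.py",
        "PythonVersion": "3",
    },
    MaxCapacity=1.0,  # 1 DPU Python shell job: more memory than Lambda offered
)
glue.start_job_run(JobName="unzip-large-csv")
```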

answered Oct 26 '22 by Dilip Rajkumar


Your best bet is probably to have an S3 event notification sent to an SQS queue every time a zip file is uploaded to S3, and have one or more EC2 instances polling the queue, waiting for files to unzip.

You may only need one running instance to do this, but you could also have an autoscaling policy that spins up more instances if the size of the SQS queue grows too big for a single instance to do the unzipping fast enough (as defined by you).
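One way to wire that up is a CloudWatch alarm on the queue depth that triggers a scale-out policy on the Auto Scaling group. A sketch with hypothetical group and queue names, assuming the group already exists:

```python
import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

# Simple scaling policy: add one instance each time the alarm fires.
policy = autoscaling.put_scaling_policy(
    AutoScalingGroupName="unzip-workers",
    PolicyName="scale-out-on-backlog",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=1,
)

# Fire when the backlog stays above 100 messages for two consecutive
# minutes; tune the threshold to whatever "fast enough" means for you.
cloudwatch.put_metric_alarm(
    AlarmName="unzip-queue-backlog",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "unzip-queue"}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=2,
    Threshold=100,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy["PolicyARN"]],
)
```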

answered Oct 26 '22 by E.J. Brennan