
Merging files on AWS S3 (Using Apache Camel)

I have some files that are uploaded to S3 and processed for a Redshift task. After that task is complete, these files need to be merged. Currently I delete the files and upload the merged files again, which eats up a lot of bandwidth. Is there any way the files can be merged directly on S3?

I am using Apache Camel for routing.

Sumit Srivastava asked Oct 10 '13 07:10


2 Answers

S3 allows you to use an S3 object URI as the source for a copy operation. Combined with S3's Multipart Upload API, this lets you supply several S3 object URIs as the source keys for a multipart upload.
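As a rough illustration of that idea, here is a minimal sketch in Python. It assumes `s3` is a boto3 S3 client (for example `boto3.client("s3")`) and that all sources live in one bucket; the function name `concat_on_s3` and its arguments are made up for this example and are not from the original answer. It ignores the part-size limit discussed next.

```python
def concat_on_s3(s3, bucket, source_keys, dest_key):
    """Concatenate existing objects into dest_key using server-side part copies.

    `s3` is assumed to be a boto3 S3 client. No object data passes
    through the machine running this code.
    """
    upload_id = s3.create_multipart_upload(Bucket=bucket, Key=dest_key)["UploadId"]
    parts = []
    try:
        # Part numbers start at 1 and define the order of concatenation.
        for number, key in enumerate(source_keys, start=1):
            resp = s3.upload_part_copy(
                Bucket=bucket,
                Key=dest_key,
                UploadId=upload_id,
                PartNumber=number,
                CopySource={"Bucket": bucket, "Key": key},
            )
            parts.append({"ETag": resp["CopyPartResult"]["ETag"], "PartNumber": number})
        s3.complete_multipart_upload(
            Bucket=bucket,
            Key=dest_key,
            UploadId=upload_id,
            MultipartUpload={"Parts": parts},
        )
    except Exception:
        # Abort so the incomplete upload does not linger and accrue storage charges.
        s3.abort_multipart_upload(Bucket=bucket, Key=dest_key, UploadId=upload_id)
        raise
    return parts
```

A call would look something like `concat_on_s3(boto3.client("s3"), "my-bucket", ["a.csv", "b.csv"], "merged.csv")`, where the bucket and key names are placeholders.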

However, the devil is in the details. S3's multipart upload API has a minimum part size of 5MB. Thus, if any file in the series being concatenated is < 5MB, the upload will fail.

You can work around this by exploiting the loophole that allows the final part of an upload to be < 5MB (allowed because this happens in the real world when uploading remainder pieces).
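To see in advance which files would trip the limit, a small check along these lines may help. It is only a sketch: `undersized_parts` is a hypothetical helper name, and the sizes are assumed to come from wherever you already track the manifest (for instance, object listings).

```python
MIN_PART_SIZE = 5 * 1024 * 1024  # 5MB minimum for every part except the last


def undersized_parts(sizes):
    """Return the indexes of non-final files that are below the minimum part size.

    The last file is skipped because the final part of a multipart
    upload is allowed to be smaller than 5MB.
    """
    return [i for i, size in enumerate(sizes[:-1]) if size < MIN_PART_SIZE]
```

An empty result means the files can be concatenated in a single multipart upload; anything else needs the workaround.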

My production code does this by:

  1. Interrogating the manifest of files to be uploaded.
  2. If the first part is under 5MB, download pieces* and buffer to disk until 5MB is buffered.
  3. Append parts sequentially until the file concatenation is complete.
  4. If a non-terminus file is < 5MB, append it, then finish the upload, create a new upload, and continue.
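One possible reading of steps 3 and 4 is sketched below as a pure planning function; it is not the production code mentioned above. It assumes the first file is already at least 5MB (step 2's buffering is handled separately), and it assumes each new upload starts from the object produced by the previous one, marked here with the placeholder `"prev"`. The name `plan_uploads` is hypothetical.

```python
MIN_PART_SIZE = 5 * 1024 * 1024  # 5MB minimum for every part except the last


def plan_uploads(sizes):
    """Group file indexes into consecutive multipart uploads.

    A file under 5MB that is not the last file closes the current upload,
    because only the final part of an upload may be that small. The next
    upload then begins with "prev", the result of the one just completed.
    """
    if not sizes:
        return []
    if len(sizes) > 1 and sizes[0] < MIN_PART_SIZE:
        raise ValueError("first file is under 5MB; buffer it up to 5MB first")
    uploads, current = [], []
    last = len(sizes) - 1
    for i, size in enumerate(sizes):
        current.append(i)
        if size < MIN_PART_SIZE and i != last:
            uploads.append(current)  # small file becomes the final part
            current = ["prev"]       # continue from the completed object
    uploads.append(current)
    return uploads
```

For files of 10MB, 1MB, 7MB, 2MB and 6MB this yields three uploads: files 0 and 1, then the previous result plus files 2 and 3, then the previous result plus file 4.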

Finally, there is a bug in the S3 API. The ETag (which is normally an MD5 checksum of the file on S3) is not properly recalculated at the completion of a multipart upload. To fix this, copy the file on completion. If you use a temp location during concatenation, this will be resolved by the final copy operation.

* Note that you can download a byte range of a file. This way, if part 1 is 10K and part 2 is 5GB, you only need to read the first 5110K of part 2 to meet the 5MB size needed to continue.
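The arithmetic in that note can be sketched as follows. The helper names `ranges_to_buffer` and `range_header` are made up for illustration; the header string follows the standard HTTP `Range` format, which uses zero-based, inclusive byte positions.

```python
MIN_PART_SIZE = 5 * 1024 * 1024  # 5MB, i.e. 5120K


def ranges_to_buffer(sizes, target=MIN_PART_SIZE):
    """Return (file_index, first_byte, last_byte) tuples totalling `target` bytes.

    Files are consumed from the start of the list; only as much of each
    file as is still needed gets included.
    """
    ranges, needed = [], target
    for i, size in enumerate(sizes):
        if needed <= 0:
            break
        take = min(size, needed)
        if take > 0:
            ranges.append((i, 0, take - 1))
        needed -= take
    return ranges


def range_header(first_byte, last_byte):
    """Format an inclusive byte range as an HTTP Range header value."""
    return f"bytes={first_byte}-{last_byte}"
```

With a 10K first file and a 5GB second file, this asks for all of the first file and bytes 0 through 5232639 of the second, which is exactly 5110K.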

** You could also have a 5MB block of zeros on S3 and use it as your default starting piece. Then, when the upload is complete, do a file copy using byte range of 5MB+1 to EOF-1

P.S. When I have time to make a Gist of this code I'll post the link here.

Joseph Lust answered Nov 19 '22 04:11


You can use Multipart Upload with Copy to merge objects on S3 without downloading and uploading them again.

You can find some examples in Java, .NET or with the REST API here.

danilop answered Nov 19 '22 04:11