I'm working on a machine with limited memory, and I'd like to upload a dynamically generated (not-from-disk) file in a streaming manner to S3. In other words, I don't know the file size when I start the upload, but I'll know it by the end. Normally a PUT request has a Content-Length header, but perhaps there is a way around this, such as using a multipart upload or chunked transfer encoding.
S3 can support streaming uploads. For example, see here:
http://blog.odonnell.nu/posts/streaming-uploads-s3-python-and-poster/
My question is, can I accomplish the same thing without having to specify the file length at the start of the upload?
You can set up an Amazon Kinesis Data Firehose delivery stream to start streaming your data into Amazon S3 buckets using the following steps:
Step 1: Sign in to the AWS Console for Amazon Kinesis.
Step 2: Configure the delivery stream.
Step 3: Transform records using a Lambda function.
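Once such a delivery stream exists, you can push records to it from code and let Firehose buffer them into the S3 bucket. A minimal sketch with boto3, assuming a delivery stream named 'my-delivery-stream' was already created via the console steps above:

import boto3

firehose = boto3.client('firehose')

# 'my-delivery-stream' is a placeholder for the delivery stream configured above.
firehose.put_record(
    DeliveryStreamName='my-delivery-stream',
    Record={'Data': b'one line of generated data\n'},
)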
When you upload large files to Amazon S3, it's a best practice to leverage multipart uploads. If you're using the AWS Command Line Interface (AWS CLI), then all high-level aws s3 commands automatically perform a multipart upload when the object is large. These high-level commands include aws s3 cp and aws s3 sync.
Upload a single object using the Amazon S3 Console: with the console, you can upload a single object up to 160 GB in size.
Individual Amazon S3 objects can range in size from a minimum of 0 bytes to a maximum of 5 TB. The largest object that can be uploaded in a single PUT is 5 GB. For objects larger than 100 MB, customers should consider using the Multipart Upload capability.
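On the SDK side, the high-level transfer helpers behave like those CLI commands: for example, boto3's upload_fileobj accepts any readable, binary file-like object and performs a multipart upload automatically, so the total size does not have to be known up front. A minimal sketch, assuming boto3 and hypothetical bucket/key names:

import io
import boto3

s3 = boto3.client('s3')

# Any readable binary stream works here; the transfer manager splits it
# into parts automatically. Bucket and key names are placeholders.
stream = io.BytesIO(b'dynamically generated data')
s3.upload_fileobj(stream, 'my-bucket', 'generated-file.bin')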
You have to upload your file in 5MiB+ chunks via S3's multipart API. Each of those chunks requires a Content-Length, but you can avoid loading huge amounts of data (100MiB+) into memory.
S3 allows up to 10,000 parts, so with a part size of 5MiB you can upload dynamically generated files of up to about 50GiB. That should be enough for most use cases.
However, if you need more, you have to increase your part size, either by using a larger fixed part size (10MiB, for example) or by increasing it during the upload, for example:
First 25 parts: 5MiB (total: 125MiB)
Next 25 parts: 10MiB (total: 375MiB)
Next 25 parts: 25MiB (total: 1GiB)
Next 25 parts: 50MiB (total: 2.25GiB)
After that: 100MiB
This will allow you to upload files of up to 1TB (S3's limit for a single file is 5TB right now) without wasting memory unnecessarily.
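The answer above describes the technique without the API calls, so here is a minimal sketch of a streaming multipart upload in Python with boto3, assuming a fixed 5MiB part size and hypothetical bucket/key names (an illustration of the idea, not the original author's code). Only one part is held in memory at a time, and the total size never has to be known up front; for files beyond roughly 50GiB you would grow PART_SIZE during the loop as described above.

import boto3

s3 = boto3.client('s3')
BUCKET, KEY = 'my-bucket', 'generated-file.bin'   # hypothetical names
PART_SIZE = 5 * 1024 * 1024                       # 5MiB minimum part size

def stream_to_s3(chunks):
    """Upload an iterable of byte strings whose total length is unknown."""
    upload_id = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)['UploadId']
    parts, buffer, part_number = [], b'', 1
    try:
        for chunk in chunks:
            buffer += chunk
            # Flush full-size parts as soon as enough data has accumulated.
            while len(buffer) >= PART_SIZE:
                body, buffer = buffer[:PART_SIZE], buffer[PART_SIZE:]
                resp = s3.upload_part(Bucket=BUCKET, Key=KEY, UploadId=upload_id,
                                      PartNumber=part_number, Body=body)
                parts.append({'ETag': resp['ETag'], 'PartNumber': part_number})
                part_number += 1
        if buffer:  # the final part is allowed to be smaller than 5MiB
            resp = s3.upload_part(Bucket=BUCKET, Key=KEY, UploadId=upload_id,
                                  PartNumber=part_number, Body=buffer)
            parts.append({'ETag': resp['ETag'], 'PartNumber': part_number})
        s3.complete_multipart_upload(Bucket=BUCKET, Key=KEY, UploadId=upload_id,
                                     MultipartUpload={'Parts': parts})
    except Exception:
        s3.abort_multipart_upload(Bucket=BUCKET, Key=KEY, UploadId=upload_id)
        raise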
The blog author's problem is different from yours: he knows and uses the Content-Length before the upload. What he wants to improve on is this situation: many libraries handle uploads by loading all of the data from a file into memory. In pseudo-code, that would be something like this:
data = File.read(file_name)
request = new S3::PutFileRequest()
request.setHeader('Content-Length', data.size)
request.setBody(data)
request.send()
His solution gets the Content-Length via the filesystem API and then streams the data from disk into the request stream. In pseudo-code:
upload = new S3::PutFileRequestStream()
upload.writeHeader('Content-Length', File.getSize(file_name))
upload.flushHeader()

input = File.open(file_name, File::READONLY_FLAG)
while (data = input.read())
    upload.write(data)
end

upload.flush()
upload.close()
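For comparison, that known-length, stream-from-disk pattern is a one-liner with boto3's put_object, which can take an open file object as the body and derives the Content-Length from the file on disk, so the data is streamed rather than read into memory first. A sketch with hypothetical bucket/key names:

import boto3

s3 = boto3.client('s3')

# The body is streamed from disk; boto3 determines the Content-Length by
# seeking the file. Bucket and key names are placeholders.
with open('large-file.bin', 'rb') as f:
    s3.put_object(Bucket='my-bucket', Key='large-file.bin', Body=f)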
Putting this answer here for others in case it helps:
If you don't know the length of the data you are streaming up to S3, you can use S3FileInfo and its OpenWrite() method to write arbitrary data into S3.
var fileInfo = new S3FileInfo(amazonS3Client, "MyBucket", "streamed-file.txt");
using (var outputStream = fileInfo.OpenWrite())
{
    using (var streamWriter = new StreamWriter(outputStream))
    {
        streamWriter.WriteLine("Hello world");
        // You can do as many writes as you want here
    }
}