I am downloading files from S3, transforming the data inside them, and then creating a new file to upload to S3. The files I am downloading are less than 2 GB each, but because I am enhancing the data, the file I upload ends up quite large (200 GB+).
Currently you can imagine my code is like:
files = list_files_in_s3()
new_file = open('new_file', 'w')

for file in files:
    file_data = fetch_object_from_s3(file)
    str_out = ''
    for data in file_data:
        str_out += transform_data(data)
    new_file.write(str_out)

s3.upload_file('new_file', 'bucket', 'key')
The problem with this is that 'new_file' is sometimes too big to fit on disk. Because of this, I want to use boto3's upload_fileobj
to upload the data in streaming form so that I don't need the temp file on disk at all.
Can someone help provide an example of this? The Python method seems quite different from Java, which I am familiar with.
You can use the amt parameter of the read function on the streaming response body, documented here: https://botocore.amazonaws.com/v1/documentation/api/latest/reference/response.html.
Then use MultiPartUpload, documented here, to upload the file piece by piece: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#multipartupload
https://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.html
You should also have a lifecycle rule that deletes incomplete multipart uploads:
https://aws.amazon.com/es/blogs/aws/s3-lifecycle-management-update-support-for-multipart-uploads-and-delete-markers/
Otherwise you may end up paying for incomplete parts stored in S3.
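If you want to create that rule from code instead of the console, a minimal sketch using boto3's put_bucket_lifecycle_configuration could look like the following; the bucket name, rule ID, and 7-day window are placeholders of my own, not anything from your setup:

import boto3

s3_client = boto3.client("s3")

# abort (and stop paying for) any multipart upload that is still incomplete
# 7 days after it was initiated; the values here are just examples
s3_client.put_bucket_lifecycle_configuration(
    Bucket="your-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "abort-incomplete-multipart-uploads",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # apply to the whole bucket
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7}
        }]
    }
)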
I copy-pasted something from my own script to do this. It shows how you can stream all the way from downloading to uploading, in case you have memory limitations to consider. You could also alter it to store the file locally before you upload.
You will have to use MultiPartUpload anyway, since S3 has limits on how large a file you can upload in a single action: https://aws.amazon.com/s3/faqs/
"The largest object that can be uploaded in a single PUT is 5 gigabytes. For objects larger than 100 megabytes, customers should consider using the Multipart Upload capability."
This is a code sample (I haven't tested this exact code):
import boto3

amt = 1024 * 1024 * 10  # read 10 MB at a time (every part except the last must be at least 5 MB)

session = boto3.Session(profile_name='yourprofile')
s3res = session.resource('s3')

source_s3file = "yourfile.file"
target_s3file = "yourfile.file"

source_s3obj = s3res.Object("your-bucket", source_s3file)
target_s3obj = s3res.Object("your-bucket", target_s3file)

# initiate the MultiPartUpload on the target object
mpu = target_s3obj.initiate_multipart_upload()
partNr = 0
parts = []

# the Body of get() is a StreamingBody; every call to read(amt=...)
# returns the next chunk of data until the stream is exhausted
body = source_s3obj.get()["Body"]

while True:
    chunk = body.read(amt=amt)  # this is where you use the amt parameter
    if not chunk:
        break  # no more data
    # do something with the chunk here (e.g. decode("utf-8"), transform,
    # re-encode), then upload it as the next part
    partNr += 1
    part = mpu.Part(partNr)
    response = part.upload(Body=chunk)
    parts.append({
        "PartNumber": partNr,
        "ETag": response["ETag"]
    })

# no more chunks, complete the upload
part_info = {"Parts": parts}
mpu_result = mpu.complete(MultipartUpload=part_info)
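One thing the sample above does not handle: if an exception is raised partway through the loop, the multipart upload is left open and its already-uploaded parts keep costing storage until the lifecycle rule mentioned earlier cleans them up. A minimal, untested sketch of aborting eagerly, reusing the mpu and parts names from the sample:

try:
    # ... the read/upload loop from the sample above ...
    mpu_result = mpu.complete(MultipartUpload={"Parts": parts})
except Exception:
    # discard the parts that were already uploaded, then re-raise
    mpu.abort()
    raise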