
Stream large string to S3 using boto3

I am downloading files from S3, transforming the data inside them, and then creating a new file to upload to S3. The files I am downloading are less than 2 GB, but because I am enhancing the data, the file I go to upload is quite large (200 GB+).

Currently, you could imagine my code is like:

files = list_files_in_s3()
new_file = open('new_file','w')
for file in files:
    file_data = fetch_object_from_s3(file)
    str_out = ''
    for data in file_data:
        str_out += transform_data(data)
    new_file.write(str_out)
new_file.close()
s3.upload_file('new_file', 'bucket', 'key')

The problem with this is that 'new_file' is sometimes too big to fit on disk. Because of this, I want to use boto3's upload_fileobj to upload the data in streaming form so that I don't need to have the temp file on disk at all.

Can someone help provide an example of this? The Python method seems quite different from Java, which I am familiar with.

asked Mar 06 '23 by frosty

1 Answer

You can use the amt parameter of the read function, documented here: https://botocore.amazonaws.com/v1/documentation/api/latest/reference/response.html.

Then use MultipartUpload, documented here, to upload the file piece by piece: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#multipartupload

https://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.html

You should have a rule that deletes incomplete multipart uploads:

https://aws.amazon.com/es/blogs/aws/s3-lifecycle-management-update-support-for-multipart-uploads-and-delete-markers/

or else you may end up paying for incomplete data parts stored in S3.
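
If you'd rather set that rule up programmatically instead of through the console, a minimal sketch with boto3 might look like this (the bucket name, rule ID, and the 7-day window are placeholders, not values from the original setup):

import boto3

s3client = boto3.client("s3")

# Abort multipart uploads that are still incomplete 7 days after they were
# started; aborting also deletes the parts already uploaded, so you stop
# paying for them. "your-bucket" is a placeholder.
s3client.put_bucket_lifecycle_configuration(
    Bucket="your-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "abort-incomplete-multipart-uploads",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    },
)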

I copy-pasted something from my own script to do this. It shows how you can stream all the way from downloading to uploading, in case you have memory limitations to consider. You could also alter this to store the file locally before you upload.

You will have to use MultipartUpload anyway, since S3 has limits on how large a file you can upload in one action: https://aws.amazon.com/s3/faqs/

"The largest object that can be uploaded in a single PUT is 5 gigabytes. For objects larger than 100 megabytes, customers should consider using the Multipart Upload capability."

This is a code sample (I haven't tested this code as it is here):

import boto3

amt = 1024 * 1024 * 10  # read 10 MB at a time (every part except the last must be at least 5 MB)
session = boto3.Session(profile_name='yourprofile')
s3res = session.resource('s3')
source_s3file = "yourfile.file"
target_s3file = "yourfile.file"
source_s3obj = s3res.Object("your-bucket", source_s3file)
target_s3obj = s3res.Object("your-bucket", target_s3file)

# initiate the MultipartUpload
mpu = target_s3obj.initiate_multipart_upload()
partNr = 0
parts = []

body = source_s3obj.get()["Body"]
while True:
    # every call to read() returns the next chunk of data until the stream is empty;
    # this is where you use the amt parameter
    chunk = body.read(amt=amt).decode("utf-8")
    if len(chunk) == 0:
        break  # no more data to upload
    # do something with the chunk, then upload it as the next part
    partNr += 1
    part = mpu.Part(partNr)
    response = part.upload(Body=chunk)
    parts.append({
        "PartNumber": partNr,
        "ETag": response["ETag"]
    })

# no more chunks, complete the upload
part_info = {"Parts": parts}
mpu_result = mpu.complete(MultipartUpload=part_info)
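
If the transform or one of the part uploads fails partway through, it may also be worth aborting the multipart upload explicitly rather than waiting for the lifecycle rule. A minimal sketch, wrapping the loop above (mpu and part_info are the objects from the sample, not new names):

try:
    # ... the read/transform/upload loop from above ...
    mpu_result = mpu.complete(MultipartUpload=part_info)
except Exception:
    # discard any parts already uploaded so they don't keep costing storage
    mpu.abort()
    raise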
answered Apr 02 '23 by Jørgen Frøland