
Does AWS S3 GetObject read a partial object while that object is still being uploaded to S3?

I have a Lambda (L1) that replaces a 100 MB file at an S3 location (s3://bucket/folder/abc.json). I have two other Lambdas (L2, L3) reading the same file at the same time, one via the Go API and the other via an Athena query. The S3 bucket/folder is not versioned.

The question is: do the Lambdas L2 and L3 read the old copy of the file until the new file is fully uploaded? Or do they read the partial file that is being uploaded? If it's the latter, how do you make sure that L2 and L3 read the file only after a complete upload?

asked Sep 10 '25 by chendu

1 Answer

Amazon S3 is now strongly consistent. This means that once you upload an object, anyone who reads that object is guaranteed to get the updated version.

On the surface, that sounds like it answers your question with "yes, all clients will get either the old version or the new version of the file". The truth is a bit fuzzier than that.

Under the covers, many of the S3 APIs upload with a multi-part upload. This is well known, and it doesn't change what I've said above, since the upload must finish before the new object becomes available. However, many of the APIs also use multiple byte-range requests when downloading larger objects, and that is where the problem lies: a download might fetch part of file v1, and by the time it requests the next part, it might get v2 if v2 was just uploaded.
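To make that concrete, here is a rough sketch of the difference (the bucket and key names are placeholders and the chunk size is arbitrary): a single GetObject call returns one consistent version of the object, whereas the byte-range GETs that transfer managers issue for large objects are each consistent on their own but are not tied to the same version.

#!/usr/bin/env python3
# Sketch only: bucket, key, and chunk size are placeholders/assumptions.

import boto3

bucket = "a-bucket-to-use"
key = "temp/dummy_key"

s3 = boto3.client('s3')

# A single GetObject call: the body you read is one consistent version.
whole = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

# What a transfer manager effectively does for large objects: several
# independent byte-range GETs. Nothing ties them to the same version,
# so an overwrite landing mid-loop can mix v1 and v2 bytes.
size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
chunk = 8 * 1024 * 1024
parts = []
for start in range(0, size, chunk):
    end = min(start + chunk, size) - 1
    resp = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={start}-{end}")
    parts.append(resp["Body"].read())
stitched = b"".join(parts)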

With a little bit of effort, we can demonstrate this:

#!/usr/bin/env python3

import boto3
import multiprocessing
import io
import threading

bucket = "a-bucket-to-use"
key = "temp/dummy_key"
size = 104857600

class ProgressWatcher:
    def __init__(self, filesize, downloader):
        self._size = float(filesize)
        self._seen_so_far = 0
        self._lock = threading.Lock()
        self._launch = True
        self.downloader = downloader

    def __call__(self, bytes_amount):
        with self._lock:
            self._seen_so_far += bytes_amount
            if self._launch and (self._seen_so_far / self._size) >= 0.95:
                self._launch = False
                self.downloader.start()

def upload_helper(pattern, name, callback):
    # Upload a file of 100mb of "pattern" bytes
    s3 = boto3.client('s3')
    print(f"Uploading all {name}..")
    temp = io.BytesIO(pattern * size)
    s3.upload_fileobj(temp, bucket, key, Callback=callback)
    print(f"Done uploading all {name}")

def download_helper():
    # Download a file
    s3 = boto3.client('s3')
    print("Starting download...")
    s3.download_file(bucket, key, "temp_local_copy")
    print("Done with download")

def main():
    # See how long an upload takes
    upload_helper(b'0', "zeroes", None)

    # Watch how the next upload progresses, this will start a download when it's nearly done
    watcher = ProgressWatcher(size, multiprocessing.Process(target=download_helper))
    # Start another upload, overwriting the all-zero file with all-ones
    upload_helper(b'1', "ones", watcher)

    # Wait for the downloader to finish
    watcher.downloader.join()

    # See what the resulting file looks like
    print("Loading file..")
    counts = [0, 0]
    with open("temp_local_copy") as f:
        for x in f.read():
            counts[ord(x) - ord(b'0')] += 1
    
    print("Results")
    print(counts)

if __name__ == "__main__":
    main()

This code uploads an object to S3 that is 100 MB of "0" bytes. It then starts a second upload to the same key, this time 100 MB of "1" bytes, and when that second upload is 95% done it starts a download of the object. Finally, it counts how many "0" and "1" bytes appear in the downloaded file.

Running this with the latest versions of Python and Boto3, your exact output will no doubt differ from mine due to network conditions, but this is what I saw on one test run:

Uploading all zeroes..
Done uploading all zeroes
Uploading all ones..
Starting download...
Done uploading all ones
Done with download
Loading file..
Results
[83886080, 20971520]

The last line is the important one. The downloaded file was mostly "0" bytes, but there were 20 MB of "1" bytes. In other words, I got part of v1 of the file and part of v2, despite making only a single download call.

Now, in practice this is unlikely to happen, and even less likely if you have better network bandwidth than I do here on a run-of-the-mill home Internet connection. But it can always happen. If you need to ensure that downloaders never see a partial file like this, you either need to do something like verify a hash of the file after downloading, or (my preference) upload to a different key each time and have some mechanism for clients to discover the "latest" key, so they download a whole, unchanged object even if a new upload finishes while they're downloading.
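For the different-keys option, a minimal sketch might look like this (the pointer object "folder/latest" and the key naming scheme are my own conventions, not anything S3 provides):

#!/usr/bin/env python3
# Sketch only: the pointer key and naming scheme are assumptions.

import uuid
import boto3

bucket = "a-bucket-to-use"
pointer_key = "folder/latest"   # tiny object whose body is the current data key

s3 = boto3.client('s3')

def publish(data):
    # Write the new version under a unique key, then flip the pointer.
    data_key = f"folder/abc-{uuid.uuid4().hex}.json"
    s3.put_object(Bucket=bucket, Key=data_key, Body=data)
    s3.put_object(Bucket=bucket, Key=pointer_key, Body=data_key.encode())
    return data_key

def fetch():
    # Resolve the pointer first, then download the immutable object it names.
    # A reader that already resolved an older key keeps downloading that
    # unchanged object, so it never sees a half-written file.
    data_key = s3.get_object(Bucket=bucket, Key=pointer_key)["Body"].read().decode()
    return s3.get_object(Bucket=bucket, Key=data_key)["Body"].read()

Old data keys can then be cleaned up later, for example with a lifecycle rule, once no reader still needs them.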

answered Sep 13 '25 by Anon Coward