Sync local folder to s3 bucket using boto3

I've noticed there is no API in boto3 for the "sync" operation that you can perform through the command line.

So,

How do I sync a local folder to a given bucket using boto3?

asked Jul 04 '19 by Raydel Miranda


1 Answer

I've just implemented a simple class for this matter. I'm posting it here hoping it helps anyone with the same issue.

You could modify S3Sync.sync in order to take file size into account as well (see the sketch after the code below).

import boto3
from bisect import bisect_left
from pathlib import Path
from typing import List


class S3Sync:
    """
    Class that holds the operations needed for synchronize local dirs to a given bucket.
    """

    def __init__(self):
        self._s3 = boto3.client('s3')

    def sync(self, source: str, dest: str) -> None:
        """
        Sync source to dest: every element that exists in source
        but does not exist in dest will be copied to dest.

        No element will be deleted.

        :param source: Source folder.
        :param dest: Destination folder.

        :return: None
        """

        paths = self.list_source_objects(source_folder=source)
        objects = self.list_bucket_objects(dest)

        # Get the keys and sort them so we can binary search
        # each time we want to check whether a path is already there.
        object_keys = [obj['Key'] for obj in objects]
        object_keys.sort()
        object_keys_length = len(object_keys)
        
        for path in paths:
            # Binary search for path among the sorted keys.
            index = bisect_left(object_keys, path)
            if index == object_keys_length or object_keys[index] != path:
                # path was not found in object_keys, so it has to be synced.
                self._s3.upload_file(str(Path(source).joinpath(path)),
                                     Bucket=dest, Key=path)

    def list_bucket_objects(self, bucket: str) -> List[dict]:
        """
        List all objects for the given bucket.

        :param bucket: Bucket name.
        :return: A list of dicts, one per object in the bucket.

        Example of a single object.

        {
            'Key': 'example/example.txt',
            'LastModified': datetime.datetime(2019, 7, 4, 13, 50, 34, 893000, tzinfo=tzutc()),
            'ETag': '"b11564415be7f58435013b414a59ae5c"',
            'Size': 115280,
            'StorageClass': 'STANDARD',
            'Owner': {
                'DisplayName': 'webfile',
                'ID': '75aa57f09aa0c8caeab4f8c24e99d10f8e7faeebf76c078efc7c6caea54ba06a'
            }
        }

        """
        try:
            contents = self._s3.list_objects(Bucket=bucket)['Contents']
        except KeyError:
            # No Contents Key, empty bucket.
            return []
        else:
            return contents

    @staticmethod
    def list_source_objects(source_folder: str) -> List[str]:
        """
        :param source_folder:  Root folder for resources you want to list.
        :return: A list of file paths relative to source_folder.

        Example:

            /tmp
                - example
                    - file_1.txt
                    - some_folder
                        - file_2.txt

            >>> sync.list_source_objects("/tmp/example")
            ['file_1.txt', 'some_folder/file_2.txt']

        """

        path = Path(source_folder)

        paths = []

        for file_path in path.rglob("*"):
            if file_path.is_dir():
                continue
            # Store the path relative to the source folder, using forward
            # slashes so it can be used directly as an S3 key.
            paths.append(file_path.relative_to(path).as_posix())

        return paths


if __name__ == '__main__':
    sync = S3Sync()
    sync.sync("/temp/some_folder", "some_bucket_name")
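
Regarding the file-size suggestion above, here is a minimal sketch of what a size-aware variant of sync could look like: it re-uploads a file when its key is missing from the bucket or when the local size differs from the object's Size field. The method name sync_with_size is hypothetical; it would live on the same class.

    def sync_with_size(self, source: str, dest: str) -> None:
        # Map each remote key to its size so changed files can be detected.
        remote_sizes = {obj['Key']: obj['Size']
                        for obj in self.list_bucket_objects(dest)}

        source_path = Path(source)
        for path in self.list_source_objects(source_folder=source):
            local_file = source_path / path
            # Upload when the key is missing or the sizes differ.
            if remote_sizes.get(path) != local_file.stat().st_size:
                self._s3.upload_file(str(local_file), Bucket=dest, Key=path)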

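Also note that list_objects returns at most 1,000 keys per call, so list_bucket_objects as written will miss objects in larger buckets. A sketch of a paginated version, using list_objects_v2 through boto3's paginator:

    def list_bucket_objects(self, bucket: str) -> List[dict]:
        # Paginate so buckets with more than 1,000 objects are fully listed.
        paginator = self._s3.get_paginator('list_objects_v2')
        contents = []
        for page in paginator.paginate(Bucket=bucket):
            # Pages with no objects have no 'Contents' key.
            contents.extend(page.get('Contents', []))
        return contents
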
Update:

@Z.Wei commented:

Dig into this a little to deal with the weird bisect function. We may just use if path not in object_keys:?

I think this is an interesting question that deserves an answer update rather than getting lost in the comments.

Answer:

No, if path not in object_keys would perform a linear search, which is O(n). bisect_* performs a binary search (the list has to be sorted), which is O(log(n)).

Most of the time you will be dealing with enough objects that sorting once and binary searching is faster than just using the in keyword.

Take into account that you must check every path in the source against every path in the destination, which makes using in O(m * n), where m is the number of objects in the source and n the number in the destination. Using bisect, the lookups cost O(m * log(n)) after an O(n * log(n)) sort.
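
For reference, here is the bisect-based membership test in isolation; a minimal self-contained sketch of the idiom used in sync above:

from bisect import bisect_left
from typing import List

def contains(sorted_keys: List[str], key: str) -> bool:
    # Binary search: an O(log n) membership test on a sorted list.
    index = bisect_left(sorted_keys, key)
    return index < len(sorted_keys) and sorted_keys[index] == key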

But ...

If I think about it, you could use sets to make the algorithm even faster (and simpler, hence more Pythonic):

def sync(self, source: str, dest: str) -> None:

    # Local paths
    paths = set(self.list_source_objects(source_folder=source))

    # Getting the keys (remote s3 paths).
    objects = self.list_bucket_objects(dest)
    object_keys = {obj['Key'] for obj in objects}

    # Compute the set difference: what we have in paths that does
    # not exist in object_keys.
    to_sync = paths - object_keys

    source_path = Path(source)
    for path in to_sync:
        self._s3.upload_file(str(source_path / path),
                             Bucket=dest, Key=path)

Lookups in sets are O(1) on average, so using sets the whole thing is O(m + n), way faster than the previous O(m * log(n)).

Further improvements

The code could be improved even more by making list_bucket_objects and list_source_objects return sets instead of lists; a sketch of the latter follows.
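
A minimal sketch of list_source_objects returning a set (the same logic as above, condensed into a set comprehension; Set comes from the typing module):

    @staticmethod
    def list_source_objects(source_folder: str) -> Set[str]:
        # Collect every file (not directory) under source_folder as a
        # relative, forward-slash path suitable for use as an S3 key.
        root = Path(source_folder)
        return {p.relative_to(root).as_posix()
                for p in root.rglob("*") if p.is_file()}

With both methods returning sets, the set(...) conversions in sync above become unnecessary.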

answered Oct 16 '22 by Raydel Miranda