Sync local folder to s3 bucket using boto3

I've noticed there is no API in boto3 for the "sync" operation that you can perform through the command line.

So,

How do I sync a local folder to a given bucket using boto3?

asked Jul 04 '19 by Raydel Miranda


1 Answer

I've just implemented a simple class for this matter. I'm posting it here hoping it helps anyone with the same issue.

You could modify S3Sync.sync in order to take file size into account as well (see the sketch after the code below).

import boto3
from bisect import bisect_left
from pathlib import Path
from typing import List


class S3Sync:
    """
    Class that holds the operations needed for synchronize local dirs to a given bucket.
    """

    def __init__(self):
        self._s3 = boto3.client('s3')

    def sync(self, source: str, dest: str) -> None:
        """
        Sync source to dest: every element that exists in source
        but does not exist in dest will be copied to dest.

        No element will be deleted.

        :param source: Source folder.
        :param dest: Destination folder.

        :return: None
        """

        paths = self.list_source_objects(source_folder=source)
        objects = self.list_bucket_objects(dest)

        # Get the keys and sort them so we can binary search
        # each time we want to check whether a path is already there.
        object_keys = [obj['Key'] for obj in objects]
        object_keys.sort()
        object_keys_length = len(object_keys)
        
        for path in paths:
            # Binary search for path among the sorted keys.
            index = bisect_left(object_keys, path)
            if index == object_keys_length or object_keys[index] != path:
                # path was not found in object_keys, so it has to be synced.
                self._s3.upload_file(str(Path(source).joinpath(path)),
                                     Bucket=dest, Key=path)

    def list_bucket_objects(self, bucket: str) -> List[dict]:
        """
        List all objects for the given bucket.

        :param bucket: Bucket name.
        :return: A list of dicts, one per object in the bucket.

        Example of a single object.

        {
            'Key': 'example/example.txt',
            'LastModified': datetime.datetime(2019, 7, 4, 13, 50, 34, 893000, tzinfo=tzutc()),
            'ETag': '"b11564415be7f58435013b414a59ae5c"',
            'Size': 115280,
            'StorageClass': 'STANDARD',
            'Owner': {
                'DisplayName': 'webfile',
                'ID': '75aa57f09aa0c8caeab4f8c24e99d10f8e7faeebf76c078efc7c6caea54ba06a'
            }
        }

        """
        try:
            contents = self._s3.list_objects(Bucket=bucket)['Contents']
        except KeyError:
            # No Contents Key, empty bucket.
            return []
        else:
            return contents

    @staticmethod
    def list_source_objects(source_folder: str) -> List[str]:
        """
        :param source_folder:  Root folder for resources you want to list.
        :return: A list of file paths relative to source_folder.

        Example:

            /tmp
                - example
                    - file_1.txt
                    - some_folder
                        - file_2.txt

            >>> sync.list_source_objects("/tmp/example")
            ['file_1.txt', 'some_folder/file_2.txt']

        """

        path = Path(source_folder)

        paths = []

        for file_path in path.rglob("*"):
            if file_path.is_dir():
                continue
            # Store the path relative to the source folder, using forward
            # slashes so it can be used directly as an S3 key.
            paths.append(file_path.relative_to(path).as_posix())

        return paths


if __name__ == '__main__':
    sync = S3Sync()
    sync.sync("/temp/some_folder", "some_bucket_name")
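
Regarding the file-size suggestion above, here is a minimal sketch of what a size-aware variant of sync could look like: it re-uploads a file when its key is missing from the bucket or when the local size differs from the object's Size field. The method name sync_with_size is hypothetical; it would live on the same class.

    def sync_with_size(self, source: str, dest: str) -> None:
        # Map each remote key to its size so changed files can be detected.
        remote_sizes = {obj['Key']: obj['Size']
                        for obj in self.list_bucket_objects(dest)}

        source_path = Path(source)
        for path in self.list_source_objects(source_folder=source):
            local_file = source_path / path
            # Upload when the key is missing or the sizes differ.
            if remote_sizes.get(path) != local_file.stat().st_size:
                self._s3.upload_file(str(local_file), Bucket=dest, Key=path)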

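Also note that list_objects returns at most 1,000 keys per call, so list_bucket_objects as written will miss objects in larger buckets. A sketch of a paginated version, using list_objects_v2 through boto3's paginator:

    def list_bucket_objects(self, bucket: str) -> List[dict]:
        # Paginate so buckets with more than 1,000 objects are fully listed.
        paginator = self._s3.get_paginator('list_objects_v2')
        contents = []
        for page in paginator.paginate(Bucket=bucket):
            # Pages with no objects have no 'Contents' key.
            contents.extend(page.get('Contents', []))
        return contents
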
Update:

@Z.Wei commented:

Dig into this a little to deal with the weird bisect function. We may just use if path not in object_keys:?

I think this is an interesting question that deserves an answer update rather than getting lost in the comments.

Answer:

No, if path not in object_keys would perform a linear search, which is O(n). bisect_* performs a binary search (the list has to be sorted), which is O(log(n)).

Most of the time you will be dealing with enough objects that sorting once and binary searching is faster than just using the in keyword.

Take into account that you must check every path in the source against every path in the destination, which makes using in O(m * n), where m is the number of objects in the source and n the number in the destination. Using bisect, the lookups cost O(m * log(n)) after an O(n * log(n)) sort.
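
For reference, here is the bisect-based membership test in isolation; a minimal self-contained sketch of the idiom used in sync above:

from bisect import bisect_left
from typing import List

def contains(sorted_keys: List[str], key: str) -> bool:
    # Binary search: an O(log n) membership test on a sorted list.
    index = bisect_left(sorted_keys, key)
    return index < len(sorted_keys) and sorted_keys[index] == key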

But ...

If I think about it, you could use sets to make the algorithm even faster (and simpler, hence more Pythonic):

def sync(self, source: str, dest: str) -> None:

    # Local paths
    paths = set(self.list_source_objects(source_folder=source))

    # Getting the keys (remote s3 paths).
    objects = self.list_bucket_objects(dest)
    object_keys = {obj['Key'] for obj in objects}

    # Compute the set difference: what we have in paths that does
    # not exist in object_keys.
    to_sync = paths - object_keys

    source_path = Path(source)
    for path in to_sync:
        self._s3.upload_file(str(source_path / path),
                             Bucket=dest, Key=path)

Lookups in sets are O(1) on average, so using sets the whole thing is O(m + n), way faster than the previous O(m * log(n)).

Further improvements

The code could be improved even more by making list_bucket_objects and list_source_objects return sets instead of lists; a sketch of the latter follows.
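
A minimal sketch of list_source_objects returning a set (the same logic as above, condensed into a set comprehension; Set comes from the typing module):

    @staticmethod
    def list_source_objects(source_folder: str) -> Set[str]:
        # Collect every file (not directory) under source_folder as a
        # relative, forward-slash path suitable for use as an S3 key.
        root = Path(source_folder)
        return {p.relative_to(root).as_posix()
                for p in root.rglob("*") if p.is_file()}

With both methods returning sets, the set(...) conversions in sync above become unnecessary.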

answered Oct 16 '22 by Raydel Miranda