I've noticed there is no API in boto3 for the "sync" operation that you can perform through the command line.
So,
How do I sync a local folder to a given bucket using boto3?
I've just implemented a simple class for this matter. I'm posting it here hoping it helps anyone with the same issue.

You could modify S3Sync.sync in order to take file size into account (a sketch of that follows right after the code below).
from bisect import bisect_left
from pathlib import Path

import boto3


class S3Sync:
    """
    Class that holds the operations needed to synchronize local dirs to a given bucket.
    """

    def __init__(self):
        self._s3 = boto3.client('s3')

    def sync(self, source: str, dest: str) -> None:
        """
        Sync source to dest; every element that exists in source
        but not in dest will be copied to dest.

        No element will be deleted.

        :param source: Source folder.
        :param dest: Destination bucket.

        :return: None
        """
        paths = self.list_source_objects(source_folder=source)
        objects = self.list_bucket_objects(dest)

        # Get the keys and sort them so we can perform a binary search
        # each time we want to check whether a path is already there.
        object_keys = [obj['Key'] for obj in objects]
        object_keys.sort()
        object_keys_length = len(object_keys)

        for path in paths:
            # Binary search: the path is missing if the insertion point is
            # past the end of the list or the key at that point differs.
            index = bisect_left(object_keys, path)
            if index == object_keys_length or object_keys[index] != path:
                # Path not found in object_keys, so it has to be synced.
                self._s3.upload_file(str(Path(source).joinpath(path)),
                                     Bucket=dest, Key=path)

    def list_bucket_objects(self, bucket: str) -> [dict]:
        """
        List all objects for the given bucket.

        :param bucket: Bucket name.

        :return: A [dict] containing the elements in the bucket.

        Example of a single object:

            {
                'Key': 'example/example.txt',
                'LastModified': datetime.datetime(2019, 7, 4, 13, 50, 34, 893000, tzinfo=tzutc()),
                'ETag': '"b11564415be7f58435013b414a59ae5c"',
                'Size': 115280,
                'StorageClass': 'STANDARD',
                'Owner': {
                    'DisplayName': 'webfile',
                    'ID': '75aa57f09aa0c8caeab4f8c24e99d10f8e7faeebf76c078efc7c6caea54ba06a'
                }
            }
        """
        try:
            contents = self._s3.list_objects(Bucket=bucket)['Contents']
        except KeyError:
            # No 'Contents' key: the bucket is empty.
            return []
        else:
            return contents

    @staticmethod
    def list_source_objects(source_folder: str) -> [str]:
        """
        :param source_folder: Root folder for the resources you want to list.

        :return: A [str] containing the relative paths of the files.

        Example:

            /tmp
                - example
                    - file_1.txt
                    - some_folder
                        - file_2.txt

            >>> sync.list_source_objects("/tmp/example")
            ['file_1.txt', 'some_folder/file_2.txt']
        """
        path = Path(source_folder)
        paths = []
        for file_path in path.rglob("*"):
            if file_path.is_dir():
                continue
            str_file_path = str(file_path)
            str_file_path = str_file_path.replace(f'{str(path)}/', "")
            paths.append(str_file_path)
        return paths


if __name__ == '__main__':
    sync = S3Sync()
    sync.sync("/temp/some_folder", "some_bucket_name")
@Z.Wei commented:
Dig into this a little to deal with the weird bisect function. We may just use if path not in object_keys:?
I think this is an interesting question that is worth an answer update rather than getting lost in the comments.
Answer:
No. if path not in object_keys would perform a linear search, which is O(n). The bisect_* functions perform a binary search (the list has to be sorted first), which is O(log(n)).

Most of the time you will be dealing with enough objects for sorting plus binary searching to be faster than just using the in keyword. Take into account that you must check every path in the source against every key in the destination, so using in makes the whole loop O(m * n), where m is the number of objects in the source and n the number in the destination. With bisect the whole thing is O((n + m) * log(n)): sorting the n keys once is O(n * log(n)), and each of the m lookups is O(log(n)).
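For clarity, here is the membership test with bisect in isolation (a small standalone sketch; the contains helper is just illustrative, and the list must already be sorted). Note that checking only whether the insertion point equals the list length is not enough; you also have to compare the key found at that point:

from bisect import bisect_left

def contains(sorted_keys: list, path: str) -> bool:
    # Binary search: find the insertion point, then confirm that the
    # element at that position is actually the one we are looking for.
    index = bisect_left(sorted_keys, path)
    return index < len(sorted_keys) and sorted_keys[index] == path

object_keys = sorted(['a/1.txt', 'b/2.txt', 'c/3.txt'])
print(contains(object_keys, 'b/2.txt'))   # True
print(contains(object_keys, 'b/9.txt'))   # False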
Thinking about it some more, you could use sets to make the algorithm even faster (and simpler, hence more pythonic):

    def sync(self, source: str, dest: str) -> None:
        # Local paths.
        paths = set(self.list_source_objects(source_folder=source))

        # Get the keys (remote S3 paths).
        objects = self.list_bucket_objects(dest)
        object_keys = {obj['Key'] for obj in objects}

        # Compute the set difference: what we have in paths that does
        # not exist in object_keys.
        to_sync = paths - object_keys

        source_path = Path(source)
        for path in to_sync:
            self._s3.upload_file(str(source_path / path),
                                 Bucket=dest, Key=path)
Lookup in a set is O(1) on average, so using sets the whole thing is roughly O(n + m), faster than the previous O((n + m) * log(n)).

The code could be improved even more by making the methods list_bucket_objects and list_source_objects return sets instead of lists.
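A minimal sketch of that change (the function names bucket_keys and source_paths are my own, and so is the use of a paginator, which is not in the class above; a single list_objects call returns at most 1000 keys, so paginating is what makes this work for larger buckets):

from pathlib import Path
import boto3

s3 = boto3.client('s3')

def bucket_keys(bucket: str) -> set:
    """Set of every key in the bucket (paginated, so more than 1000 objects work)."""
    keys = set()
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get('Contents', []):
            keys.add(obj['Key'])
    return keys

def source_paths(source_folder: str) -> set:
    """Set of file paths under source_folder, relative to it."""
    root = Path(source_folder)
    return {str(p.relative_to(root)) for p in root.rglob("*") if p.is_file()}

# With both helpers returning sets, the sync body reduces to:
#     for path in source_paths(source) - bucket_keys(dest):
#         s3.upload_file(str(Path(source) / path), Bucket=dest, Key=path)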