
Amazon S3 partitioning of files: best practices

Hopefully a simple question - apologies if it's already been answered but nothing came up in search.

On S3, is it better to organize images into smaller subdirectories, or just keep them all in one directory? In a typical filesystem one would namespace the images in directories to improve performance, since a flat structure with thousands of images in one directory doesn't normally perform well. Is this also the case on Amazon S3?

I can put all user images into a users folder, all post images into a posts folder, etc. OR I can put user images into folders like users/{userId} to avoid having thousands of images in one users folder.
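To make the two layouts concrete, here is a minimal Python sketch. The function names and the example file names are made up for illustration. S3 itself only sees the resulting key string, and the slashes are simply part of that key.

```python
def flat_key(kind: str, filename: str) -> str:
    """Option 1: one shared prefix per image type, e.g. users/<filename>."""
    return f"{kind}/{filename}"


def per_owner_key(kind: str, owner_id: int, filename: str) -> str:
    """Option 2: one prefix per owner, e.g. users/<userId>/<filename>."""
    return f"{kind}/{owner_id}/{filename}"


# Hypothetical example values, only to show the resulting key shapes.
print(flat_key("users", "42-avatar.jpg"))
print(per_owner_key("users", 42, "avatar.jpg"))
```

With the flat layout, file names have to be unique across all users, which is why the example folds the user ID into the name.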

Marc asked Mar 10 '12 16:03



1 Answer

Update 2018-07

It is no longer necessary to account for performance when devising a partitioning scheme for your use case; see my InfoQ summary, Amazon S3 Increases Request Rate Performance and Drops Randomized Prefix Requirement, for details:

Amazon Web Services (AWS) recently announced significantly increased S3 request rate performance and the ability to parallelize requests to scale to the desired throughput. Notably this performance increase also "removes any previous guidance to randomize object prefixes" and enables the use of "logical or sequential naming patterns in S3 object naming without any performance implications".
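For reference, the randomized-prefix guidance that this announcement retires was commonly implemented along the following lines. This is a minimal sketch rather than code from AWS: the function name hashed_key, the choice of SHA-256, and the four-character prefix length are all assumptions for illustration. As noted above, it is no longer needed for performance.

```python
import hashlib


def hashed_key(key: str, prefix_len: int = 4) -> str:
    """Prepend a short hex digest of the key so that otherwise
    sequential keys are spread across the key space.

    Hypothetical helper: the hash function and prefix length are
    illustrative choices, not values mandated by S3.
    """
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{key}"


# The same logical key always maps to the same prefixed key,
# so the object can still be located later.
print(hashed_key("users/42/avatar.jpg"))
```

The downside of this scheme was that listing objects by their logical prefix (for example, everything under users/) became awkward, which is part of why the relaxed guidance is welcome.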

Update 2013-09

The information in the referenced link, while still largely accurate, has been supplanted by a newer document, S3 Request Rate and Performance Considerations.


Initial answer

This is a problem with Amazon S3 as well, albeit only for significant storage requirements; see Amazon S3 Performance Tips & Tricks for a detailed answer, including strategies for partitioning your object space.

Steffen Opel answered Sep 18 '22 08:09