Hopefully a simple question - apologies if it's already been answered but nothing came up in search.
On S3, is it better to organize images into smaller subdirectories, or just keep them all in one directory? In a typical filesystem one would namespace the images in directories to improve performance. A flat structure with thousands of images in one directory doesn't normally perform well. Is this the case on Amazon S3?
I can put all user images into a users folder, all post images into a posts folder, etc. OR I can put user images into folders like users/{userId} to avoid having thousands of images in one users folder.
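To make the two options concrete, here is a minimal sketch of how the object keys would look under each layout. The helper names `flat_key` and `per_owner_key` and the file names are hypothetical, chosen only for illustration; they are not part of any AWS SDK.

```python
def flat_key(kind: str, filename: str) -> str:
    """Option 1: every image of one kind shares a single prefix, e.g. users/."""
    return f"{kind}/{filename}"


def per_owner_key(kind: str, owner_id: int, filename: str) -> str:
    """Option 2: one extra prefix level per owner, e.g. users/{userId}/."""
    return f"{kind}/{owner_id}/{filename}"


# Both results are plain strings; S3 stores each one as a single object key.
print(flat_key("users", "avatar-42.jpg"))
print(per_owner_key("users", 42, "avatar.jpg"))
```

Either way the key is just a string, so the choice mostly comes down to how you want to list and manage the objects later.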
The total volume of data and the number of objects you can store are unlimited. Also, the documentation states there is no performance difference between using a single bucket or multiple buckets, so I guess both options 1 and 2 would be suitable for you.
When you configure the Amazon S3 connector to read in parallel, each node can read part of the same file or each node can read one or more different files. By default, the Amazon S3 file that you specify in the File name property is range partitioned. Each node reads approximately the same number of rows from the file.
How partitioning works: folders where data is stored on S3, which are physical entities, are mapped to partitions, which are logical entities, in a metadata store such as Glue Data Catalog or Hive Metastore.
Using "folders" has no performance impact on S3, either way. It doesn't make it faster, and it doesn't make it slower. The value of delimiting your object keys with / is in organization, both machine-friendly and human-friendly.
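As a rough illustration of why "folders" are only an organizational convenience, the sketch below simulates in plain Python how a listing with a prefix and a delimiter groups flat keys. It does not call S3; `list_keys` is a hypothetical stand-in that loosely mirrors the way an S3 listing separates matching objects from common prefixes, and the sample keys are made up.

```python
def list_keys(keys, prefix="", delimiter="/"):
    """Group flat keys the way a prefix + delimiter listing would.

    Returns (contents, common_prefixes): keys directly under the prefix,
    and the "folder" names rolled up at the next delimiter.
    """
    contents, common_prefixes = [], set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter and delimiter in rest:
            # Roll everything below the next delimiter up into one entry.
            common_prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            contents.append(key)
    return contents, sorted(common_prefixes)


# The bucket itself holds only a flat set of key strings.
keys = [
    "users/1/a.jpg",
    "users/1/b.jpg",
    "users/2/c.jpg",
    "posts/9/d.jpg",
    "readme.txt",
]

print(list_keys(keys))               # top level
print(list_keys(keys, "users/"))     # one "folder" per user
print(list_keys(keys, "users/1/"))   # objects for user 1
```

The "folders" appear only when a delimiter is applied at listing time; nothing about the stored objects changes.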
It is no longer necessary to account for performance when devising a partitioning scheme for your use case; see my InfoQ summary Amazon S3 Increases Request Rate Performance and Drops Randomized Prefix Requirement for details:
Amazon Web Services (AWS) recently announced significantly increased S3 request rate performance and the ability to parallelize requests to scale to the desired throughput. Notably this performance increase also "removes any previous guidance to randomize object prefixes" and enables the use of "logical or sequential naming patterns in S3 object naming without any performance implications".
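For a sense of scale, here is a small back-of-the-envelope sketch. The per-prefix figures (at least 3,500 write and 5,500 read requests per second) come from the AWS announcement itself rather than from the excerpt quoted above, and the helper `prefixes_needed` is a hypothetical name used only for this illustration.

```python
import math

# Per-prefix request rates from the AWS announcement (stated as "at least").
GET_RPS_PER_PREFIX = 5500   # GET/HEAD requests per second
PUT_RPS_PER_PREFIX = 3500   # PUT/COPY/POST/DELETE requests per second


def prefixes_needed(target_rps: int, per_prefix_rps: int) -> int:
    """Smallest number of prefixes whose combined rate covers target_rps."""
    return max(1, math.ceil(target_rps / per_prefix_rps))


# A workload of 20,000 reads per second spread evenly across prefixes.
print(prefixes_needed(20_000, GET_RPS_PER_PREFIX))
# A workload of 1,000 writes per second fits within a single prefix.
print(prefixes_needed(1_000, PUT_RPS_PER_PREFIX))
```

For a typical image-hosting application these rates are far above what a single prefix would see, which is why the naming scheme can follow whatever is most convenient.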
The information in the referenced link, while still largely accurate, has been supplanted by a newer document, S3 Request Rate and Performance Considerations.
This is a problem with Amazon S3 as well, albeit only for significant storage requirements. See Amazon S3 Performance Tips & Tricks for a detailed answer, including strategies for partitioning your object space.