Performance of listing S3 bucket with prefix and delimiter

Tags:

According to the listing documentation it is possible to treat a large navigate number of keys as though they were hierarchial. I am planning to store a large number of keys (let's say a few hundred million), distributed over a sensible-sized 'hierarchy'.

What is the performance of using a prefix and delimiter? Does it require a full enumeration of keys at the S3 end, and therefore be an O(n) operation? I have no idea whether keys are stored in a big hash table, or whether they have indexing data structures, or if they're stored in a tree or what.

I want to avoid the situation where I have a very large number of keys and navigating the 'hierarchy' suddenly becomes difficult.

So if I have the following keys:

abc/def/ghi/0
abc/def/ghi/1
abc/def/ghi/...
abc/def/ghi/100,000,000,000

Will it affect the speed of the query Delimiter='/, Prefix='abc/def'?

722

asked Aug 13 '16 08:08

Joe

2 Answers

Aside from the Request Rate and Performance Considerations document that Sandeep referenced (which is not applicable to your use case), AWS hasn't publicized very much regarding S3 performance. It's probably private intellectual property. So I doubt you'll find very much information unless you can get it somehow from AWS directly.

However, some things to keep in mind:

Amazon S3 is built for massive scale. Millions of companies are using S3 with millions of keys in millions of buckets.
AWS promotes the prefix + delimiter as a very valid use case.
There are common data structures and algorithms used in computer science that AWS is probably using behind the scenes to efficiently retrieve keys. One such data structure is called a Trie or Prefix Tree.

Based on all of the above, chances are that it's much better than an order O(n) algorithm when you retrieve listing of keys. I think you are safe to use prefixes and delimiters for your hierarchy.

183

answered Sep 23 '22 22:09

Matt Houser

As long as you are not using a continuous sequence (such as date 2016-13-08, 2016-13-09 and so on) in the prefix you shouldn't face any problem. If your keys are auto-generated as a continuous sequence then prepend a randomly generated hash key to the keys (aidk-2016-13-08, ujlk-2016-13-09). The amazon documentation says:

Amazon S3 maintains an index of object key names in each AWS region. Object keys are stored in UTF-8 binary ordering across multiple partitions in the index. The key name dictates which partition the key is stored in. Using a sequential prefix, such as timestamp or an alphabetical sequence, increases the likelihood that Amazon S3 will target a specific partition for a large number of your keys, overwhelming the I/O capacity of the partition. If you introduce some randomness in your key name prefixes, the key names, and therefore the I/O load, will be distributed across more than one partition.

http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html

answered Sep 24 '22 22:09

Sandeep Kumar

Related questions
                            
                                Serving private content from CloudFront with Signed Cookies
                            
                                CloudFormation SQS Queue Redrive policy dependency on a DLQ created
                            
                                How to get a response from NSURLSessionDownloadTask downloadTaskWithRequest
                            
                                Failing to deploy a Laravel app to EC2
                            
                                Change hostname for AWS ec2 instance
                            
                                How can I detect if a tcp connection has been forwarded from a ssl connection?
                            
                                Difference between S3 public files and static websites
                            
                                How to rotate logs in AWS CloudWatch?
                            
                                Flask-SQLAlchemy ssl-connection with AWS RDS error
                            
                                Load Balancer $_SERVER['REMOTE_ADDR'] Not working
                            
                                Export from Amazon Redshift into an RDS MySQL database
                            
                                Do I have have to use Amazon Route 53's DNS Service (and pay for it), if I register and manage my domain with them?
                            
                                loading Redshift from S3 (with partitions)
                            
                                Docker container keeps growing
                            
                                AWS API Gateway accept Content-type: application/xml
                            
                                DynamoDB count operation capacity units consumption
                            
                                Install pgAgent on AWS RDS for Postgres
                            
                                AWS IAM Policy to allow user to create IAM Roles (from Management Console & AWS CLI)
                            
                                AWS Lambda - callback("some error type") equivalent in Java 8
                            
                                Migrating from SQL Server to AWS Aurora

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Performance of listing S3 bucket with prefix and delimiter

Tags:

amazon-web-services

amazon-s3

Joe

People also ask

2 Answers

Matt Houser

Sandeep Kumar

Recent Activity

Donate For Us