
Performance of listing S3 bucket with prefix and delimiter

According to the listing documentation it is possible to treat a large number of keys as though they were hierarchical. I am planning to store a large number of keys (say, a few hundred million), distributed over a sensibly sized 'hierarchy'.

What is the performance of listing with a prefix and delimiter? Does it require a full enumeration of keys on the S3 side, making it an O(n) operation? I have no idea whether keys are stored in a big hash table, indexed by auxiliary data structures, stored in a tree, or something else entirely.

I want to avoid the situation where I have a very large number of keys and navigating the 'hierarchy' suddenly becomes difficult.

So if I have the following keys:

  • abc/def/ghi/0
  • abc/def/ghi/1
  • abc/def/ghi/...
  • abc/def/ghi/100,000,000,000

Will it affect the speed of a query with Delimiter='/', Prefix='abc/def'?
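For reference, the grouping that Prefix and Delimiter perform can be sketched in plain Python. This only simulates the observable semantics of a ListObjects call over a sorted key list; it says nothing about how S3 implements it internally:

```python
def list_objects(keys, prefix, delimiter):
    """Simulate S3's Prefix/Delimiter grouping over a sorted key list.

    Returns (contents, common_prefixes): keys sitting directly under the
    prefix, and the rolled-up "subdirectories" ending at the next delimiter.
    """
    contents, common_prefixes = [], []
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        i = rest.find(delimiter)
        if i == -1:
            contents.append(key)          # a key at this "level"
        else:
            cp = prefix + rest[:i + 1]    # roll up into one CommonPrefix
            if cp not in common_prefixes:
                common_prefixes.append(cp)
    return contents, common_prefixes

keys = ["abc/def/ghi/0", "abc/def/ghi/1", "abc/readme", "abc/xyz/0"]
print(list_objects(keys, "abc/", "/"))
# → (['abc/readme'], ['abc/def/', 'abc/xyz/'])
```

Note that however S3 computes this internally, many deep keys under `abc/def/ghi/` collapse into a single `abc/def/ghi/` entry in the response.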

Joe asked Aug 13 '16



2 Answers

Aside from the Request Rate and Performance Considerations document that Sandeep referenced (which is not applicable to your use case), AWS hasn't published much about S3's internal performance; it's probably proprietary. So you're unlikely to find details unless you can get them from AWS directly.

However, some things to keep in mind:

  1. Amazon S3 is built for massive scale. Millions of companies are using S3 with millions of keys in millions of buckets.
  2. AWS promotes the prefix + delimiter as a very valid use case.
  3. There are common data structures and algorithms used in computer science that AWS is probably using behind the scenes to efficiently retrieve keys. One such data structure is called a Trie or Prefix Tree.

Based on all of the above, chances are that retrieving a listing of keys is much better than O(n). I think you are safe to use prefixes and delimiters for your hierarchy.
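To make the complexity argument concrete: if keys are kept in sorted order (as AWS's own documentation suggests its index does), a prefix listing only needs a seek to the first matching key plus a scan over the k matches, roughly O(log n + k) rather than O(n). A sketch of that data-structure argument using Python's `bisect` (an illustration, not S3's actual implementation):

```python
import bisect

def keys_with_prefix(sorted_keys, prefix):
    """Find all keys starting with `prefix` in a sorted list.

    bisect_left seeks to the first candidate in O(log n); we then scan
    only the matching keys instead of enumerating all n keys.
    """
    start = bisect.bisect_left(sorted_keys, prefix)
    out = []
    for i in range(start, len(sorted_keys)):
        if not sorted_keys[i].startswith(prefix):
            break                  # sorted order: no later key can match
        out.append(sorted_keys[i])
    return out

index = sorted(f"abc/def/ghi/{i}" for i in range(5)) + ["abc/zzz"]
print(keys_with_prefix(index, "abc/def/"))  # the five abc/def/ghi/* keys
```

The same shape of lookup falls out of a trie, a B-tree, or any sorted index, which is why the total number of keys in the bucket shouldn't dominate the cost of listing one prefix.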

Matt Houser answered Sep 23 '22


As long as you are not using a continuous sequence (such as dates 2016-08-13, 2016-08-14 and so on) in the prefix, you shouldn't face any problem. If your keys are auto-generated as a continuous sequence, then prepend a randomly generated hash key to them (aidk-2016-08-13, ujlk-2016-08-14). The Amazon documentation says:

Amazon S3 maintains an index of object key names in each AWS region. Object keys are stored in UTF-8 binary ordering across multiple partitions in the index. The key name dictates which partition the key is stored in. Using a sequential prefix, such as timestamp or an alphabetical sequence, increases the likelihood that Amazon S3 will target a specific partition for a large number of your keys, overwhelming the I/O capacity of the partition. If you introduce some randomness in your key name prefixes, the key names, and therefore the I/O load, will be distributed across more than one partition.

http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
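That randomized-prefix advice can be sketched like this. Deriving the prefix from a hash of the key (rather than true randomness) keeps it reproducible for later reads; the four-character length is an arbitrary choice for illustration, not something the documentation prescribes:

```python
import hashlib

def randomized_key(natural_key):
    """Prepend a short, deterministic hash of the key so sequential names
    ("2016-08-13", "2016-08-14", ...) spread across index partitions
    instead of clustering on one."""
    h = hashlib.md5(natural_key.encode()).hexdigest()[:4]
    return f"{h}-{natural_key}"

for day in ("2016-08-13", "2016-08-14", "2016-08-15"):
    print(randomized_key(day))
```

Note that this guidance comes from the older request-rate document; it trades away the ability to list keys by their natural (e.g. date) order, so apply it only if you actually expect partition-level hot spots.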

Sandeep Kumar answered Sep 24 '22