I have a project where there will be about 80 million objects in an S3 bucket. Every day, I will be deleting about 4 million objects and adding 4 million new ones. The object keys will follow a pseudo-directory structure:
/012345/0123456789abcdef0123456789abcdef
For deletion, I will need to list all objects with the prefix 012345/ and then delete them. I am concerned about the time this LIST operation will take. While it seems clear that S3's access time for an individual object does not increase with bucket size, I haven't found anything definitive saying that a LIST operation over 80 million objects, searching for the 10 objects that share a given prefix, will remain fast in such a large bucket.
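For context, here is roughly what the daily cleanup would look like, as a sketch using boto3 (the bucket name is a placeholder):

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"  # placeholder bucket name

def delete_prefix(prefix):
    """List every object under `prefix` and delete each page in one batch.

    list_objects_v2 returns at most 1,000 keys per page, which matches the
    1,000-key limit of a single DeleteObjects request.
    """
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            s3.delete_objects(Bucket=BUCKET, Delete={"Objects": keys, "Quiet": True})

delete_prefix("012345/")
```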
In a side comment on a question about the maximum number of objects that can be stored in a bucket (from 2008), one answer noted:
In my experience, LIST operations do take (linearly) longer as object count increases, but this is probably a symptom of the increased I/O required on the Amazon servers, and down the wire to your client.
From the Amazon S3 documentation:
There is no limit to the number of objects that can be stored in a bucket and no difference in performance whether you use many buckets or just a few. You can store all of your objects in a single bucket, or you can organize them across several buckets.
While I am inclined to believe the Amazon documentation, it isn't entirely clear which operations that statement refers to.
Before committing to this expensive plan, I would like to know definitively whether LIST operations that search by prefix remain fast when a bucket contains millions of objects. If anyone has real-world experience with such large buckets, I would love to hear your input.
You can send 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per prefix in an Amazon S3 bucket. There are no limits to the number of prefixes that you can have in your bucket.
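Because those limits apply per prefix, the daily deletes can also be fanned out across prefixes to raise aggregate throughput. A rough sketch of that idea (boto3 assumed; the bucket name, prefix list, and worker count are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"                           # placeholder
PREFIXES = ["012345/", "012346/", "012347/"]   # prefixes scheduled for deletion

def purge(prefix):
    # Each worker handles a single prefix, so it stays within that
    # prefix's own request-rate allowance.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        keys = [{"Key": o["Key"]} for o in page.get("Contents", [])]
        if keys:
            s3.delete_objects(Bucket=BUCKET, Delete={"Objects": keys, "Quiet": True})

with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(purge, PREFIXES))
```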
As AWS notes, “If you store 10,000,000 objects with Amazon S3, you can on average expect to incur a loss of a single object once every 10,000 years.”
Although S3 bucket names are globally unique, each bucket is stored in a Region that you select when you create the bucket. To optimize performance, we recommend that you access the bucket from Amazon EC2 instances in the same AWS Region when possible. This helps reduce network latency and data transfer costs.
Upload an object in a single operation using the AWS SDKs, REST API, or AWS CLI: with a single PUT operation, you can upload a single object up to 5 GB in size.
Upload a single object using the Amazon S3 Console: with the console, you can upload a single object up to 160 GB in size.
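For what it's worth, a single-operation upload through the SDK might look like this sketch (boto3 assumed; the bucket, key, and file names are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# A single PutObject call covers objects up to 5 GB; anything larger
# needs multipart upload (boto3's upload_file handles that automatically).
with open("local-file.bin", "rb") as f:
    s3.put_object(
        Bucket="my-bucket",
        Key="012345/0123456789abcdef0123456789abcdef",
        Body=f,
    )
```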
Prefix searches are fast if you've chosen the prefixes correctly. Here's an explanation: https://cloudnative.io/blog/2015/01/aws-s3-performance-tuning/
I've never seen a problem, but why would you ever list a million files just to pull a few out of the list? It's not an S3 performance problem; it's likely due to the call itself just taking longer.
Why not store the file names in a database, index them, and then query from there? That would be a better solution, I think.
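To illustrate that suggestion, a minimal sketch using SQLite (the database, table, and column names are just placeholders):

```python
import sqlite3

conn = sqlite3.connect("s3_keys.db")
conn.execute("CREATE TABLE IF NOT EXISTS objects (prefix TEXT, key TEXT)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_prefix ON objects (prefix)")

# Record each key as it is uploaded (prefix = first path segment).
def record(key):
    prefix = key.split("/", 1)[0]
    conn.execute("INSERT INTO objects (prefix, key) VALUES (?, ?)", (prefix, key))
    conn.commit()

# Later, find the keys to delete without listing the bucket at all.
def keys_for(prefix):
    rows = conn.execute("SELECT key FROM objects WHERE prefix = ?", (prefix,))
    return [r[0] for r in rows]
```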