 

S3 performance for LIST by prefix with millions of objects in a single bucket

I have a project where there will be about 80 million objects in an S3 bucket. Every day, I will be deleting about 4 million and adding 4 million. The object names will be in a pseudo directory structure:

/012345/0123456789abcdef0123456789abcdef

For deletion, I will need to list all objects with a prefix of 012345/ and then delete them. I am concerned about the time this LIST operation will take. While it seems clear that S3's access time for an individual object does not increase with the number of objects in the bucket, I haven't found anything definitive that says a LIST operation over 80 million objects, searching for 10 objects that all have the same prefix, will remain fast in such a large bucket.
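To make the workload concrete, here is a minimal sketch of the list-then-delete step using boto3 (the question doesn't name an SDK, and the bucket name below is a placeholder):

```python
import boto3

# Placeholder names; the question doesn't specify a bucket.
BUCKET = "example-asset-bucket"
PREFIX = "012345/"

s3 = boto3.client("s3")

# list_objects_v2 returns at most 1,000 keys per response, so use a
# paginator to walk every key under the prefix.
keys = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        keys.append(obj["Key"])

# delete_objects accepts at most 1,000 keys per request, so delete in batches.
for i in range(0, len(keys), 1000):
    batch = [{"Key": k} for k in keys[i : i + 1000]]
    s3.delete_objects(Bucket=BUCKET, Delete={"Objects": batch})
```

Since each LIST response and each batch delete is capped at 1,000 keys, clearing a 4-million-key prefix works out to roughly 4,000 LIST calls plus 4,000 batch deletes, whatever the answer to the performance question turns out to be.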

A side comment on a 2008 question about the maximum number of objects that can be stored in a bucket says:

In my experience, LIST operations do take (linearly) longer as object count increases, but this is probably a symptom of the increased I/O required on the Amazon servers, and down the wire to your client.

From the Amazon S3 documentation:

There is no limit to the number of objects that can be stored in a bucket and no difference in performance whether you use many buckets or just a few. You can store all of your objects in a single bucket, or you can organize them across several buckets.

While I am inclined to believe the Amazon documentation, it isn't entirely clear which operations that statement refers to.

Before committing to this expensive plan, I would like to definitively know if LIST operations when searching by prefix remain fast when buckets contain millions of objects. If someone has real-world experience with such large buckets, I would love to hear your input.

Asked Jul 31 '14 by Brad

People also ask

What is the maximum GET requests per second based on bucket prefixes?

You can send 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per prefix in an Amazon S3 bucket. There are no limits to the number of prefixes that you can have in your bucket.

How often can you expect to lose data if you store 10000000 objects in S3?

As AWS notes, “If you store 10,000,000 objects with Amazon S3, you can on average expect to incur a loss of a single object once every 10,000 years.”

What is the best way to get better performance for storing several files in S3?

Although S3 bucket names are globally unique, each bucket is stored in a Region that you select when you create the bucket. To optimize performance, we recommend that you access the bucket from Amazon EC2 instances in the same AWS Region when possible. This helps reduce network latency and data transfer costs.

Which is the maximum S3 object size for upload in a single PUT operation?

Upload an object in a single operation using the AWS SDKs, REST API, or AWS CLI—With a single PUT operation, you can upload a single object up to 5 GB in size. Upload a single object using the Amazon S3 Console—With the Amazon S3 Console, you can upload a single object up to 160 GB in size.
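As a rough illustration of those two upload paths with boto3 (bucket and file names below are placeholders): put_object issues a single PUT, while upload_file switches to multipart above a configurable threshold, which is how uploads larger than the 5 GB single-PUT limit are handled programmatically.

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Single PUT: one request, suitable for objects up to 5 GB.
with open("small.bin", "rb") as f:
    s3.put_object(Bucket="example-asset-bucket", Key="small.bin", Body=f)

# upload_file automatically uses multipart upload once the file exceeds
# the threshold, so it is not bound by the 5 GB single-PUT limit.
config = TransferConfig(multipart_threshold=64 * 1024 * 1024)  # 64 MB
s3.upload_file("large.bin", "example-asset-bucket", "large.bin", Config=config)
```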


2 Answers

Prefix searches are fast if you've chosen the prefixes correctly. Here's an explanation: https://cloudnative.io/blog/2015/01/aws-s3-performance-tuning/
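To show what a prefix-scoped listing looks like in practice, here is a small boto3 sketch (the bucket name is a placeholder). Passing a Delimiter makes S3 group keys by their first path segment and return those segments as CommonPrefixes instead of listing every individual object:

```python
import boto3

s3 = boto3.client("s3")

# With Delimiter="/", keys are grouped by their first path segment and
# returned as CommonPrefixes rather than as individual objects.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="example-asset-bucket", Delimiter="/"):
    for cp in page.get("CommonPrefixes", []):
        print(cp["Prefix"])  # e.g. "012345/"
```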

Answered Sep 21 '22 by r3m0t


I've never seen a problem, but why would you ever list a million files just to pull a few files out of the list? It isn't really an S3 performance issue; it's likely just that the call takes longer because it has to return that many results.

Why not store the file names in a database, index them, and query from there? That would be a better solution, I think.
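A minimal sketch of that idea using SQLite (the table and column names are made up for illustration): record each key as it is uploaded, and the deletion step becomes an indexed query instead of an S3 LIST.

```python
import sqlite3

conn = sqlite3.connect("objects.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS objects (prefix TEXT NOT NULL, key TEXT PRIMARY KEY)"
)
conn.execute("CREATE INDEX IF NOT EXISTS idx_prefix ON objects (prefix)")

# Record each key as it is uploaded to S3.
conn.execute(
    "INSERT OR IGNORE INTO objects (prefix, key) VALUES (?, ?)",
    ("012345", "012345/0123456789abcdef0123456789abcdef"),
)
conn.commit()

# At deletion time, an indexed query replaces the S3 LIST call.
rows = conn.execute(
    "SELECT key FROM objects WHERE prefix = ?", ("012345",)
).fetchall()
keys_to_delete = [r[0] for r in rows]
```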

Answered Sep 21 '22 by Paul Frederiksen