 

Add a random prefix to the key names to improve S3 performance?

Tags:

amazon-s3

I came across an exam question: "You expect this bucket to immediately receive over 150 PUT requests per second. What should the company do to ensure optimal performance?"

A) Amazon S3 will automatically manage performance at this scale.

B) Add a random prefix to the key names.

The correct answer was B, and I'm trying to figure out why. Can someone explain the significance of option B, and whether it's still true?
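To make option B concrete, here's a minimal sketch (my own illustration, not from the exam material) of what "add a random prefix to the key names" meant in practice: derive a short pseudo-random component, for example from a hash of the key, and put it at the front so sequential uploads don't share a common leading string.

```python
import hashlib

def randomized_key(original_key: str) -> str:
    """Prepend a short hash-derived prefix so uploads don't all
    share one leading string (the pre-2018 partitioning hint)."""
    prefix = hashlib.md5(original_key.encode()).hexdigest()[:4]
    return f"{prefix}/{original_key}"

# "2017/03/26/photo-0001.jpg" -> "<4 hex chars>/2017/03/26/photo-0001.jpg"
print(randomized_key("2017/03/26/photo-0001.jpg"))
```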

buildmaestro asked Mar 26 '17


People also ask

What is AWS S3 key prefix?

A key prefix is a string of characters at the beginning of an object's key name; it can be the complete path in front of the object name. For example, if an object (123.txt) is stored as Project/WordFiles/123.txt, its prefix might be "Project/WordFiles/".
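For illustration, here's how a prefix is typically used with the S3 API; a minimal boto3 sketch with made-up bucket and prefix names:

```python
import boto3  # assumes AWS credentials are configured in the environment

s3 = boto3.client("s3")

# List every object whose key begins with the given prefix.
resp = s3.list_objects_v2(Bucket="example-bucket", Prefix="Project/WordFiles/")
for obj in resp.get("Contents", []):
    print(obj["Key"])  # e.g. "Project/WordFiles/123.txt"
```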

How do I make my S3 bucket faster?

Spread your requests across multiple key name prefixes: S3 request rates scale per prefix, so reading from or writing to several prefixes in parallel multiplies the throughput a single prefix supports.

How do I maximize the read speed on Amazon S3?

You can increase your read or write performance by using parallelization. For example, if you create 10 prefixes in an Amazon S3 bucket to parallelize reads, you could scale your read performance to 55,000 read requests per second. Similarly, you can scale write operations by writing to multiple prefixes.
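A minimal sketch of that fan-out idea (my own illustration; bucket and prefix names are assumptions): one worker per prefix, so each prefix's independent request rate is used in parallel.

```python
import boto3
from concurrent.futures import ThreadPoolExecutor

s3 = boto3.client("s3")
BUCKET = "example-bucket"  # illustrative name
prefixes = [f"shard-{i}/" for i in range(10)]  # 10 first-level prefixes

def read_prefix(prefix: str) -> int:
    """Fetch every object under one prefix; returns total bytes read."""
    total = 0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"]
            total += len(body.read())
    return total

# One thread per prefix, so reads proceed against all prefixes at once.
with ThreadPoolExecutor(max_workers=len(prefixes)) as pool:
    print(sum(pool.map(read_prefix, prefixes)))
```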

What should be considered while naming an S3 bucket?

The following rules apply for naming buckets in Amazon S3: Bucket names must be between 3 (min) and 63 (max) characters long. Bucket names can consist only of lowercase letters, numbers, dots (.), and hyphens (-). Bucket names must begin and end with a letter or number.
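As a quick sanity check, here's a small sketch that encodes only the rules quoted above (the full AWS spec adds further restrictions, such as no IP-address-style names and no consecutive dots):

```python
import re

# 3-63 chars; lowercase letters, digits, dots, hyphens; must start
# and end with a letter or digit. Encodes only the rules quoted above.
BUCKET_NAME = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")

def is_valid_bucket_name(name: str) -> bool:
    return bool(BUCKET_NAME.match(name))

assert is_valid_bucket_name("my-data.bucket-01")
assert not is_valid_bucket_name("My_Bucket")  # uppercase and underscore
assert not is_valid_bucket_name("ab")         # shorter than 3 characters
```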


2 Answers

Per a July 17, 2018 AWS announcement, hashing or randomly prefixing S3 keys is no longer required to see improved performance: https://aws.amazon.com/about-aws/whats-new/2018/07/amazon-s3-announces-increased-request-rate-performance/

jpspesh answered Sep 29 '22


S3 partitions used to be determined by the first 6-8 characters of the object key.

This changed in mid-2018; see the announcement: https://aws.amazon.com/about-aws/whats-new/2018/07/amazon-s3-announces-increased-request-rate-performance/

But that is only a half-truth: prefixes (in the old sense) still matter.

S3 is not traditional "storage": each directory/filename is a separate object in a key/value object store, and the data has to be partitioned/sharded to scale to quadzillions of objects. So yes, the new sharding is kind of "automatic", but not really if you start a new process that writes to different subdirectories with crazy parallelism. Before S3 learns the new access pattern, you may run into throttling until it reshards/repartitions the data accordingly.

Learning new access patterns takes time. Repartitioning of the data takes time.

Things did improve in mid-2018 (roughly 10x throughput-wise for a new bucket with no statistics), but it's still not what it could be if the data were partitioned properly. To be fair, this may not apply to you if you don't have a ton of data, or if your access pattern isn't hugely parallel (e.g. a Hadoop/Spark cluster running hundreds of tasks in parallel against many TBs of data in the same S3 bucket).
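While S3 is still learning the pattern, the practical mitigation is to back off and retry on throttling responses (503 SlowDown). A minimal sketch: boto3's built-in adaptive retry mode handles the backoff for you (bucket and key names here are made up):

```python
import boto3
from botocore.config import Config

# Adaptive retries back off automatically when S3 returns 503 SlowDown,
# riding out the window before the bucket is repartitioned.
s3 = boto3.client(
    "s3",
    config=Config(retries={"max_attempts": 10, "mode": "adaptive"}),
)

s3.put_object(Bucket="example-bucket", Key="shard-3/file.bin", Body=b"...")
```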

TLDR:

"Old prefixes" still do matter. Write data to root of your bucket, and first-level directory there will determine "prefix" (make it random for example)

"New prefixes" do work, but not initially. It takes time to accommodate to load.

PS. Another approach: you can reach out to your AWS TAM (if you have one) and ask them to pre-partition a new S3 bucket if you expect a ton of data to flood it soon.

Tagar answered Sep 29 '22