
Concurrency in Amazon S3

I'm currently building a system where S3 will be used as a persistent hash-set (the S3 URL is inferred from the data) by lots of computers across the Internet. If two nodes store the same data, it will be stored under the same key and therefore won't be stored twice. When an object is removed I need to know whether any other node(s) are still using that data; in that case I won't remove it.

Right now I've implemented it by adding a list of the storing nodes as part of the data written to S3. So when a node stores the data, the following happens:

  1. Read the object from S3.
  2. Deserialize the object.
  3. Add the new node's id to the list of storing nodes.
  4. Serialize the new object (the data to store and the node-list).
  5. Write the serialized data to S3.

This creates a form of idempotent reference counting. Since requests over the Internet can be quite unreliable, I don't want to just count the number of storing nodes. That's why I'm storing a list instead of a counter (in case a node sends the same request more than once).
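
Roughly, that write path looks like the sketch below (boto3 with JSON serialization; the bucket name, key scheme, and record layout are just placeholders, and the code intentionally has the race discussed next):

```python
# Minimal sketch of the read-modify-write cycle above, using boto3 and JSON.
# Bucket name and record layout are hypothetical. This is NOT safe against
# concurrent writers -- that is exactly the problem described below.
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "my-hash-set-bucket"  # placeholder

def store(key: str, payload: bytes, node_id: str) -> None:
    try:
        # 1-2. Read and deserialize the existing object, if any.
        obj = s3.get_object(Bucket=BUCKET, Key=key)
        record = json.loads(obj["Body"].read())
    except s3.exceptions.NoSuchKey:
        # First writer for this key: the data itself plus an empty node list.
        record = {"data": payload.decode("utf-8"), "nodes": []}

    # 3. Add this node's id to the list of storing nodes (idempotent).
    if node_id not in record["nodes"]:
        record["nodes"].append(node_id)

    # 4-5. Serialize the record and write it back to S3.
    s3.put_object(Bucket=BUCKET, Key=key,
                  Body=json.dumps(record).encode("utf-8"))
```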

This approach works as long as two nodes aren't writing simultaneously. As far as I know, S3 doesn't provide any way to lock the object so that all five of these steps become atomic.

How would you solve this concurrency issue? I'm considering implementing some form of optimistic concurrency. How should I do that for S3? Should I perhaps use a completely different approach?
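
For reference, an optimistic-concurrency update would look roughly like the sketch below. It assumes an S3 API revision that supports conditional PUTs keyed on the object's ETag (If-Match), which did not exist when this question was asked; the bucket name and retry policy are placeholders.

```python
# Sketch of an optimistic-concurrency (compare-and-swap) update, assuming an
# S3 API that supports conditional PUTs via IfMatch (ETag). If the object
# changed between the read and the write, S3 rejects the PUT with
# 412 Precondition Failed and the loop retries.
import json
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "my-hash-set-bucket"  # placeholder

def add_node(key: str, node_id: str, max_retries: int = 5) -> None:
    for _ in range(max_retries):
        obj = s3.get_object(Bucket=BUCKET, Key=key)
        etag = obj["ETag"]
        record = json.loads(obj["Body"].read())

        if node_id in record["nodes"]:
            return  # already registered, nothing to do

        record["nodes"].append(node_id)
        try:
            # Only succeeds if the object still has the ETag we read.
            s3.put_object(Bucket=BUCKET, Key=key,
                          Body=json.dumps(record).encode("utf-8"),
                          IfMatch=etag)
            return
        except ClientError as err:
            if err.response["Error"]["Code"] != "PreconditionFailed":
                raise
            # Someone else wrote first -- re-read and try again.
    raise RuntimeError("gave up after too many conflicting writes")
```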

asked Jun 08 '11 by Yrlec

People also ask

How many connections can S3 handle?

Amazon S3 doesn't have any limits for the number of connections made to your bucket.

How many requests per second can S3 handle?

Amazon S3 now provides increased performance to support at least 3,500 requests per second to add data and 5,500 requests per second to retrieve data, which can save significant processing time for no additional charge.

What are the limitations of S3?

Individual Amazon S3 objects can range in size from a minimum of 0 bytes to a maximum of 5 terabytes. The largest object that can be uploaded in a single PUT is 5 gigabytes. For objects larger than 100 megabytes, customers should consider using the Multipart Upload capability.
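
As an illustration, boto3's transfer manager can handle the multipart split automatically once an upload crosses a configurable threshold; the bucket and file names below are placeholders:

```python
# Sketch of a multipart upload for large objects using boto3's transfer
# manager, which splits the file into parts once it exceeds the threshold.
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,  # switch to multipart above 100 MB
    multipart_chunksize=16 * 1024 * 1024,   # 16 MB parts
)

s3.upload_file("large-backup.bin", "my-bucket",
               "backups/large-backup.bin", Config=config)
```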

How do I maximize the read speed on Amazon S3?

You can increase your read or write performance by using parallelization. For example, if you create 10 prefixes in an Amazon S3 bucket to parallelize reads, you could scale your read performance to 55,000 read requests per second. Similarly, you can scale write operations by writing to multiple prefixes.
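
As a rough illustration, assuming objects are spread across prefixes such as shard-0/ through shard-9/ (the names here are made up), reads can be fanned out with one worker per prefix:

```python
# Sketch of parallel reads spread across several key prefixes, so each prefix
# contributes its own request-rate allowance. Bucket and prefixes are placeholders.
from concurrent.futures import ThreadPoolExecutor
import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"
PREFIXES = [f"shard-{i}/" for i in range(10)]

def read_prefix(prefix: str) -> int:
    """Download every object under one prefix; returns how many were read."""
    count = 0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        for item in page.get("Contents", []):
            s3.get_object(Bucket=BUCKET, Key=item["Key"])["Body"].read()
            count += 1
    return count

# One worker per prefix reads in parallel.
with ThreadPoolExecutor(max_workers=len(PREFIXES)) as pool:
    total = sum(pool.map(read_prefix, PREFIXES))
print(f"read {total} objects")
```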


2 Answers

Consider first separating the lock list from your (protected) data. Create a separate bucket specific to your data object to contain the lock list (the bucket name should be derived from the data object's name). Use individual files in that second bucket (one per node, with the object name derived from the node name). Nodes add a new object to the second bucket before accessing the protected data, and remove their object from the second bucket when they're finished.

This allows you to enumerate the second bucket to determine whether your data is locked, and it allows two nodes to update the lock list simultaneously without conflict.
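
A rough boto3 sketch of this scheme is below. All bucket and key names are made up, and for simplicity it keeps the per-object lock lists as key prefixes inside a single marker bucket rather than one bucket per data object; the idea is the same: each node only ever creates or deletes its own marker, so there is no shared object to race on.

```python
# Rough sketch of the per-node marker scheme described above.
# One "lock" bucket mirrors the data bucket; inside it there is one
# zero-byte object per (data key, node id) pair.
import boto3

s3 = boto3.client("s3")
DATA_BUCKET = "my-data-bucket"        # placeholder
LOCK_BUCKET = "my-data-bucket-locks"  # placeholder, derived from the data bucket

def marker_key(data_key: str, node_id: str) -> str:
    return f"{data_key}/{node_id}"

def register(data_key: str, node_id: str) -> None:
    # Each node writes only its own marker, so two nodes can register
    # at the same time without conflict.
    s3.put_object(Bucket=LOCK_BUCKET, Key=marker_key(data_key, node_id), Body=b"")

def unregister(data_key: str, node_id: str) -> None:
    s3.delete_object(Bucket=LOCK_BUCKET, Key=marker_key(data_key, node_id))

def nodes_using(data_key: str) -> list[str]:
    # Enumerate the markers under this data key (ignores pagination for brevity).
    resp = s3.list_objects_v2(Bucket=LOCK_BUCKET, Prefix=f"{data_key}/")
    return [obj["Key"].split("/")[-1] for obj in resp.get("Contents", [])]

def remove_if_unused(data_key: str, node_id: str) -> None:
    unregister(data_key, node_id)
    if not nodes_using(data_key):
        s3.delete_object(Bucket=DATA_BUCKET, Key=data_key)
```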

answered Oct 07 '22 by Tails

To add onto what amadeus said: if your needs aren't relational, you could even use AWS SimpleDB, which is significantly cheaper.
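
For example, the storing-node list could live in SimpleDB instead of inside the S3 object. SimpleDB attributes are multi-valued, so each node can add or remove its own value without a read-modify-write cycle. A minimal sketch (domain and attribute names are made up):

```python
# Minimal sketch of tracking the storing-node list in SimpleDB rather than
# inside the S3 object itself. Each node adds/removes its own attribute
# value independently, so there is no shared record to race on.
import boto3

sdb = boto3.client("sdb")
DOMAIN = "hashset-refs"  # hypothetical SimpleDB domain

def register(data_key: str, node_id: str) -> None:
    sdb.put_attributes(
        DomainName=DOMAIN,
        ItemName=data_key,
        Attributes=[{"Name": "node", "Value": node_id, "Replace": False}],
    )

def unregister(data_key: str, node_id: str) -> None:
    sdb.delete_attributes(
        DomainName=DOMAIN,
        ItemName=data_key,
        Attributes=[{"Name": "node", "Value": node_id}],
    )

def nodes_using(data_key: str) -> list[str]:
    resp = sdb.get_attributes(
        DomainName=DOMAIN, ItemName=data_key, ConsistentRead=True
    )
    return [a["Value"] for a in resp.get("Attributes", []) if a["Name"] == "node"]
```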

answered Oct 07 '22 by tim