I'm currently building a system where S3 will be used as a persistent hash-set (the S3 URL is inferred from the data) by lots of computers across the Internet. If two nodes store the same data, it will be stored under the same key and therefore not stored twice. When an object is removed, I need to know whether some other node(s) are still using that data; in that case I will not remove it.
Right now I've implemented it by adding a list of the storing nodes as part of the data written to S3. So when a node stores the data, the following happens:

1. Read the object from S3.
2. Deserialize it.
3. Add the new node's id to the list of storing nodes.
4. Serialize the new object.
5. Write it back to S3.
This creates a form of idempotent reference counting. Since requests over the Internet can be quite unreliable, I don't want to simply count the number of storing nodes; that's why I store a list instead of a counter, in case a node sends the same request more than once.
This approach works as long as two nodes are not writing simultaneously. S3 doesn't (as far as I know) provide any way to lock the object so that all these 5 steps become atomic.
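For reference, a minimal sketch of the read-modify-write cycle described above, using boto3; the bucket name, key handling, and JSON encoding of the node list are my assumptions, not part of the original design:

```python
import json
import boto3

s3 = boto3.client("s3")

BUCKET = "my-hashset-bucket"  # assumed bucket name

def register_node(key: str, node_id: str) -> None:
    """Non-atomic read-modify-write: another node writing the same key
    between the GET and the PUT can silently overwrite this update."""
    # 1. Read the object from S3.
    obj = s3.get_object(Bucket=BUCKET, Key=key)
    payload = json.loads(obj["Body"].read())
    # 2.-3. Deserialize and add this node's id (a set keeps the step idempotent).
    nodes = set(payload.get("nodes", []))
    nodes.add(node_id)
    payload["nodes"] = sorted(nodes)
    # 4.-5. Serialize and write the object back.
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(payload).encode())
```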
How would you solve this concurrency issue? I'm considering implementing some form of optimistic concurrency. How should I do that for S3? Should I perhaps use a completely different approach?
Amazon S3 doesn't have any limits for the number of connections made to your bucket.
Amazon S3 now provides increased performance to support at least 3,500 requests per second to add data and 5,500 requests per second to retrieve data, which can save significant processing time for no additional charge.
Individual Amazon S3 objects can range in size from a minimum of 0 bytes to a maximum of 5 terabytes. The largest object that can be uploaded in a single PUT is 5 gigabytes. For objects larger than 100 megabytes, customers should consider using the Multipart Upload capability.
You can increase your read or write performance by using parallelization. For example, if you create 10 prefixes in an Amazon S3 bucket to parallelize reads, you could scale your read performance to 55,000 read requests per second. Similarly, you can scale write operations by writing to multiple prefixes.
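To illustrate the prefix idea, here is a small sketch that shards keys across a fixed number of prefixes; the hash-based shard choice and the prefix naming scheme are assumptions for the example:

```python
import hashlib

NUM_PREFIXES = 10  # assumed shard count

def sharded_key(key: str) -> str:
    """Spread keys over NUM_PREFIXES prefixes so reads and writes can be parallelized."""
    shard = int(hashlib.sha256(key.encode()).hexdigest(), 16) % NUM_PREFIXES
    return f"shard-{shard:02d}/{key}"
```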
Consider first separating the lock list from your (protected) data. Create a separate bucket specific to your data to contain the lock list (the bucket name should be a derivative of your data object's name). Use individual files in that second bucket (one per node, with the object name derived from the node name). Nodes add a new object to the second bucket before accessing the protected data, and remove their object from the second bucket when they're finished.
This lets you enumerate the second bucket to determine whether your data is locked, and it allows two nodes to update the lock list simultaneously without conflict.
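A minimal sketch of that lock-list idea with boto3; for simplicity it assumes a single shared lock bucket with a per-object key prefix rather than one bucket per data object, and the bucket and key names are made up:

```python
import boto3

s3 = boto3.client("s3")

LOCK_BUCKET = "my-hashset-locks"  # assumed: one shared bucket for all lock lists

def lock_key(data_key: str, node_id: str) -> str:
    # One lock object per (data object, node) pair.
    return f"{data_key}/locks/{node_id}"

def acquire(data_key: str, node_id: str) -> None:
    """Announce interest in the data object before touching it."""
    s3.put_object(Bucket=LOCK_BUCKET, Key=lock_key(data_key, node_id), Body=b"")

def release(data_key: str, node_id: str) -> None:
    """Withdraw interest when this node no longer stores the data."""
    s3.delete_object(Bucket=LOCK_BUCKET, Key=lock_key(data_key, node_id))

def holders(data_key: str) -> list[str]:
    """Enumerate which nodes still reference the data object."""
    resp = s3.list_objects_v2(Bucket=LOCK_BUCKET, Prefix=f"{data_key}/locks/")
    return [obj["Key"].rsplit("/", 1)[-1] for obj in resp.get("Contents", [])]
```

A node would then delete the protected object only when `holders()` contains no other node ids.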
To add to what amadeus said: if your needs aren't relational, you can even use AWS's SimpleDB, which is significantly cheaper.