Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Writing a distributed queue in Amazon's DynamoDB

I'm looking to convert a large directory of high resolution images (several million) into thumbnails using Python. I have a DynamoDB table that stores the location of each image in S3.

Instead of processing all these images on one EC2 instance (would take weeks) I'd like to write a distributed application using a bunch of instances.

What techniques could I use to write a queue that would allow a node to "check out" an image from the database, resize it, and update the database with the new dimensions of the generated thumbnails?

Specifically I'm worried about atomicity and concurrency -- how can I prevent two nodes from checking out the same job at the same time with DynamoDB?

like image 455
ensnare Avatar asked Sep 01 '12 13:09

ensnare


People also ask

Is DynamoDB a distributed database?

DynamoDB lets you offload the administrative burdens of operating and scaling a distributed database so that you don't have to worry about hardware provisioning, setup and configuration, replication, software patching, or cluster scaling.

Can glue write to DynamoDB?

The DynamoDB writer is supported in AWS Glue version 1.0 or later. AWS Glue supports writing data into another AWS account's DynamoDB table.

Which API is used to write data to DynamoDB?

In Amazon DynamoDB, you can use either the DynamoDB API, or PartiQL, a SQL-compatible query language, to add an item to a table. With the DynamoDB API, you use the PutItem operation to add an item to a table. The primary key for this table consists of Artist and SongTitle. You must specify values for these attributes.


2 Answers

One approach you could take would be to use Amazon's Simple Queue Service(SQS) in conjunction with DynamoDB. So what you could do is write messages to the queue that contain something like the hash key of the image entry in DynamoDB. Each instance would periodically check the queue and grab messages off. When an instance grabs a message off the queue, it becomes invisible to other instances for a given amount of time. You can then look up and process the image and delete the message off the queue. If for some reason something goes wrong with processing the image, the message will not be deleted and it will become visible for other instances to grab.

Another, probably more complicated, approach would be to use DynamoDB's conditional update mechanism to implement a locking scheme. For example, you could add something a 'beingProcessed' attribute to your data model, that is either 0 or 1. The first thing an instance could do is perform a conditional update on this column, changing the value to 1 iff the initial value is 0. There is probably more to do here around making it a proper/robust locking mechanism....

like image 50
nstehr Avatar answered Oct 05 '22 06:10

nstehr


Using DynamoDB's optimistic locking with versioning would allow a node to "check out" a job by updating a status field to "InProgress". If a different node tried checking out the same job by updating the status field, it would receive an error and would know to retrieve a different job.

http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/JavaVersionSupportHLAPI.html

I know this is an old question, so this answer is more for the community than the original poster.

like image 26
frankd Avatar answered Oct 05 '22 07:10

frankd