I'm looking to convert a large directory of high resolution images (several million) into thumbnails using Python. I have a DynamoDB table that stores the location of each image in S3.
Instead of processing all these images on one EC2 instance (would take weeks) I'd like to write a distributed application using a bunch of instances.
What techniques could I use to write a queue that would allow a node to "check out" an image from the database, resize it, and update the database with the new dimensions of the generated thumbnails?
Specifically I'm worried about atomicity and concurrency -- how can I prevent two nodes from checking out the same job at the same time with DynamoDB?
DynamoDB lets you offload the administrative burdens of operating and scaling a distributed database so that you don't have to worry about hardware provisioning, setup and configuration, replication, software patching, or cluster scaling.
The DynamoDB writer is supported in AWS Glue version 1.0 or later. AWS Glue supports writing data into another AWS account's DynamoDB table.
In Amazon DynamoDB, you can use either the DynamoDB API, or PartiQL, a SQL-compatible query language, to add an item to a table. With the DynamoDB API, you use the PutItem operation to add an item to a table. The primary key for this table consists of Artist and SongTitle. You must specify values for these attributes.
One approach you could take would be to use Amazon's Simple Queue Service(SQS) in conjunction with DynamoDB. So what you could do is write messages to the queue that contain something like the hash key of the image entry in DynamoDB. Each instance would periodically check the queue and grab messages off. When an instance grabs a message off the queue, it becomes invisible to other instances for a given amount of time. You can then look up and process the image and delete the message off the queue. If for some reason something goes wrong with processing the image, the message will not be deleted and it will become visible for other instances to grab.
Another, probably more complicated, approach would be to use DynamoDB's conditional update mechanism to implement a locking scheme. For example, you could add something a 'beingProcessed' attribute to your data model, that is either 0 or 1. The first thing an instance could do is perform a conditional update on this column, changing the value to 1 iff the initial value is 0. There is probably more to do here around making it a proper/robust locking mechanism....
Using DynamoDB's optimistic locking with versioning would allow a node to "check out" a job by updating a status field to "InProgress". If a different node tried checking out the same job by updating the status field, it would receive an error and would know to retrieve a different job.
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/JavaVersionSupportHLAPI.html
I know this is an old question, so this answer is more for the community than the original poster.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With