I have an application that reads a message from an SQS queue (let's call it "p"), does computationally expensive image processing (step #1), uploads the result to S3, deletes the message from queue "p", and then sends a notification to an SNS topic (this topic routes the message to another queue, called "q"). A second application reads from queue "q" and does the second stage of the image processing: it downloads the result of step #1 from S3 and performs additional mathematical operations on it.
I run the step #1 application on a mix of regular and spot instances. I know that, because of the SQS visibility timeout, if a spot instance is shut down during the image-processing phase, SQS makes its in-flight messages visible again to other consumers, so the non-spot EC2 instances will eventually do the work that the spot instance did not manage to complete.
Now my question is: what happens if a spot instance is shut down right after the delete but before the message is sent to SNS? How can we recover from such an event?
# PSEUDO CODE
msg = read message from queue
result = doWork(msg)
upload result to S3
delete message from queue
publish to sns about result
First of all, the step #1 application should not delete the message from its SQS queue until AFTER it has sent the SNS message to kick off the second process. Deleting the message from the queue should be the very last thing you do, as the signal that 'my work is done'. Until the SNS message is sent, the work is not done.
Secondly, one of the key things you need to embrace when designing processes like this (and especially when using spot instances) is the concept of idempotence: http://en.wikipedia.org/wiki/Idempotence
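A minimal sketch of that ordering in Python, assuming boto3-style clients are passed in. The method names (receive_message, put_object, publish, delete_message) are the boto3 ones; the queue URL, bucket, topic ARN, the "step1/" key prefix, the notification format and the do_work function are hypothetical placeholders, not something from the question.

```python
import json


def process_one(sqs, s3, sns, queue_url, bucket, topic_arn, do_work):
    """Handle at most one message; return True if a message was processed."""
    resp = sqs.receive_message(
        QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=20
    )
    messages = resp.get("Messages", [])
    if not messages:
        return False
    msg = messages[0]

    # Assumption: the S3 key is derived from the SQS MessageId, so a
    # redelivered message overwrites the same object instead of adding one.
    key = "step1/" + msg["MessageId"]
    s3.put_object(Bucket=bucket, Key=key, Body=do_work(msg["Body"]))

    # Notify the second stage BEFORE deleting the message.
    sns.publish(
        TopicArn=topic_arn,
        Message=json.dumps({"bucket": bucket, "key": key}),
    )

    # Delete last: if the instance dies anywhere above this line, the
    # message becomes visible again after the visibility timeout.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
    return True
```

With this ordering, a shutdown between the publish and the delete causes a duplicate notification rather than a lost one, which is why the second stage also has to tolerate duplicates.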
A unary operation (or function) is idempotent if, whenever it is applied twice to any value, it gives the same result as if it were applied once
Furthermore: http://aws.amazon.com/sqs/faqs/#How_many_times_will_I_receive_each_message
Amazon SQS is engineered to provide “at least once” delivery of all messages in its queues. Although most of the time each message will be delivered to your application exactly once, you should design your system so that processing a message more than once does not create any errors or inconsistencies.
What this ultimately means is that, whether or not a spot instance gets shut down mid-process, there is a real possibility that a given message in an SQS queue will be delivered to multiple worker processes simultaneously, or to the same process more than once, either because SQS delivered it twice or because the spot instance failed after the SNS message was sent but before the message was deleted from the SQS queue.
Without knowing exactly what your processing entails, I can't tell you how to make your process idempotent. But don't try to solve the problem 'what happens if a spot instance gets shut down mid-stream'; instead, think about 'how do I design each step in the process so that it can be run multiple times with the same inputs and not cause any problems'. If you do that, you will kill two birds with one stone.
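As one possible illustration of a step that can safely run more than once, here is a small Python sketch of the second stage. Everything in it is an assumption made for the example: the store is a plain dict standing in for S3, the notification body is assumed to be JSON with a "key" field, and the "step2/" prefix and the transform function are hypothetical.

```python
import json


def handle_notification(store, body, transform):
    """Run stage 2 for one notification; safe to call repeatedly."""
    note = json.loads(body)
    src_key = note["key"]

    # Deterministic output key: the same input always maps to the same
    # output location, so a repeat can only rewrite identical data.
    dst_key = "step2/" + src_key

    if dst_key in store:
        # Duplicate delivery: the work is already done, nothing to redo.
        return dst_key

    store[dst_key] = transform(store[src_key])
    return dst_key
```

The existence check only saves repeated work. It is not atomic, so two workers may still process the same notification at the same moment; the deterministic key is what keeps that harmless, provided transform itself is deterministic.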