
Is it possible to implement a reliable receiver which supports non-graceful shutdown?

I'm curious whether it is an absolute must that a Spark Streaming application be brought down gracefully, or whether it otherwise runs the risk of producing duplicate data via the write-ahead log. In the scenario below I outline the sequence of steps in which a custom receiver interacts with a queue that requires acknowledgements for messages.

  1. Spark queue receiver pulls a batch of messages from the queue.
  2. Spark queue receiver stores the batch of messages into the write-ahead log.
  3. Spark application is terminated before an ack is sent to the queue.
  4. Spark application starts up again.
  5. The messages in the write-ahead log are processed through the streaming application.
  6. Spark queue receiver pulls a batch of messages from the queue that were already seen in step 1, because they were never acknowledged as received.
  7. ...

Is my understanding of how custom receivers should be implemented, and of the duplication problems that come with them, correct? And is it normal to require a graceful shutdown?
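For reference, here is roughly the kind of receiver I have in mind (a simplified sketch; `QueueClient` and its `pull`/`ack` methods are hypothetical placeholders, not a real API):

```scala
import scala.collection.mutable.ArrayBuffer

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Simplified sketch of a reliable receiver for an ack-based queue.
// QueueClient and its pull()/ack() methods are hypothetical placeholders.
class QueueReceiver(queueUrl: String)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER) {

  def onStart(): Unit = {
    new Thread("Queue Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = {
    // the receive loop checks isStopped(), so nothing else to clean up here
  }

  private def receive(): Unit = {
    val client = QueueClient.connect(queueUrl)            // hypothetical client
    while (!isStopped()) {
      val batch = client.pull()                           // step 1: pull a batch
      val messages = ArrayBuffer(batch.messages: _*)
      // store() on multiple records blocks until the data has been replicated
      // and written to the write-ahead log (step 2) ...
      store(messages)
      // ... and only then is the batch acknowledged (step 3). A crash between
      // store() and ack() is exactly what leads to redelivery later.
      client.ack(batch)
    }
  }
}
```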

asked Jul 29 '15 by stphung

1 Answer

Bottom line: It depends on your output operation.

Using the Direct API approach, introduced in Spark 1.3, eliminates this inconsistency between Spark Streaming and Kafka: each record is received by Spark Streaming effectively exactly once despite failures, because offsets are tracked by Spark Streaming within its checkpoints rather than acknowledged by a receiver.
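A minimal sketch of wiring up a direct stream, assuming the 0.8 Kafka integration (`spark-streaming-kafka`); the broker address, topic name, and checkpoint path are placeholders:

```scala
import kafka.serializer.StringDecoder

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("direct-kafka-example")
val ssc = new StreamingContext(conf, Seconds(10))
// With the direct approach, offsets are tracked by Spark Streaming in its
// checkpoints rather than committed by a receiver, which is what removes
// the receiver/WAL duplication described in the question.
ssc.checkpoint("/path/to/checkpoint-dir")                          // placeholder

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")    // placeholder
val topics = Set("my-topic")                                       // placeholder

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)
```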

In order to achieve exactly-once semantics for output of your results, your output operation that saves the data to an external data store must be either idempotent, or an atomic transaction that saves results and offsets.
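As a rough illustration of the second option (saving results and offsets together), assuming `stream` is the direct stream from the sketch above and `saveResultsAndOffsetsInOneTransaction` is a placeholder for your own transactional store logic:

```scala
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  // The direct stream exposes the Kafka offset ranges backing each RDD.
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // Example computation; replace with your own processing.
  val results = rdd.map { case (_, value) => value.length }.collect()

  // Placeholder: commit the results and the offset ranges in a single
  // transaction (or make the write idempotent, keyed on the offsets), so a
  // replay after failure cannot produce duplicates in the external store.
  saveResultsAndOffsetsInOneTransaction(results, offsetRanges)
}
```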

For further information on the Direct API and how to use it, check out this blog post by Databricks.

answered Sep 22 '22 by ofirski