I would like to consume a DynamoDB Stream from a Spark Streaming application.
Spark streaming uses KCL to read from Kinesis. There is a lib to make KCL able to read from a DynamoDB Stream: dynamodb-streams-kinesis-adapter.
But is it possible to plug this lib into spark? Anyone done this?
I'm using Spark 2.1.0.
My backup plan is to have another app reading from DynamoDB stream into a Kinesis stream.
Thanks
The way to do this it to implement the KinesisInputDStream to use the worker provided by dynamodb-streams-kinesis-adapter
The official guidelines suggest something like this:
final Worker worker = StreamsWorkerFactory
.createDynamoDbStreamsWorker(
recordProcessorFactory,
workerConfig,
adapterClient,
amazonDynamoDB,
amazonCloudWatchClient);
From the Spark's perspective, it is implemented under the kinesis-asl module in KinesisInputDStream.scala
I have tried this for Spark 2.4.0. Here is my repo. It needs little refining but gets the work done
https://github.com/ravi72munde/spark-dynamo-stream-asl
After modifying the KinesisInputDStream, we can use it as shown below.
val stream = KinesisInputDStream.builder
.streamingContext(ssc)
.streamName("sample-tablename-2")
.regionName("us-east-1")
.initialPosition(new Latest())
.checkpointAppName("sample-app")
.checkpointInterval(Milliseconds(100))
.storageLevel(StorageLevel.MEMORY_AND_DISK_2)
.build()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With