How to process DynamoDB Stream in a Spark streaming application

I would like to consume a DynamoDB Stream from a Spark Streaming application.

Spark streaming uses KCL to read from Kinesis. There is a lib to make KCL able to read from a DynamoDB Stream: dynamodb-streams-kinesis-adapter.

But is it possible to plug this lib into spark? Anyone done this?

I'm using Spark 2.1.0.

My backup plan is to have another app reading from DynamoDB stream into a Kinesis stream.

Thanks

Raphaël Douyère asked Nov 08 '22

1 Answer

The way to do this is to implement the KinesisInputDStream so it uses the worker provided by dynamodb-streams-kinesis-adapter. The official guidelines suggest something like this:

```java
final Worker worker = StreamsWorkerFactory
    .createDynamoDbStreamsWorker(
        recordProcessorFactory,
        workerConfig,
        adapterClient,
        amazonDynamoDB,
        amazonCloudWatchClient);
```

From Spark's perspective, this is implemented under the kinesis-asl module in KinesisInputDStream.scala.

I have tried this for Spark 2.4.0. Here is my repo. It needs a little refining but gets the work done:

https://github.com/ravi72munde/spark-dynamo-stream-asl

After modifying the KinesisInputDStream, we can use it as shown below:

```scala
val stream = KinesisInputDStream.builder
  .streamingContext(ssc)
  .streamName("sample-tablename-2")
  .regionName("us-east-1")
  .initialPosition(new Latest())
  .checkpointAppName("sample-app")
  .checkpointInterval(Milliseconds(100))
  .storageLevel(StorageLevel.MEMORY_AND_DISK_2)
  .build()
```
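Once the stream is built, it can be consumed like any other DStream. A minimal sketch (the payload handling is an assumption: with the default message handler each element is the raw byte array of a record as delivered by the adapter, and how you deserialize it depends on how the adapter wraps DynamoDB stream records):

```scala
// Assumes `stream` and `ssc` were created as above.
stream.foreachRDD { rdd =>
  // Count records in this micro-batch.
  println(s"Batch received ${rdd.count()} DynamoDB stream records")

  rdd.foreach { bytes: Array[Byte] =>
    // Hypothetical deserialization: the adapter's payload is assumed
    // here to be UTF-8 JSON; adjust to the actual record format.
    val json = new String(bytes, java.nio.charset.StandardCharsets.UTF_8)
    println(json)
  }
}

ssc.start()
ssc.awaitTermination()
```

Checkpointing (configured via `.checkpointAppName` / `.checkpointInterval` above) is what lets the KCL worker resume from the last processed shard position after a restart.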

ravi72munde answered Nov 15 '22