 

Spark Streaming historical state

I am building a real-time process for detecting fraudulent ATM card transactions. To detect fraud efficiently, the logic needs the last transaction date per card and the sum of transaction amounts per day (or over the last 24 hours).

One use case: if a card is used outside its native country more than 30 days after the last transaction in that country, send an alert as possible fraud.

So I looked at Spark Streaming as a solution. To achieve this (I am probably missing some idea about functional programming), below is my pseudo code:

stream = ssc.receiverStream()   // input receiver
s1 = stream.mapToPair()         // key by card number, with transaction date as value
s2 = s1.reduceByKey()           // reduce to keep the latest transaction date per card
s2.checkpoint(new Duration(1000));
s2.persist();
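For the comparison itself (ignoring the per-country detail for brevity), I imagine something like the following, where lastDateByCard is the historical state and currentTxByCard is the current batch, both hypothetical JavaPairDStream<String, Long> of (cardNumber, transactionTimeMillis):

// Hypothetical 30-day check: join current transactions with the stored
// last transaction date per card and flag large gaps
final long THIRTY_DAYS_MS = 30L * 24 * 60 * 60 * 1000;

currentTxByCard.join(lastDateByCard)
    // pair = (cardNumber, (currentTxTime, lastTxTime))
    .filter(pair -> pair._2()._1() - pair._2()._2() > THIRTY_DAYS_MS)
    .foreachRDD(rdd -> rdd.foreach(
        pair -> System.out.println("possible fraud on card " + pair._1())));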

I am facing three problems here:

1) How do I use this last transaction date for future comparisons from the same card?
2) How do I persist the data so that old values of s2 are restored even if the driver program restarts?
3) Can updateStateByKey be used to maintain historical state?

I think I am missing a key point of Spark Streaming/functional programming: how to implement this kind of logic.

asked Jun 20 '14 by Jigar Parekh


People also ask

How are stateful streams implemented in Spark?

Spark Streaming currently has two implementations for stateful streams. One is the older PairDStreamFunctions.updateStateByKey (the only option in Spark <= 1.5.0), which uses a CoGroupedRDD to store the state for each key. The newer PairDStreamFunctions.mapWithState (Spark >= 1.6.0) uses an OpenHashMapBasedStateMap[K, V] to store the internal state.
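For illustration, a minimal mapWithState sketch in the Spark 2.x Java API that keeps the latest transaction time per card; cardAndDate is an assumed JavaPairDStream<String, Long> of (cardNumber, transactionTimeMillis), and checkpointing must be enabled on the context:

import org.apache.spark.api.java.Optional;
import org.apache.spark.api.java.function.Function3;
import org.apache.spark.streaming.State;
import org.apache.spark.streaming.StateSpec;
import org.apache.spark.streaming.api.java.JavaMapWithStateDStream;
import scala.Tuple2;

// Merge each batch value into the per-key state
Function3<String, Optional<Long>, State<Long>, Tuple2<String, Long>> mappingFunc =
    (card, txTime, state) -> {
      long latest = Math.max(txTime.orElse(0L),
                             state.exists() ? state.get() : 0L);
      state.update(latest);               // store the new state for this card
      return new Tuple2<>(card, latest);  // emit (card, latest transaction time)
    };

JavaMapWithStateDStream<String, Long, Long, Tuple2<String, Long>> lastDates =
    cardAndDate.mapWithState(StateSpec.function(mappingFunc));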

What is Spark Streaming?

In production, Spark Streaming uses ZooKeeper and HDFS for high availability. Spark Streaming is developed as part of Apache Spark. It thus gets tested and updated with each Spark release. If you have questions about the system, ask on the Spark mailing lists. The Spark Streaming developers welcome contributions.

What are the limitations of state management in Spark Streaming?

In old Spark Streaming, state management was quite inefficient, which made it a poor fit for stateful processing. This was because of two major limitations in its design: in every micro-batch, the state was persisted along with the checkpoint metadata (i.e. the offsets or progress of the stream).

What are the data sources for Apache Spark Streaming?

Spark Streaming can read data from HDFS, Flume, Kafka, Twitter and ZeroMQ. You can also define your own custom data sources. You can run Spark Streaming on Spark's standalone cluster mode or other supported cluster resource managers. It also includes a local run mode for development.
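For instance, the simplest built-in source is a plain TCP socket (host and port here are arbitrary):

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

SparkConf conf = new SparkConf().setAppName("fraud-detection").setMaster("local[2]");
JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(1));

// Built-in TCP text source; Kafka, Flume, Twitter, etc. ship their own
// *Utils.createStream helpers in separate artifacts
JavaReceiverInputDStream<String> lines = ssc.socketTextStream("localhost", 9999);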


1 Answer

If you are using Spark Streaming you shouldn't really save your state to a file, especially if you plan to run your application 24/7. If that is not your intention, you will probably be fine with just a plain Spark application, since then you are only doing big data computation, not computation over batches arriving in real time.

Yes, updateStateByKey can be used to maintain state through the various batches but it has a particular signature that you can see in the docs: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions
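As a minimal sketch of that signature applied to your case (assuming a JavaPairDStream<String, Long> called cardAndDate holding (cardNumber, transactionTimeMillis) pairs, and Spark's org.apache.spark.api.java.Optional from the 2.x Java API):

import java.util.List;
import org.apache.spark.api.java.Optional;
import org.apache.spark.streaming.api.java.JavaPairDStream;

// updateStateByKey folds each batch's new values into the running state;
// here the state is simply the latest transaction time seen per card
JavaPairDStream<String, Long> lastDateByCard = cardAndDate.updateStateByKey(
    (List<Long> newTimes, Optional<Long> current) -> {
      long latest = current.orElse(0L);
      for (Long t : newTimes) {
        latest = Math.max(latest, t);
      }
      // returning an empty Optional instead would drop the card's state
      return Optional.of(latest);
    });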

Also, persist() is just a form of caching; by default it doesn't actually persist your data to disk (as in a file).
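To make the distinction concrete (the checkpoint directory and the createContext() factory below are placeholders):

import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

// persist() only controls caching of the DStream's RDDs within one run:
s2.persist(StorageLevel.MEMORY_AND_DISK());  // still lost if the driver dies

// Durable state needs checkpointing, plus context recovery on restart:
ssc.checkpoint("hdfs:///checkpoints/fraud-app");

JavaStreamingContext recovered = JavaStreamingContext.getOrCreate(
    "hdfs:///checkpoints/fraud-app",
    () -> createContext());  // hypothetical factory building a fresh context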

I hope this clarifies some of your doubts.

answered Oct 19 '22 by gprivitera