I am building real-time processing to detect fraud in ATM card transactions. To detect fraud efficiently, the logic requires the last transaction date per card and the sum of transaction amounts per day (or over the last 24 hours).
One use case is: if a card transacts outside its native country more than 30 days after the last transaction in that country, then send an alert as possible fraud.
So I tried to look at Spark Streaming as a solution. To achieve this (I am probably missing some idea about functional programming), below is my pseudo-code:
    stream = ssc.receiverStream(...);   // input receiver
    s1 = stream.mapToPair(...);         // key = card, value = transaction date
    s2 = s1.reduceByKey(...);           // reduce to the latest transaction date per card
    s2.checkpoint(new Duration(1000));
    s2.persist();
I am facing three problems here:
1) How do I use this last transaction date later, for future comparisons against the same card?
2) How do I persist the data, so that even if the driver program restarts, the old values of s2 are restored?
3) Can updateStateByKey be used to maintain historical state?
I think I am missing a key point of Spark Streaming / functional programming: how to implement this kind of logic.
Spark Streaming currently has two implementations for stateful streams. The older one is PairDStreamFunctions.updateStateByKey (the only option up to Spark 1.5.x), which uses a CoGroupedRDD to store the state for each key. The newer one, PairDStreamFunctions.mapWithState (Spark >= 1.6.0), uses an OpenHashMapBasedStateMap[K, V] to store the internal state.
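If you are on Spark 1.6+, a minimal sketch of mapWithState for the last-transaction-date part might look like this in Java (assuming s1 is a JavaPairDStream<String, Long> of (cardId, transactionTime) pairs, as in the pseudo-code above; mapWithState also requires a checkpoint directory to be set on the context):

    import org.apache.spark.api.java.Optional;
    import org.apache.spark.api.java.function.Function3;
    import org.apache.spark.streaming.State;
    import org.apache.spark.streaming.StateSpec;
    import org.apache.spark.streaming.api.java.JavaMapWithStateDStream;
    import scala.Tuple2;

    // For each card, remember the latest transaction timestamp seen so far.
    Function3<String, Optional<Long>, State<Long>, Tuple2<String, Long>> mappingFunc =
        (cardId, txnTime, state) -> {
            long previous = state.exists() ? state.get() : 0L;
            long latest = Math.max(txnTime.orElse(0L), previous);
            state.update(latest);                // kept across batches by Spark
            return new Tuple2<>(cardId, latest); // emitted downstream every batch
        };

    JavaMapWithStateDStream<String, Long, Long, Tuple2<String, Long>> lastTxnDate =
        s1.mapWithState(StateSpec.function(mappingFunc));

The 30-day rule could then be evaluated either inside the mapping function or by joining incoming transactions against lastTxnDate.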
In production, Spark Streaming uses ZooKeeper and HDFS for high availability. Spark Streaming is developed as part of Apache Spark, so it gets tested and updated with each Spark release. If you have questions about the system, ask on the Spark mailing lists. The Spark Streaming developers welcome contributions.
In old Spark Streaming, state management was quite inefficient, which made it a poor fit for stateful processing. This came from two major limitations in its design: in every micro-batch, the entire state was persisted along with the checkpoint metadata (i.e. the offsets or progress of the stream), and the update function was invoked for every key in the state on every batch, even for keys with no new data, so processing time grew with the total state size rather than the batch size.
Spark Streaming can read data from HDFS, Flume, Kafka, Twitter and ZeroMQ. You can also define your own custom data sources. You can run Spark Streaming on Spark's standalone cluster mode or other supported cluster resource managers. It also includes a local run mode for development.
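For development, a local-mode context with a simple socket source is enough to experiment (a sketch; the host and port are placeholders):

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("FraudDev");
    JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(1));

    // One line of text per transaction; parse into (cardId, timestamp) downstream.
    JavaReceiverInputDStream<String> lines = ssc.socketTextStream("localhost", 9999);

Note local[2] (or more): one core is occupied by the receiver, so at least one other core is needed to actually process the data.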
If you are using Spark Streaming you shouldn't really save your state in a file yourself, especially if you plan to run your application 24/7. If that is not your intention, you will probably be fine with a plain Spark application, since then you are only facing a big-data computation, not computation over batches arriving in real time.
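To survive a driver restart (your problem 2), the standard pattern is not to persist s2 yourself but to let Spark recover the whole streaming context, including checkpointed state, from a reliable directory. A minimal sketch, assuming a hypothetical HDFS path:

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    String checkpointDir = "hdfs:///checkpoints/fraud-app"; // hypothetical path

    JavaStreamingContext ssc = JavaStreamingContext.getOrCreate(checkpointDir, () -> {
        SparkConf conf = new SparkConf().setAppName("CardFraudDetection");
        JavaStreamingContext created = new JavaStreamingContext(conf, Durations.seconds(1));
        created.checkpoint(checkpointDir); // metadata + state checkpoints land here
        // Build the complete DStream graph (receiver, mapToPair, stateful ops)
        // inside this factory so it can be reconstructed on recovery.
        return created;
    });

    ssc.start();
    ssc.awaitTermination();

On a clean start the factory runs and a fresh context is created; after a crash or restart, getOrCreate instead rebuilds the context and its state from checkpointDir.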
Yes, updateStateByKey can be used to maintain state across batches, but it has a particular signature that you can see in the docs: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions
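For example, a sketch of that signature in Java (again assuming s1 holds (cardId, transactionTime) pairs, and that ssc.checkpoint(...) has been set, which updateStateByKey requires):

    import java.util.List;
    import org.apache.spark.api.java.Optional;
    import org.apache.spark.api.java.function.Function2;
    import org.apache.spark.streaming.api.java.JavaPairDStream;

    // newTimes: all timestamps for this card in the current batch;
    // current:  the state (latest timestamp) carried over from earlier batches.
    Function2<List<Long>, Optional<Long>, Optional<Long>> updateFunc =
        (newTimes, current) -> {
            long latest = current.orElse(0L);
            for (Long t : newTimes) {
                latest = Math.max(latest, t);
            }
            return Optional.of(latest); // returning Optional.empty() would drop the key
        };

    JavaPairDStream<String, Long> lastTxnDate = s1.updateStateByKey(updateFunc);

Be aware that updateStateByKey calls the function for every key in the state on every batch, so it gets slower as the number of cards grows; mapWithState avoids this.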
Also, persist() is just a form of caching; it doesn't actually persist your data reliably on disk (like writing it to a file would).
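To make the difference concrete (a sketch against your s2 stream):

    import org.apache.spark.storage.StorageLevel;
    import org.apache.spark.streaming.Duration;

    s2.persist(StorageLevel.MEMORY_AND_DISK()); // caching only: lost when the driver dies
    s2.checkpoint(new Duration(10000));         // written to the checkpoint dir; survives restarts

The docs suggest a checkpoint interval of roughly 5-10 times the batch interval; checkpointing every second, as in your pseudo-code, adds a lot of HDFS I/O.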
I hope this clarifies some of your doubts.