 

How to get the current batch timestamp in Spark streaming

How to get the current batch timestamp (DStream) in Spark Streaming?

I have a Spark Streaming application where the input data undergoes many transformations.

I need the current timestamp during execution to validate the timestamp in the input data.

If I compare against the current system time, the timestamp may differ between one RDD transformation's execution and the next.

Is there any way to get the timestamp at which a particular Spark Streaming micro-batch started, or which micro-batch interval it belongs to?

asked Dec 18 '22 by Vijay Innamuri


2 Answers

dstream.foreachRDD((rdd, time) => {
  // `time` is the scheduler time (org.apache.spark.streaming.Time) of this
  // micro-batch; consecutive batches are spaced by your batch/window/slide interval.
})
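As a rough illustration of how that batch time could drive the validation described in the question (a sketch only, not part of the answer: it assumes `dstream` is a DStream[String] of CSV lines whose first field is an event timestamp in epoch milliseconds):

import org.apache.spark.streaming.Time

dstream.foreachRDD { (rdd, batchTime: Time) =>
  val batchMillis = batchTime.milliseconds   // start time of this micro-batch, in epoch ms
  val valid = rdd.filter { line =>
    // assumed record layout: "<eventTimestampMs>,<payload...>"
    val eventMillis = line.split(",")(0).trim.toLong
    eventMillis <= batchMillis               // reject records stamped after the batch time
  }
  println(s"batch $batchMillis: ${valid.count()} valid records")
}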
answered Dec 21 '22 by XIO


dstream.transform((rdd, time) => {
    // tag every record with the Time of the micro-batch it belongs to
    rdd.map(record => (time, record))
}).filter(...)
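A hedged sketch of what the elided filter might look like (the (Time, record) pairs come from the transform above; the CSV record layout and the one-minute tolerance are assumptions added for illustration):

import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream

// assumed: `dstream` is a DStream[String] of CSV lines whose first field is
// an event timestamp in epoch milliseconds
val tagged: DStream[(Time, String)] =
  dstream.transform((rdd, time) => rdd.map(record => (time, record)))

val withinTolerance = tagged.filter { case (batchTime, line) =>
  val eventMillis = line.split(",")(0).trim.toLong
  math.abs(batchTime.milliseconds - eventMillis) <= 60 * 1000L   // assumed 1-minute tolerance
}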
answered Dec 21 '22 by Piotr Uchman