In my scenario I have several dataSet that comes every now and then that i need to ingest in our platform. The ingestion processes involves several transformation steps. One of them being Spark. In particular I use spark structured streaming so far. The infrastructure also involve kafka from which spark structured streaming reads data.
I wonder if there is a way to detect when there is nothing else to consume from a topic for a while to decide to stop the job. That is i want to run it for the time it takes to consume that specific dataset and then stop it. For specific reasons we decided not to use the batch version of spark.
Hence is there any timeout or something that can be used to detect that there is no more data coming it and that everything has be processed.
Thank you
Now the application will continuously check for that file at specified location. Now , If you want to stop it gracefully, you only need to delete that file and job will stop only after finishing current processing batch as well as queued batches so there is no lose of data.
Users specify a streaming computation by writing a batch computation (using Spark's DataFrame/Dataset API), and the engine automatically incrementalizes this computation (runs it continuously).
exactly once semantics are only possible if the source is re-playable and the sink is idempotent.
Spark receives real-time data and divides it into smaller batches for the execution engine. In contrast, Structured Streaming is built on the SparkSQL API for data stream processing. In the end, all the APIs are optimized using Spark catalyst optimizer and translated into RDDs for execution under the hood.
Structured Streaming Monitoring Options
You can use query.lastProgress to get the timestamp and build logic around that. Don't forget to save your checkpoint to a durable, persistent, available store.
Putting together a couple pieces of advice:
So one option is to periodically check for query activity, dynamically shutting down depending on a configurable state (when you determine no further progress can/should be made):
// where you configure your spark job...
spark.streams.addListener(shutdownListener(spark))
// your job code starts here by calling "start()" on the stream...
// periodically await termination, checking for your shutdown state
while(!spark.sparkContext.isStopped) {
if (shutdown) {
println(s"Shutting down since first batch has completed...")
spark.streams.active.foreach(_.stop())
spark.stop()
} else {
// wait 10 seconds before checking again if work is complete
spark.streams.awaitAnyTermination(10000)
}
}
Your listener can dynamically shutdown in a variety of ways. For instance, if you're only waiting on a single batch, then just shutdown after the first update:
var shutdown = false
def shutdownListener(spark: SparkSession) = new StreamingQueryListener() {
override def onQueryStarted(_: QueryStartedEvent): Unit = println("Query started: " + queryStarted.id)
override def onQueryTerminated(_: QueryTerminatedEvent): Unit = println("Query terminated! " + queryTerminated.id)
override def onQueryProgress(_: QueryProgressEvent): Unit = shutdown = true
}
Or, if you need to shutdown after more complicated state changes, you could parse the json body of the queryProgress.progress
to determine whether or not to shutdown at a given onQueryUpdate
event firing.
You can probably use this:-
def stopStreamQuery(query: StreamingQuery, awaitTerminationTimeMs: Long) {
while (query.isActive) {
try{
if(query.lastProgress.numInputRows < 10){
query.awaitTermination(1000)
}
}
catch
{
case e:NullPointerException => println("First Batch")
}
Thread.sleep(500)
}
}
You can make a numInputRows variable.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With