Is there a way to dynamically stop Spark Structured Streaming?

In my scenario, several datasets arrive every now and then that I need to ingest into our platform. The ingestion process involves several transformation steps, one of which runs on Spark. So far I use Spark Structured Streaming in particular. The infrastructure also involves Kafka, from which Spark Structured Streaming reads the data.

I wonder if there is a way to detect that there has been nothing left to consume from a topic for a while, and then decide to stop the job. That is, I want to run it for the time it takes to consume that specific dataset and then stop it. For specific reasons we decided not to use the batch version of Spark.

Hence, is there a timeout or something similar that can be used to detect that no more data is coming in and that everything has been processed?

Thank you

asked Sep 25 '18 12:09

MaatDeamon

3 Answers

Structured Streaming Monitoring Options

You can use query.lastProgress to get the timestamp of the most recent progress update and build logic around that. Don't forget to save your checkpoint to a durable, persistent, available store.
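
As a rough illustration of "building logic around" lastProgress, here is a minimal sketch of an idle timeout. The names idleFor and stopWhenIdle and the 5-second default poll interval are my own assumptions, not Spark APIs. Only lastProgress, numInputRows, timestamp, status.isDataAvailable, awaitTermination and stop come from Spark. It assumes that "no rows for idleTimeoutMs" is an acceptable signal that the dataset has been fully consumed.

```scala
import java.time.Instant
import org.apache.spark.sql.streaming.StreamingQuery

// Pure decision helper: true once no rows have arrived for idleTimeoutMs.
def idleFor(lastRowsSeenAtMs: Long, nowMs: Long, idleTimeoutMs: Long): Boolean =
  nowMs - lastRowsSeenAtMs >= idleTimeoutMs

// Hypothetical helper (not a Spark API): poll lastProgress, stop when idle.
def stopWhenIdle(query: StreamingQuery, idleTimeoutMs: Long, pollMs: Long = 5000L): Unit = {
  // Until a batch with rows is reported, measure idleness from now.
  var lastRowsSeenAtMs = System.currentTimeMillis()
  while (query.isActive) {
    val progress = query.lastProgress // null before the first progress report
    if (progress != null && progress.numInputRows > 0) {
      // timestamp is an ISO-8601 UTC string marking the start of that trigger
      val batchStartMs = Instant.parse(progress.timestamp).toEpochMilli
      lastRowsSeenAtMs = math.max(lastRowsSeenAtMs, batchStartMs)
    }
    val idle = idleFor(lastRowsSeenAtMs, System.currentTimeMillis(), idleTimeoutMs)
    if (idle && !query.status.isDataAvailable) {
      query.stop()
    } else {
      query.awaitTermination(pollMs) // returns early if the query terminates
    }
  }
}
```

You would call something like stopWhenIdle(query, 10 * 60 * 1000L) right after start(). Note that a single batch that runs longer than the timeout could still look idle, so pick the timeout generously.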

answered Nov 07 '22 14:11

Michael West

Putting together a couple pieces of advice:

As @Michael West pointed out, there are listeners to track progress
From what I gather, Structured Streaming doesn't yet support graceful shutdown

So one option is to periodically check for query activity, dynamically shutting down depending on a configurable state (when you determine no further progress can/should be made):

// where you configure your spark job...
spark.streams.addListener(shutdownListener(spark))

// your job code starts here by calling "start()" on the stream...

// periodically await termination, checking for your shutdown state
while (!spark.sparkContext.isStopped) {
  if (shutdown) {
    println("Shutting down since first batch has completed...")
    spark.streams.active.foreach(_.stop())
    spark.stop()
  } else {
    // wait 10 seconds before checking again if work is complete
    spark.streams.awaitAnyTermination(10000)
  }
}

Your listener can trigger the shutdown in a variety of ways. For instance, if you're only waiting on a single batch, then just shut down after the first progress update:

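import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// written by the listener thread, read by the driver loop, hence @volatile
@volatile var shutdown = false
def shutdownListener(spark: SparkSession) = new StreamingQueryListener() {
  override def onQueryStarted(queryStarted: QueryStartedEvent): Unit = println("Query started: " + queryStarted.id)
  override def onQueryTerminated(queryTerminated: QueryTerminatedEvent): Unit = println("Query terminated! " + queryTerminated.id)
  // the first progress update flips the flag that the driver loop polls
  override def onQueryProgress(queryProgress: QueryProgressEvent): Unit = shutdown = true
}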

Or, if you need to shut down after more complicated state changes, you could parse the JSON body of queryProgress.progress to determine whether or not to shut down when a given onQueryProgress event fires.
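
For the "more complicated state" case, one possible sketch keeps the state in a small tracker that the listener feeds, reading the typed numInputRows field instead of parsing JSON. IdleTracker is a hypothetical helper of my own, not part of Spark, and it assumes that "no rows for a configurable time" is your shutdown condition. Depending on the Spark version, empty triggers may be reported through a separate idle callback instead of onQueryProgress, so the tracker only relies on events that carried rows.

```scala
import java.util.concurrent.atomic.AtomicLong

// Hypothetical helper, not part of Spark: remembers when rows were last seen.
final class IdleTracker(idleTimeoutMs: Long, startedAtMs: Long) {
  // AtomicLong because the listener thread writes and the driver loop reads.
  private val lastRowsSeenAtMs = new AtomicLong(startedAtMs)

  // Call from onQueryProgress with queryProgress.progress.numInputRows.
  def record(numInputRows: Long, atMs: Long): Unit =
    if (numInputRows > 0) lastRowsSeenAtMs.set(atMs)

  // Call from the driver loop in place of the plain `shutdown` flag.
  def shouldShutdown(nowMs: Long): Boolean =
    nowMs - lastRowsSeenAtMs.get() >= idleTimeoutMs
}
```

In the listener you would call tracker.record(queryProgress.progress.numInputRows, System.currentTimeMillis()), and in the loop you would test tracker.shouldShutdown(System.currentTimeMillis()) instead of shutdown.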

answered Nov 07 '22 14:11

ecoe

You can probably use this:

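import org.apache.spark.sql.streaming.StreamingQuery

def stopStreamQuery(query: StreamingQuery, awaitTerminationTimeMs: Long): Unit = {
  while (query.isActive) {
    // lastProgress is null until the first batch has reported progress
    Option(query.lastProgress) match {
      case Some(progress) if progress.numInputRows < 10 =>
        // almost nothing left to read, so stop the query
        query.stop()
      case Some(_) =>
        // still receiving data: wait before checking again
        query.awaitTermination(awaitTerminationTimeMs)
      case None =>
        println("First Batch")
    }
    Thread.sleep(500)
  }
}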

You can turn the numInputRows threshold (10 here) into a variable.

answered Nov 07 '22 13:11

Nilesh Sinha

Tags:

apache-kafka

apache-spark

spark-streaming

spark-structured-streaming