
Monitoring Structured Streaming

I have a structured stream set up that is running just fine, but I was hoping to monitor it while it is running.

I have built an EventCollector

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

class EventCollector extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = {
    println("Start")
  }

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    println(event.queryStatus.prettyJson)
  }

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {
    println("Term")
  }
}

I then added the listener to my spark session

val listener = new EventCollector()
spark.streams.addListener(listener)

Then I fire off the query

val query = inputDF.writeStream
  //.format("console")
  .queryName("Stream")
  .foreach(writer)
  .start()

query.awaitTermination()

However, onQueryProgress never gets hit. onQueryStarted does, but I was hoping to get the progress of the query at a certain interval to monitor how the queries are doing. Can anyone assist with this?

asked Dec 02 '16 by Leyth G

People also ask

What is Structured Streaming?

Structured Streaming is a high-level API for stream processing that became production-ready in Spark 2.2. Structured Streaming allows you to take the same operations that you perform in batch mode using Spark's structured APIs, and run them in a streaming fashion.
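For illustration, here is a minimal sketch of that idea (the path, schema source, and column names are invented for the example): the same groupBy/count logic runs as a batch job with read and as a stream with readStream.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("demo").getOrCreate()

// Batch: read the directory once
val batchDF = spark.read.json("/data/events")
val batchCounts = batchDF.groupBy("userId").count()

// Streaming: the identical transformation, applied to files as they arrive
// (file sources need an explicit schema; here we reuse the batch schema)
val streamDF = spark.readStream.schema(batchDF.schema).json("/data/events")
val streamCounts = streamDF.groupBy("userId").count()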

How do you handle late data in Structured Streaming?

Watermarking is a feature in Spark Structured Streaming that is used to handle the data that arrives late. Spark Structured Streaming can maintain the state of the data that arrives, store it in memory, and update it accurately by aggregating it with the data that arrived late.
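As a rough sketch of what that looks like in code (eventsDF, the column names, and the thresholds are all assumptions for the example), withWatermark tells Spark how long to keep state around for late-arriving rows:

import org.apache.spark.sql.functions.window
import spark.implicits._  // for the $"col" syntax

// eventsDF: a streaming DataFrame with an eventTime timestamp column
val windowedCounts = eventsDF
  .withWatermark("eventTime", "10 minutes")  // accept data up to 10 minutes late
  .groupBy(window($"eventTime", "5 minutes"), $"deviceId")
  .count()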

What is the difference between DStream and Structured Streaming?

Internally, a DStream is a sequence of RDDs. Spark receives real-time data and divides it into smaller batches for the execution engine. In contrast, Structured Streaming is built on the SparkSQL API for data stream processing.
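To make the contrast concrete, a small sketch (the socket host and port are placeholders): the DStream version is manipulated through RDD-style operations, while the Structured Streaming version uses the same DataFrame API as batch SparkSQL.

// DStream: RDD-based micro-batches
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(spark.sparkContext, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

// Structured Streaming: the same source as a DataFrame
val linesDF = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()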

What is readStream in Spark?

Spark Streaming uses readStream to monitor a folder and process files that arrive in the directory in real time, and uses writeStream to write out a DataFrame or Dataset. Spark Streaming is a scalable, high-throughput, fault-tolerant stream processing system that supports both batch and streaming workloads.
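A short sketch of that pattern (directory paths and schema are made up; file sources require an explicit schema):

import org.apache.spark.sql.types._

val schema = new StructType()
  .add("id", LongType)
  .add("payload", StringType)

// readStream watches the folder and picks up new CSV files as they land
val df = spark.readStream.schema(schema).csv("/incoming/csv")

// writeStream writes the results out; file sinks need a checkpoint location
val query = df.writeStream
  .format("parquet")
  .option("path", "/output/parquet")
  .option("checkpointLocation", "/output/checkpoints")
  .start()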


1 Answer

After much research into this topic, this is what I have found...

onQueryProgress gets hit in between queries. I am not sure if this is intentional functionality or not, but while we are streaming data from a file, onQueryProgress does not fire.

A solution I have found was to rely on the foreach writer sink and perform my own analysis of performance within the process function. Unfortunately, we cannot access specific information about the query that is running (or I have not figured out how to yet). This is what I have implemented in my sandbox to analyze performance:

import org.apache.spark.sql.ForeachWriter

var counter = 0L
val startTime = System.nanoTime()

val writer = new ForeachWriter[rawDataRow] {
    def open(partitionId: Long, version: Long): Boolean = {
        //We end up here in between files
        true
    }

    def process(value: rawDataRow): Unit = {
        counter += 1

        if (counter % 1000 == 0) {
            val currentTime = System.nanoTime()
            val elapsedTime = (currentTime - startTime) / 1000000000.0

            println(s"Records Written:  $counter")
            println(s"Time Elapsed: $elapsedTime seconds")
        }
    }

    def close(errorOrNull: Throwable): Unit = {}
}
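As a side note, if you are on Spark 2.1 or later, the query handle itself exposes progress, so you can also poll it on your own interval instead of blocking on awaitTermination; a minimal sketch, using the query object from the question:

// Poll the running query instead of (or alongside) the listener
while (query.isActive) {
  val progress = query.lastProgress  // null until the first trigger completes
  if (progress != null) println(progress.prettyJson)
  Thread.sleep(10000)  // every 10 seconds
}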

An alternative way to get metrics:

Another way to get information about the running queries is to hit the GET endpoints that Spark provides:

http://localhost:4040/metrics

or

http://localhost:4040/api/v1/

Documentation here: http://spark.apache.org/docs/latest/monitoring.html
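For example, the REST API can be consumed from plain Scala with no extra dependencies (a quick sketch; error handling omitted):

import scala.io.Source

// List the applications this Spark UI knows about; each entry carries the app id
val apps = Source.fromURL("http://localhost:4040/api/v1/applications").mkString
println(apps)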

Update 2 (Sept 2017): Tested on regular Spark Streaming, not Structured Streaming

Disclaimer: this may not apply to Structured Streaming; I need to set up a test bed to confirm. However, it does work with regular Spark Streaming (consuming from Kafka in this example).

I believe that since Spark Streaming 2.2 was released, new endpoints exist that can retrieve more metrics on the performance of the stream. This may have existed in previous versions and I just missed it, but I wanted to make sure it was documented for anyone else searching for this information.

http://localhost:4040/api/v1/applications/{applicationIdHere}/streaming/statistics

This endpoint looks like it was added in 2.2 (or it already existed and was just added to the documentation; I'm not sure, I haven't checked).

Anyway, it returns metrics in this format for the specified streaming application:

{
  "startTime" : "2017-09-13T14:02:28.883GMT",
  "batchDuration" : 1000,
  "numReceivers" : 0,
  "numActiveReceivers" : 0,
  "numInactiveReceivers" : 0,
  "numTotalCompletedBatches" : 90379,
  "numRetainedCompletedBatches" : 1000,
  "numActiveBatches" : 0,
  "numProcessedRecords" : 39652167,
  "numReceivedRecords" : 39652167,
  "avgInputRate" : 771.722,
  "avgSchedulingDelay" : 2,
  "avgProcessingTime" : 85,
  "avgTotalDelay" : 87
}

This gives us the ability to build our own custom metric/monitoring applications using the REST endpoints that are exposed by Spark.
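A rough sketch of doing that from Scala (the application id below is a placeholder; you would look it up from /api/v1/applications first, and JSON parsing is left out):

import scala.io.Source

def streamingStats(appId: String): String =
  Source.fromURL(s"http://localhost:4040/api/v1/applications/$appId/streaming/statistics").mkString

// e.g. print the raw statistics JSON every 10 seconds
while (true) {
  println(streamingStats("app-XXXX"))  // hypothetical app id
  Thread.sleep(10000)
}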

answered Oct 08 '22 by Leyth G