 

Structured streaming - Metrics in Grafana

I am using Structured Streaming to read data from Kafka and compute various aggregate metrics. I have enabled the Graphite sink via metrics.properties. Applications on older Spark versions exposed streaming-related metrics, but I don't see those metrics with Structured Streaming. What am I doing wrong?
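For reference, my metrics.properties looks roughly like this (the host and prefix are placeholders for my actual setup):

# conf/metrics.properties - enable the Graphite sink for all instances
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite.example.com
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=my-spark-app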

For example, I cannot find unprocessed batches, running batches, or the total delay of the last completed batch.

I have enabled streaming metrics by setting:

SparkSession.builder().config("spark.sql.streaming.metricsEnabled", true)
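The same flag can also be set on an already-created session through the runtime config:

spark.conf.set("spark.sql.streaming.metricsEnabled", "true")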

Even then, I am getting only three metrics:

  • driver.spark.streaming.inputrate
  • driver.spark.streaming.latency
  • driver.spark.streaming.processingrate

These metrics have gaps between them, and they only start appearing well after the application has started. How do I get more extensive streaming-related metrics into Grafana?

I checked StreamingQueryProgress, but it seems we can only build custom metrics programmatically from it. Is there a way to consume the metrics that Spark already sends to the sink I configured?

asked Dec 19 '17 by passionate


People also ask

What is the difference between DStream and Structured Streaming?

Spark Streaming introduced the Discretized Stream (DStream) abstraction for processing data in motion. Internally, a DStream is a sequence of RDDs: Spark receives real-time data and divides it into smaller batches for the execution engine. In contrast, Structured Streaming is built on the Spark SQL API for data stream processing.
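A minimal sketch of the contrast, assuming an existing SparkSession named spark and using a socket source purely for illustration:

// DStream API: explicit micro-batches driven by a StreamingContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(spark.sparkContext, Seconds(10))
val counts = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.print()

// Structured Streaming: the same computation expressed as a DataFrame query
import spark.implicits._
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
val wordCounts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()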

What are metrics in Grafana?

Metrics tell you how much of something exists, such as how much memory a computer system has available or how many centimeters long a desktop is. In the case of Grafana, metrics are most useful when they are recorded repeatedly over time.

What are Spark Streaming and Structured Streaming?

Apache Spark Structured Streaming is a near-real-time processing engine that offers end-to-end fault tolerance with exactly-once processing guarantees using familiar Spark APIs. Structured Streaming lets you express computation on streaming data in the same way you express a batch computation on static data.
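For instance, the same aggregation runs unchanged on a static DataFrame and on a streaming one; only the read and write ends differ (the path and eventSchema below are hypothetical):

// Batch: read static data, aggregate once
val batchCounts = spark.read.json("/data/events")
  .groupBy("userId").count()

// Streaming: the identical aggregation over a stream of files
val streamCounts = spark.readStream.schema(eventSchema)
  .json("/data/events")
  .groupBy("userId").count()

streamCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()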


1 Answer

You can still get some of those metrics. The StreamingQuery handle returned when you start the stream exposes two methods: lastProgress and recentProgress.

They expose details such as the number of input rows in the batch, the batch duration, and the processing rate, among other things. The progress object also has a json method that returns all of this information in a single string, which can be used for sending to a metrics collector.
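A rough sketch of both approaches, assuming a started query over some aggregated DataFrame; sendToGraphite is a hypothetical stand-in for whatever client pushes to your collector:

val query = aggregatedDf.writeStream
  .outputMode("update")
  .format("console")
  .start()

// One-off inspection: lastProgress is null until the first batch completes
val p = query.lastProgress
if (p != null) {
  println(p.numInputRows)           // rows in the most recent micro-batch
  println(p.processedRowsPerSecond) // throughput of that batch
  println(p.json)                   // the whole progress report as JSON
}

// Continuous forwarding: a listener fires after every micro-batch
import org.apache.spark.sql.streaming.StreamingQueryListener
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(e: StreamingQueryListener.QueryStartedEvent): Unit = ()
  override def onQueryProgress(e: StreamingQueryListener.QueryProgressEvent): Unit =
    sendToGraphite(e.progress.json) // hypothetical helper
  override def onQueryTerminated(e: StreamingQueryListener.QueryTerminatedEvent): Unit = ()
})

The listener route is usually the better fit for Grafana, since it pushes a full progress report after every micro-batch instead of relying on polling.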

answered Oct 13 '22 by shashydhar