I am using Structured Streaming to read data from Kafka and create various aggregate metrics. I have enabled the Graphite sink using metrics.properties. I have seen that applications on older Spark versions expose streaming-related metrics, but I don't see those metrics with Structured Streaming. What am I doing wrong?
For example, I am not able to find unprocessed batches, running batches, or the last completed batch's total delay.
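For reference, my metrics.properties wires up the Graphite sink roughly like this (the host and prefix below are placeholders, not my actual values):

*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite.example.com
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=my-streaming-app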
I have enabled streaming metrics by setting:
SparkSession.builder().config("spark.sql.streaming.metricsEnabled", true)
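In case it matters, my understanding is that the same flag can also be set on an already-created session or passed through spark-submit:

// Enable Structured Streaming metrics reporting at runtime
// ("spark" stands for the active SparkSession).
spark.conf.set("spark.sql.streaming.metricsEnabled", "true")
// or at submit time: spark-submit --conf spark.sql.streaming.metricsEnabled=true ...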
Even then, I am getting only 3 metrics. These metrics have gaps between them, and they only start showing up quite late after the application has started. How do I get more extensive streaming-related metrics into Grafana?
I checked StreamingQueryProgress, but it only lets us create custom metrics programmatically. Is there a way to consume the metrics that Spark Streaming already sends to the sink I mentioned?
Spark Streaming introduced the Discretized Stream (DStream) for processing data in motion. Internally, a DStream is a sequence of RDDs: Spark receives real-time data and divides it into smaller batches for the execution engine. In contrast, Structured Streaming is built on the Spark SQL API for data stream processing.
Metrics tell you how much of something exists, such as how much memory a computer system has available or how many centimeters long a desktop is. In the case of Grafana, metrics are most useful when they are recorded repeatedly over time.
Apache Spark Structured Streaming is a near-real-time processing engine that offers end-to-end fault tolerance with exactly-once processing guarantees using familiar Spark APIs. Structured Streaming lets you express computation on streaming data in the same way you would express a batch computation on static data.
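As a minimal sketch of that idea, assuming a local Kafka broker and a topic named "events" (both placeholders), a streaming aggregation looks just like a batch DataFrame query:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("streaming-metrics-demo")
  .config("spark.sql.streaming.metricsEnabled", true)
  .getOrCreate()

// Read from Kafka with the same DataFrame API used for batch sources.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
  .option("subscribe", "events")                        // placeholder topic
  .load()

// A simple aggregate metric: message count per Kafka partition.
val counts = events.groupBy(col("partition")).count()

// Start the query; the handle it returns is what exposes progress information.
val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()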
You can still find some of those metrics. The StreamingQuery handle returned when you start the query has two methods, lastProgress and recentProgress. They expose details such as the number of rows processed, the duration of the batch, and the number of input rows in the batch, among other things. Each progress object also has a json method that returns all of this information in a single go, which can be used for sending it to some metrics collector.
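A rough sketch of reading those out and forwarding a few fields, assuming query is the handle returned by start() above and reportToCollector is a hypothetical stand-in for your own Graphite/Grafana push code:

// Hypothetical helper; replace with your own push to Graphite/Grafana.
def reportToCollector(name: String, value: Double): Unit =
  println(s"$name=$value")

val progress = query.lastProgress // most recent StreamingQueryProgress; null before the first batch completes
if (progress != null) {
  reportToCollector("numInputRows", progress.numInputRows.toDouble)
  reportToCollector("inputRowsPerSecond", progress.inputRowsPerSecond)
  reportToCollector("processedRowsPerSecond", progress.processedRowsPerSecond)
  reportToCollector("triggerExecutionMs", progress.durationMs.get("triggerExecution").doubleValue())

  // The whole progress report is also available as one JSON string.
  println(progress.json)
}

// recentProgress keeps a bounded list of the latest progress updates.
query.recentProgress.foreach(p => println(p.json))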