Is there a way to monitor the input and output throughput of a Spark cluster, to make sure the cluster is not flooded by incoming data?

In my case, I set up a Spark cluster on AWS EC2, so I'm thinking of using AWS CloudWatch to monitor NetworkIn and NetworkOut for each node in the cluster. But this does not seem accurate: network traffic is not only the incoming data for Spark, so other data would be counted too.

Is there a tool or a way to monitor the streaming data status of a Spark cluster specifically? Or is there already a built-in tool in Spark that I missed?

Update: Spark 1.4 has been released, and the monitoring UI on port 4040 is significantly enhanced with graphical displays.


spark streaming throughput monitoring

1 Answer

Spark has a configurable metrics subsystem. By default it publishes a JSON version of the registered metrics on <driver>:<port>/metrics/json. Other metrics sinks, such as Ganglia, CSV files or JMX, can be configured.
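As a rough sketch of how additional sinks are enabled, the fragment below follows the layout of the conf/metrics.properties.template file that ships with Spark. The output directory and the 10-second period are assumptions chosen for illustration, not recommended values.

```properties
# conf/metrics.properties (sketch): enable the CSV and JMX sinks for all instances
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
# Polling period for the CSV sink (assumed value)
*.sink.csv.period=10
*.sink.csv.unit=seconds
# Hypothetical output directory; pick one that exists on every node
*.sink.csv.directory=/tmp/spark-metrics
*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink
```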

You will need an external monitoring system that collects the metrics on a regular basis and helps you make sense of them. (N.B. we use Ganglia, but there are other open source and commercial options.)

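Spark Streaming publishes several metrics that can be used to monitor the performance of your job. To estimate the processing time per record of the last batch, in milliseconds per record, you would combine:

(lastReceivedBatch_processingEndTime-lastReceivedBatch_processingStartTime)/lastReceivedBatch_records

Throughput, in records per millisecond, is the inverse of that ratio.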
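A minimal sketch of that calculation, assuming the three gauge values have already been read from the metrics endpoint. The names BatchTiming and BatchThroughput are hypothetical and not part of Spark. The timestamps are taken to be epoch milliseconds, as in the sample output further down.

```scala
// Hypothetical holder for the three gauge values read from /metrics/json.
final case class BatchTiming(processingStartMs: Long, processingEndMs: Long, records: Long)

object BatchThroughput {
  /** Records processed per second; None when no processing time elapsed. */
  def recordsPerSecond(b: BatchTiming): Option[Double] = {
    val elapsedMs = b.processingEndMs - b.processingStartMs
    if (elapsedMs <= 0) None
    else Some(b.records * 1000.0 / elapsedMs)
  }

  /** Average milliseconds spent per record; None for an empty batch. */
  def millisPerRecord(b: BatchTiming): Option[Double] = {
    val elapsedMs = b.processingEndMs - b.processingStartMs
    if (elapsedMs < 0 || b.records <= 0) None
    else Some(elapsedMs.toDouble / b.records)
  }
}
```

Returning an Option keeps the empty-batch case explicit: a batch with zero records has a throughput of zero but no meaningful per-record time.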

For the full list of supported metrics, have a look at the StreamingSource class in the Spark Streaming source code.

Example: after starting a local REPL with Spark 1.3.1 and executing a trivial streaming application:

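import org.apache.spark.streaming._
// `sc` is the SparkContext provided by the Spark shell
val ssc = new StreamingContext(sc, Seconds(10))
val queue = scala.collection.mutable.Queue(1,2,3,45,6,6,7,18,9,10,11)
// queueStream expects a queue of RDDs, so wrap each element in a one-element RDD
val q = queue.map(elem => sc.parallelize(Seq(elem)))
val dstream = ssc.queueStream(q)
// Print the first elements of each batch, then start the streaming computation
dstream.print()
ssc.start()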

One can then GET localhost:4040/metrics/json, which returns:

{
  "version": "3.0.0",
  "gauges": {
    "local-1430558777965.<driver>.BlockManager.disk.diskSpaceUsed_MB": { "value": 0 },
    "local-1430558777965.<driver>.BlockManager.memory.maxMem_MB": { "value": 2120 },
    "local-1430558777965.<driver>.BlockManager.memory.memUsed_MB": { "value": 0 },
    "local-1430558777965.<driver>.BlockManager.memory.remainingMem_MB": { "value": 2120 },
    "local-1430558777965.<driver>.DAGScheduler.job.activeJobs": { "value": 0 },
    "local-1430558777965.<driver>.DAGScheduler.job.allJobs": { "value": 6 },
    "local-1430558777965.<driver>.DAGScheduler.stage.failedStages": { "value": 0 },
    "local-1430558777965.<driver>.DAGScheduler.stage.runningStages": { "value": 0 },
    "local-1430558777965.<driver>.DAGScheduler.stage.waitingStages": { "value": 0 },
    "local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.lastCompletedBatch_processingDelay": { "value": 44 },
    "local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.lastCompletedBatch_processingEndTime": { "value": 1430559950044 },
    "local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.lastCompletedBatch_processingStartTime": { "value": 1430559950000 },
    "local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.lastCompletedBatch_schedulingDelay": { "value": 0 },
    "local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.lastCompletedBatch_submissionTime": { "value": 1430559950000 },
    "local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.lastCompletedBatch_totalDelay": { "value": 44 },
    "local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime": { "value": 1430559950044 },
    "local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.lastReceivedBatch_processingStartTime": { "value": 1430559950000 },
    "local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.lastReceivedBatch_records": { "value": 0 },
    "local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.lastReceivedBatch_submissionTime": { "value": 1430559950000 },
    "local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.receivers": { "value": 0 },
    "local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.retainedCompletedBatches": { "value": 2 },
    "local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.runningBatches": { "value": 0 },
    "local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.totalCompletedBatches": { "value": 2 },
    "local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.totalProcessedRecords": { "value": 0 },
    "local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.totalReceivedRecords": { "value": 0 },
    "local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.unprocessedBatches": { "value": 0 },
    "local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.waitingBatches": { "value": 0 }
  },
  "counters": { },
  "histograms": { },
  "meters": { },
  "timers": { }
}
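An external collector could read individual gauges from that document. The following is a minimal sketch, assuming the endpoint returns standard JSON with quoted keys and integer gauge values. The object name StreamingGauges and its methods are hypothetical helpers, and the default URL assumes a local driver UI on port 4040. It uses a regular expression rather than a JSON library only to stay dependency-free.

```scala
import java.util.regex.Pattern
import scala.io.Source

object StreamingGauges {
  /** Returns the integer value of the first gauge whose name ends with `suffix`. */
  def gaugeValue(json: String, suffix: String): Option[Long] = {
    // Match a quoted key ending in `suffix`, followed by { "value": <integer>
    val pattern =
      ("\"[^\"]*" + Pattern.quote(suffix) + "\"\\s*:\\s*\\{\\s*\"value\"\\s*:\\s*(-?\\d+)").r
    pattern.findFirstMatchIn(json).map(_.group(1).toLong)
  }

  /** Fetches the metrics document as a string (URL is an assumption). */
  def fetch(url: String = "http://localhost:4040/metrics/json"): String = {
    val src = Source.fromURL(url)
    try src.mkString finally src.close()
  }
}
```

With a streaming job running, `StreamingGauges.gaugeValue(StreamingGauges.fetch(), "lastReceivedBatch_records")` would yield the record count of the most recently received batch. Matching on the suffix avoids hard-coding the application id and application name that prefix every key.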

answered Nov 11 '22 21:11

maasg


Tags:

performance

monitoring

apache-spark

amazon-cloudwatch

spark-streaming
