I have a question about how stages execute in a Spark application. Is the order in which stages execute something the programmer can define, or is it derived by the Spark engine?
In other words, once a Spark action is invoked, a Spark job comes into existence; the job consists of one or more stages, and these stages are further broken down into numerous tasks which the executors work on in parallel. Hence, at any given time, Spark runs multiple tasks in parallel but not multiple jobs.
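As a minimal sketch of that relationship (assuming Spark 2.x or later on a local master; the app name, data, and partition count are illustrative), one action produces one job, the shuffle splits the job into two stages, and each stage runs one task per partition:

    import org.apache.spark.sql.SparkSession

    object JobStagesTasks {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("job-stages-tasks").getOrCreate()
        val sc = spark.sparkContext

        val words  = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), numSlices = 4) // 4 partitions
        val counts = words.map(w => (w, 1)).reduceByKey(_ + _)                        // shuffle -> stage boundary

        // Nothing has run yet: map and reduceByKey are lazy transformations.
        // collect() is an action, so it triggers exactly one job with two stages
        // (the map side before the shuffle, and the reduce side after it),
        // and each stage runs one task per partition, all tasks in parallel.
        println(counts.collect().toSeq)

        spark.stop()
      }
    }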
Spark stages are the physical unit of execution for the computation of multiple tasks. Stages are derived from the Directed Acyclic Graph (DAG) that Spark builds for the data processing and transformations on resilient distributed datasets (RDDs). There are two kinds of stages in Spark: ShuffleMapStage and ResultStage.
So at any given time Spark runs many tasks in parallel, but by default not multiple jobs. WARNING: this does not mean Spark cannot run concurrent jobs. Below we explore how to boost a default Spark application's performance by running multiple jobs (Spark actions) at once.
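Here is a sketch of that idea (the FAIR scheduler setting is optional, and the data is illustrative): actions submitted from separate threads of the same SparkContext become separate jobs that can run concurrently:

    import org.apache.spark.sql.SparkSession
    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    object ConcurrentJobs {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("concurrent-jobs")
          .config("spark.scheduler.mode", "FAIR") // optional: share executors fairly across jobs
          .getOrCreate()
        val sc = spark.sparkContext

        val a = sc.parallelize(1 to 1000000).map(_ * 2)
        val b = sc.parallelize(1 to 1000000).map(_ + 1)

        // Each action submitted from its own thread becomes its own job,
        // so the two jobs below can run at the same time.
        val jobA = Future { a.sum() }
        val jobB = Future { b.count() }

        println(Await.result(jobA, Duration.Inf))
        println(Await.result(jobB, Duration.Inf))
        spark.stop()
      }
    }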
Check the entities (stages, partitions) in this picture:
Do stages in a job (Spark application?) run in parallel in Spark?
Yes, they can be executed in parallel if there is no sequential dependency between them.
Here the partitions of Stage 1 and Stage 2 can be executed in parallel, but not the partitions of Stage 0, because Stage 0 depends on them: the partitions in Stages 1 and 2 have to be processed first.
Is the order in which stages execute something that can be defined by the programmer, or is it derived by the Spark engine?
It is derived by the Spark engine: a stage boundary is placed wherever data has to be shuffled among partitions (see the pink lines in the picture).
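One way to see where the engine places those boundaries (a rough sketch, assuming an existing SparkContext sc and a placeholder input path) is RDD.toDebugString, which prints the lineage with its shuffle boundaries:

    // "input.txt" is just a placeholder path.
    val counts = sc.textFile("input.txt")
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)   // shuffle happens here, so a new stage starts here

    println(counts.toDebugString)
    // Everything before the shuffle is pipelined into one stage; the ShuffledRDD
    // created by reduceByKey shows up at a new indentation level, i.e. the next stage.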
How do stages execute in a Spark job
Stages of a job can run in parallel if there are no dependencies among them.
In Spark, stages are split at shuffle boundaries. You have shuffle stages, which end at a boundary where a shuffling transformation such as reduceByKey splits the computation, and you have result stages, which yield a result without causing a further shuffle, e.g. a map operation:
(Picture provided by Cloudera)
Since groupByKey is a shuffle operation, you can see the split at the pink boxes, which mark a stage boundary.
Internally, a stage is further divided into tasks. For example, in the picture above, the first row, which does textFile -> map -> filter, is pipelined into a single stage, and that stage is split into one task per partition; each task applies all three transformations to its own partition of the data.
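A small sketch of that (assuming an existing SparkContext sc; the path and partition count are placeholders): the task count of a stage follows the partition count of its RDD, not the number of transformations chained inside it:

    val lines    = sc.textFile("input.txt", minPartitions = 8) // placeholder path, roughly 8 partitions
    val filtered = lines.map(_.toUpperCase).filter(_.nonEmpty)  // pipelined: still one stage

    println(filtered.getNumPartitions) // e.g. 8 -> the stage runs 8 tasks, and each task
                                       // applies both map and filter to its own partition
    filtered.count()                   // action: one job, one stage, ~8 parallel tasks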
When one transformation's output is another transformation's input, they have to execute serially. But if stages are unrelated, e.g. the textFile -> map -> filter lineage and the hadoopFile -> groupByKey -> map lineage above, they can run in parallel. Once a dependency between them is declared, execution from that stage onward continues serially.
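A sketch of that situation (assuming an existing SparkContext sc; the paths and key extraction are illustrative): the two lineages below are independent until they are joined, so their stages can be scheduled without waiting on each other, while the stage containing the join must wait for both:

    val left  = sc.textFile("left.txt").map(line => (line.split(",")(0), line))   // lineage A
    val right = sc.textFile("right.txt").map(line => (line.split(",")(0), line))  // lineage B

    val joined = left.join(right) // depends on BOTH lineages -> forces a shuffle stage

    joined.count() // one job: the stage for lineage A and the stage for lineage B have no
                   // dependency on each other; the stage containing the join runs after both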