 

Do stages in an application run in parallel in Spark?

Tags:

apache-spark

I have a question about how stages execute in a Spark application. Is the order in which stages execute something that can be defined by the programmer, or is it derived by the Spark engine?

asked Dec 27 '16 by A srinivas


2 Answers

Check the entities (stages, partitions) in this picture:

[Diagram: a job split into Stage 0, Stage 1, and Stage 2, each with its own partitions; shuffle boundaries shown as pink lines.]

Do stages in a job (Spark application?) run in parallel in Spark?

Yes, they can be executed in parallel if there is no sequential dependency.

Here the Stage 1 and Stage 2 partitions can be executed in parallel, but not the Stage 0 partitions, because of the dependency: the partitions in Stage 1 and Stage 2 have to be processed first.
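
As a rough illustration (a minimal sketch in Scala with made-up data and RDD names, not the actual code behind the picture), a job of that shape can be built from two independent reduceByKey branches followed by a join: the two shuffle-map stages play the role of Stage 1 and Stage 2 and can be scheduled in parallel, while the join stage (Stage 0 in the picture's numbering) has to wait for both.

    import org.apache.spark.sql.SparkSession

    object StageParallelismSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("stage-parallelism-sketch").getOrCreate()
        val sc = spark.sparkContext

        // Two independent lineages: each reduceByKey ends its own shuffle-map stage
        // (like Stage 1 and Stage 2 in the picture), and those stages have no
        // dependency on each other, so the scheduler may run them in parallel.
        val left  = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3))).reduceByKey(_ + _)
        val right = sc.parallelize(Seq(("a", 10), ("b", 20))).reduceByKey(_ + _)

        // The join depends on the shuffle output of both branches, so its stage
        // (like Stage 0 in the picture) cannot start until both have finished.
        val joined = left.join(right)

        joined.collect().foreach(println) // the action that actually triggers the job
        spark.stop()
      }
    }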

Is the order of execution of stages something that can be defined by the programmer, or is it derived by the Spark engine?

A stage boundary is defined where data shuffling happens among partitions (see the pink lines in the picture).
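
The programmer does not set those boundaries explicitly; the engine derives them from the shuffle dependencies. One quick (if rough) way to see where they fall is RDD.toDebugString, which prints the lineage with each shuffle dependency as a separately indented block:

    // Continuing the sketch above: each extra level of indentation in the
    // printed lineage marks a shuffle boundary, i.e. where a new stage is cut.
    println(joined.toDebugString)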

answered Oct 13 '22 by mrsrinivas


How do stages execute in a Spark job

Stages of a job can run in parallel if there are no dependencies among them.

In Spark, stages are split at boundaries. You have shuffle stages, which are boundary stages where transformations such as reduceByKey split the graph, and you have result stages, which yield a result without causing a shuffle, such as a map operation:

[Diagram of Spark stages, provided by Cloudera]

Since groupByKey causes a shuffle, you can see the split into the pink boxes, which marks a stage boundary.

Internally, a stage is further divided into tasks. For example, in the picture above, the first row, which does textFile -> map -> filter, can be split into three tasks, one for each transformation.

When one transformation's output is another transformation's input, we need serial execution. But if stages are unrelated, e.g. the hadoopFile -> groupByKey -> map branch and the textFile -> map -> filter branch, they can run in parallel. Once a dependency between them is declared, from that stage onward they continue executing serially.
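
To make that concrete, here is a minimal sketch of the same shape as the Cloudera picture (the input paths are hypothetical, and this only illustrates the branch structure, not the exact code behind the diagram): the hadoopFile branch and the textFile branch are unrelated until the join, so their stages can be scheduled in parallel, and everything from the join onward is serial with respect to both.

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.TextInputFormat
    import org.apache.spark.sql.SparkSession

    object BranchParallelismSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("branch-parallelism-sketch").getOrCreate()
        val sc = spark.sparkContext

        // Branch A: hadoopFile -> map -> groupByKey -> map.
        // groupByKey needs a shuffle, so it closes branch A's first stage.
        val branchA = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data/a") // hypothetical path
          .map { case (_, line) => (line.toString.take(1), 1) }
          .groupByKey()
          .map { case (key, ones) => (key, ones.size) }

        // Branch B: textFile -> map -> filter. No shuffle, so it stays one stage.
        val branchB = sc.textFile("hdfs:///data/b") // hypothetical path
          .map(line => (line.take(1), 1))
          .filter { case (key, _) => key.nonEmpty }

        // The join is where the two branches become dependent: up to here their
        // stages can run in parallel; from here on execution is serial.
        val combined = branchA.join(branchB)
        println(combined.count()) // action that triggers the job

        spark.stop()
      }
    }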

answered Oct 13 '22 by Yuval Itzchakov