I have a question about how stages execute in a Spark application. Is the order in which stages execute something the programmer can define, or is it derived by the Spark engine?
In other words, once a Spark action is invoked, a Spark job comes into existence; the job consists of one or more stages, and these stages are further broken down into numerous tasks which the executors work on in parallel. Hence, at any given time, Spark runs multiple tasks in parallel but not multiple jobs.
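As a minimal sketch of that relationship (assuming Spark 2.x or later on a local master; the app name, data, and partition count are illustrative), one action produces one job, the shuffle splits the job into two stages, and each stage runs one task per partition:

    import org.apache.spark.sql.SparkSession

    object JobStagesTasks {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("job-stages-tasks").getOrCreate()
        val sc = spark.sparkContext

        val words  = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), numSlices = 4) // 4 partitions
        val counts = words.map(w => (w, 1)).reduceByKey(_ + _)                        // shuffle -> stage boundary

        // Nothing has run yet: map and reduceByKey are lazy transformations.
        // collect() is an action, so it triggers exactly one job with two stages
        // (the map side before the shuffle, and the reduce side after it),
        // and each stage runs one task per partition, all tasks in parallel.
        println(counts.collect().toSeq)

        spark.stop()
      }
    }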
Spark stages are the physical unit of execution for the computation of multiple tasks. Stages are derived from the Directed Acyclic Graph (DAG) that Spark builds for the data processing and transformations on resilient distributed datasets (RDDs). There are two kinds of stages in Spark: ShuffleMapStage and ResultStage.
So at any given time Spark runs many tasks in parallel, but by default not multiple jobs. WARNING: this does not mean Spark cannot run concurrent jobs. Below we explore how to boost a default Spark application's performance by running multiple jobs (Spark actions) at once.
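Here is a sketch of that idea (the FAIR scheduler setting is optional, and the data is illustrative): actions submitted from separate threads of the same SparkContext become separate jobs that can run concurrently:

    import org.apache.spark.sql.SparkSession
    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    object ConcurrentJobs {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("concurrent-jobs")
          .config("spark.scheduler.mode", "FAIR") // optional: share executors fairly across jobs
          .getOrCreate()
        val sc = spark.sparkContext

        val a = sc.parallelize(1 to 1000000).map(_ * 2)
        val b = sc.parallelize(1 to 1000000).map(_ + 1)

        // Each action submitted from its own thread becomes its own job,
        // so the two jobs below can run at the same time.
        val jobA = Future { a.sum() }
        val jobB = Future { b.count() }

        println(Await.result(jobA, Duration.Inf))
        println(Await.result(jobB, Duration.Inf))
        spark.stop()
      }
    }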
Check the entities (stages, partitions) in this picture:
Do stages in a job (Spark application?) run in parallel in Spark?
Yes, they can be executed in parallel if there is no sequential dependency between them.
Here the partitions of Stage 1 and Stage 2 can be executed in parallel, but not the partitions of Stage 0, because Stage 0 depends on them: the partitions in Stages 1 and 2 have to be processed first.
Is the order in which stages execute something that can be defined by the programmer, or is it derived by the Spark engine?
It is derived by the Spark engine: a stage boundary is placed wherever data has to be shuffled among partitions (see the pink lines in the picture).
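One way to see where the engine places those boundaries (a rough sketch, assuming an existing SparkContext sc and a placeholder input path) is RDD.toDebugString, which prints the lineage with its shuffle boundaries:

    // "input.txt" is just a placeholder path.
    val counts = sc.textFile("input.txt")
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)   // shuffle happens here, so a new stage starts here

    println(counts.toDebugString)
    // Everything before the shuffle is pipelined into one stage; the ShuffledRDD
    // created by reduceByKey shows up at a new indentation level, i.e. the next stage.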
How do stages execute in a Spark job
Stages of a job can run in parallel if there are no dependencies among them.
In Spark, stages are split at shuffle boundaries. You have shuffle stages, which end at a boundary where a shuffling transformation such as reduceByKey splits the computation, and you have result stages, which yield a result without causing a further shuffle, e.g. a map operation:
(Picture provided by Cloudera)
Since groupByKey is a shuffle operation, you can see the split at the pink boxes, which mark a stage boundary.
Internally, a stage is further divided into tasks. For example, in the picture above, the first row, which does textFile -> map -> filter, is pipelined into a single stage, and that stage is split into one task per partition; each task applies all three transformations to its own partition of the data.
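A small sketch of that (assuming an existing SparkContext sc; the path and partition count are placeholders): the task count of a stage follows the partition count of its RDD, not the number of transformations chained inside it:

    val lines    = sc.textFile("input.txt", minPartitions = 8) // placeholder path, roughly 8 partitions
    val filtered = lines.map(_.toUpperCase).filter(_.nonEmpty)  // pipelined: still one stage

    println(filtered.getNumPartitions) // e.g. 8 -> the stage runs 8 tasks, and each task
                                       // applies both map and filter to its own partition
    filtered.count()                   // action: one job, one stage, ~8 parallel tasks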
When one transformation's output is another transformation's input, they have to execute serially. But if stages are unrelated, e.g. the textFile -> map -> filter lineage and the hadoopFile -> groupByKey -> map lineage above, they can run in parallel. Once a dependency between them is declared, execution from that stage onward continues serially.
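A sketch of that situation (assuming an existing SparkContext sc; the paths and key extraction are illustrative): the two lineages below are independent until they are joined, so their stages can be scheduled without waiting on each other, while the stage containing the join must wait for both:

    val left  = sc.textFile("left.txt").map(line => (line.split(",")(0), line))   // lineage A
    val right = sc.textFile("right.txt").map(line => (line.split(",")(0), line))  // lineage B

    val joined = left.join(right) // depends on BOTH lineages -> forces a shuffle stage

    joined.count() // one job: the stage for lineage A and the stage for lineage B have no
                   // dependency on each other; the stage containing the join runs after both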