The question is: I have the following DAG:

I thought that Spark divides a job into different stages when shuffling is required. Consider Stage 0 and Stage 1: they perform operations which do not require shuffling, so why does Spark split them into different stages?

I thought the actual moving of data across partitions should happen at Stage 2, because that is where we need to cogroup. But to cogroup we need data from both Stage 0 and Stage 1. So does Spark keep the intermediate results of these stages and then apply them in Stage 2?
DAG visualization: a visual representation of the directed acyclic graph of this job, where vertices represent the RDDs or DataFrames and the edges represent the operations applied to them.
A stage is comprised of tasks based on partitions of the input data. The DAG visualization allows the user to dive into any stage and expand its details; in the stage view, the details of all RDDs belonging to that stage are shown. The scheduler splits the Spark RDD into stages based on the transformations applied.
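If you want to see where the scheduler will cut stages without opening the UI, RDD.toDebugString prints the lineage of an RDD. Below is a minimal sketch with made-up RDDs (not the asker's job); the shuffle introduced by cogroup shows up in the printed lineage as a CoGroupedRDD, which marks the shuffle boundary where a new stage begins.

```scala
import org.apache.spark.sql.SparkSession

object LineageInspectionSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical local session, purely for illustration.
    val spark = SparkSession.builder().appName("lineage-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Two invented pair RDDs with independent, narrow transformations.
    val left  = sc.parallelize(Seq((1, "a"), (2, "b"))).mapValues(_.toUpperCase)
    val right = sc.parallelize(Seq((1, 10), (2, 20))).filter(_._2 > 5)

    // toDebugString prints the RDD lineage; the CoGroupedRDD produced by
    // cogroup is where the scheduler has to insert a shuffle.
    println(left.cogroup(right).toDebugString)

    spark.stop()
  }
}
```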
You should think of a single "stage" as a series of transformations that can be performed on each of the RDD's partitions without having to access data in other partitions. In other words, if I can create an operation T that takes in a single partition and produces a new (single) partition, and apply the same T to each of the RDD's partitions, then T can be executed by a single "stage".
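For example (a minimal sketch, not taken from the question; the RDD and its contents are invented), narrow transformations such as map and filter are exactly this kind of per-partition operation, so Spark pipelines a chain of them into one stage:

```scala
import org.apache.spark.sql.SparkSession

object SingleStageSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical local session just for the sketch.
    val spark = SparkSession.builder().appName("single-stage-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // One RDD split into 4 partitions; the values are made up.
    val nums = sc.parallelize(1 to 1000, numSlices = 4)

    // map and filter each read one partition and emit one partition, so Spark
    // pipelines them together: no shuffle, hence a single stage.
    val result = nums.map(_ * 2).filter(_ % 3 == 0)

    // Only the action triggers execution; the Spark UI shows one stage for this job.
    println(result.count())

    spark.stop()
  }
}
```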
Now, stage 0 and stage 1 operate on two separate RDDs and perform different transformations, so they can't share the same stage. Notice that neither of these stages operates on the output of the other, so they are not "candidates" for being merged into a single stage.

NOTE that this doesn't mean they can't run in parallel: Spark can schedule both stages to run at the same time. In this case, stage 2 (which performs the cogroup) would wait for both stage 0 and stage 1 to complete and produce new partitions, have those partitions shuffled to the right executors, and then operate on the shuffled partitions.
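To make this concrete, here is a hedged sketch of a job with the shape described above (all RDD names and values are invented, and the stage numbering in the comments is illustrative): two independent lineages that become the two shuffle-free stages, and a cogroup that forces a shuffle and therefore a third stage.

```scala
import org.apache.spark.sql.SparkSession

object CogroupStagesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cogroup-stages-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // First independent lineage (the question's "stage 0"):
    // narrow transformations only, so no shuffle is needed here.
    val users = sc
      .parallelize(Seq((1, "alice"), (2, "bob"), (3, "carol")))
      .mapValues(_.toUpperCase)

    // Second independent lineage (the question's "stage 1"):
    // also shuffle-free, and independent of the first, so Spark
    // may schedule both of these stages concurrently.
    val orders = sc
      .parallelize(Seq((1, 9.99), (1, 4.50), (3, 12.00)))
      .filter { case (_, amount) => amount > 5.0 }

    // Final stage (the question's "stage 2"): cogroup needs all records with
    // the same key from BOTH parents on the same executor, so each parent
    // stage writes its output, the data is shuffled, and only then does this
    // stage run on the freshly shuffled partitions.
    val cogrouped = users.cogroup(orders)

    cogrouped.collect().foreach(println)
    spark.stop()
  }
}
```

Nothing is materialized or shuffled until the collect() action runs; up to that point the three stages exist only as a lineage that the scheduler will cut at the cogroup's shuffle dependency.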