The question is: I have the following DAG:

I thought that Spark divides a job into different stages when shuffling is required. Consider Stage 0 and Stage 1: they perform operations which do not require shuffling, so why does Spark split them into different stages?

I thought the actual moving of data across partitions should happen at Stage 2, because that is where we need to cogroup. But to cogroup we need data from both Stage 0 and Stage 1. So does Spark keep the intermediate results of these stages and then apply them in Stage 2?
DAG visualization: a visual representation of the directed acyclic graph of this job, where vertices represent the RDDs or DataFrames and the edges represent the operations applied to them.
A stage is comprised of tasks based on partitions of the input data. The DAG visualization allows the user to dive into any stage and expand its details; in the stage view, the details of all RDDs belonging to that stage are shown. The scheduler splits the Spark RDD into stages based on the transformations applied.
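If you want to see where the scheduler will cut stages without opening the UI, RDD.toDebugString prints the lineage of an RDD. Below is a minimal sketch with made-up RDDs (not the asker's job); the shuffle introduced by cogroup shows up in the printed lineage as a CoGroupedRDD, which marks the shuffle boundary where a new stage begins.

```scala
import org.apache.spark.sql.SparkSession

object LineageInspectionSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical local session, purely for illustration.
    val spark = SparkSession.builder().appName("lineage-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Two invented pair RDDs with independent, narrow transformations.
    val left  = sc.parallelize(Seq((1, "a"), (2, "b"))).mapValues(_.toUpperCase)
    val right = sc.parallelize(Seq((1, 10), (2, 20))).filter(_._2 > 5)

    // toDebugString prints the RDD lineage; the CoGroupedRDD produced by
    // cogroup is where the scheduler has to insert a shuffle.
    println(left.cogroup(right).toDebugString)

    spark.stop()
  }
}
```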
You should think of a single "stage" as a series of transformations that can be performed on each of the RDD's partitions without having to access data in other partitions. In other words, if I can create an operation T that takes in a single partition and produces a new (single) partition, and apply the same T to each of the RDD's partitions, then T can be executed by a single "stage".
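For example (a minimal sketch, not taken from the question; the RDD and its contents are invented), narrow transformations such as map and filter are exactly this kind of per-partition operation, so Spark pipelines a chain of them into one stage:

```scala
import org.apache.spark.sql.SparkSession

object SingleStageSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical local session just for the sketch.
    val spark = SparkSession.builder().appName("single-stage-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // One RDD split into 4 partitions; the values are made up.
    val nums = sc.parallelize(1 to 1000, numSlices = 4)

    // map and filter each read one partition and emit one partition, so Spark
    // pipelines them together: no shuffle, hence a single stage.
    val result = nums.map(_ * 2).filter(_ % 3 == 0)

    // Only the action triggers execution; the Spark UI shows one stage for this job.
    println(result.count())

    spark.stop()
  }
}
```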
Now, stage 0 and stage 1 operate on two separate RDDs and perform different transformations, so they can't share the same stage. Notice that neither of these stages operates on the output of the other, so they are not "candidates" for being merged into a single stage.

NOTE that this doesn't mean they can't run in parallel: Spark can schedule both stages to run at the same time. In this case, stage 2 (which performs the cogroup) would wait for both stage 0 and stage 1 to complete and produce new partitions, have those partitions shuffled to the right executors, and then operate on the shuffled partitions.
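To make this concrete, here is a hedged sketch of a job with the shape described above (all RDD names and values are invented, and the stage numbering in the comments is illustrative): two independent lineages that become the two shuffle-free stages, and a cogroup that forces a shuffle and therefore a third stage.

```scala
import org.apache.spark.sql.SparkSession

object CogroupStagesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cogroup-stages-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // First independent lineage (the question's "stage 0"):
    // narrow transformations only, so no shuffle is needed here.
    val users = sc
      .parallelize(Seq((1, "alice"), (2, "bob"), (3, "carol")))
      .mapValues(_.toUpperCase)

    // Second independent lineage (the question's "stage 1"):
    // also shuffle-free, and independent of the first, so Spark
    // may schedule both of these stages concurrently.
    val orders = sc
      .parallelize(Seq((1, 9.99), (1, 4.50), (3, 12.00)))
      .filter { case (_, amount) => amount > 5.0 }

    // Final stage (the question's "stage 2"): cogroup needs all records with
    // the same key from BOTH parents on the same executor, so each parent
    // stage writes its output, the data is shuffled, and only then does this
    // stage run on the freshly shuffled partitions.
    val cogrouped = users.cogroup(orders)

    cogrouped.collect().foreach(println)
    spark.stop()
  }
}
```

Nothing is materialized or shuffled until the collect() action runs; up to that point the three stages exist only as a lineage that the scheduler will cut at the cogroup's shuffle dependency.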