Understanding DAG in spark

I have the following DAG:

[DAG visualization showing Stage 0, Stage 1 and Stage 2]

I thought that Spark divides a job into different stages when shuffling is required. Consider Stage 0 and Stage 1: these contain operations which do not require shuffling. So why does Spark split them into different stages?

I thought that the actual movement of data across partitions would happen in Stage 2, because that is where the cogroup happens. But to cogroup we need the data from Stage 0 and Stage 1.

So does Spark keep the intermediate results of these stages and then apply them in Stage 2?

asked Aug 31 '17 by St.Antario


1 Answer

You should think of a single "stage" as a series of transformations that can be performed on each of the RDD's partitions without having to access data in other partitions.

In other words, if I can create an operation T that takes in a single partition and produces a new (single) partition, and apply the same T to each of the RDD's partitions, then T can be executed by a single "stage".
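
As a minimal sketch (the RDD names and input path below are made up for illustration), each of these transformations acts on one partition at a time, so Spark pipelines them into a single stage:

    // Minimal sketch, assuming a SparkContext `sc` (e.g. from spark-shell);
    // the RDD names and input path are hypothetical.
    val events = sc.textFile("hdfs:///some/input/path")

    val parsed = events
      .map(_.split(","))               // narrow: each output partition depends on one input partition
      .filter(_.length > 2)            // narrow: still no data movement between partitions
      .map(cols => (cols(0), cols(1))) // narrow: produces a pair RDD, still partition-local

    // None of the above needs a shuffle, so Spark pipelines all three
    // transformations into a single stage, one task per partition.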

Now, stage 0 and stage 1 operate on two separate RDDs and perform different transformations, so they can't share the same stage. Notice that neither of these stages operates on the output of the other - so they are not "candidates" for creating a single stage.

NOTE that this doesn't mean they can't run in parallel: Spark can schedule both stages to run at the same time. In this case, stage 2 (which performs the cogroup) would wait for both stage 0 and stage 1 to complete, produce new partitions, shuffle them to the right executors, and then operate on these new partitions.
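
Here is an illustrative sketch of that stage layout, again assuming a SparkContext `sc`; the two RDDs and their contents are made up. Each input is prepared by its own narrow transformation (two independent parent stages), and the cogroup runs in a final stage once both inputs have been shuffled:

    // Hypothetical example: two unrelated RDDs feeding a cogroup.
    val left  = sc.parallelize(Seq((1, "a"), (2, "b"))).mapValues(_.toUpperCase) // stage 0: narrow work on the first RDD
    val right = sc.parallelize(Seq((1, 10),  (3, 30))).filter(_._2 > 5)          // stage 1: narrow work on the second RDD

    // cogroup needs all values for a given key on the same executor, so both
    // inputs are shuffled; the cogroup itself runs in a final stage that waits
    // for the two parent stages to finish.
    val grouped = left.cogroup(right)

    grouped.collect().foreach(println)
    // The DAG for this job shows two independent parent stages feeding one final stage.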

answered Oct 18 '22 by Tzach Zohar