Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What are Spark RDD graph, lineage graph, DAG of Spark tasks? what are their relations

When we talk about RDD graphs, does it mean lineage graph or DAG (direct acyclic graph) or both? and when is the lineage graph generated? is it generated before the DAG of Spark tasks?

like image 907
Rui Avatar asked Jul 29 '15 19:07

Rui


People also ask

What is lineage graph in spark?

When one derives the new RDD from existing (previous) RDD using transformation, Spark keeps the track of all the dependencies between RDD is called lineage graph. (1) When there is a demand for computing the new RDD.

What is a DAG graph in spark?

What is DAG in Spark? DAG means that Directed Acyclic Graph(No directed cycles). It is a set of Edges and Vertices, where vertices act the RDDs. Edges act as the Operation to be applied on RDD. It is a collection of all transformation and actions. Lineage Graph vs DAG: Lineage Graph is dealing with only RDDs so it is applicable to transformations

How the RDD lineage graph happens in programmatically?

How the RDD lineage graph happens in programmatically: What is DAG in Spark? DAG means that Directed Acyclic Graph (No directed cycles). It is a set of Edges and Vertices, where vertices act the RDDs. Edges act as the Operation to be applied on RDD. It is a collection of all transformation and actions.

What is an RDD lineage in spark?

RDD Lineage is just a portion of a DAG (one or more operations) that lead to the creation of that particular RDD. So, one DAG (one Spark program) might create multiple RDDs, and each RDD will have its lineage (i.e that path in your DAG that lead to that RDD).


1 Answers

An RDD can depend on zero or more other RDDs. For example when you say x = y.map(...), x will depend on y. These dependency relationships can be thought of as a graph.

You can call this graph a lineage graph, as it represents the derivation of each RDD. It is also necessarily a DAG, since a loop is impossible to be present in it.

Narrow dependencies, where a shuffle is not required (think map and filter) can be collapsed into a single stage. Stages are a unit of execution, and they are generated by the DAGScheduler from the graph of RDD dependencies. Stages also depend on each other. The DAGScheduler builds and uses this dependency graph (which is also necessarily a DAG) to schedule the stages.

like image 65
Daniel Darabos Avatar answered Oct 02 '22 08:10

Daniel Darabos