How does the DAG work under the covers in RDD?

Tags:

apache-spark

rdd

directed-acyclic-graphs

The Spark research paper prescribes a new distributed programming model over classic Hadoop MapReduce, claiming simplification and a vast performance boost in many cases, especially for machine learning. However, material that uncovers the internal mechanics of Resilient Distributed Datasets with Directed Acyclic Graphs seems to be lacking in the paper.

Would it be better learned by investigating the source code?

466

asked Sep 14 '14 17:09

sof

2 Answers

I have also been searching the web to learn how Spark computes the DAG from the RDD and subsequently executes the tasks.

At a high level, when any action is called on the RDD, Spark creates the DAG and submits it to the DAG scheduler.

The DAG scheduler divides operators into stages of tasks. A stage is comprised of tasks based on partitions of the input data. The DAG scheduler pipelines operators together; for example, many map operators can be scheduled in a single stage. The final result of the DAG scheduler is a set of stages.
The stages are passed on to the task scheduler. The task scheduler launches tasks via the cluster manager (Spark Standalone/YARN/Mesos). The task scheduler doesn't know about dependencies between the stages.
The worker executes the tasks on the slave.
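The point that nothing runs until an action is called can be sketched in plain Scala, without Spark. This is only a toy model under stated assumptions: `LazyPlan`, `Plan` and `source` are hypothetical names invented for illustration, and the real scheduler does far more than this.

```scala
object LazyPlan {
  // A plan only records how to produce the data; nothing runs until `collect`.
  final class Plan[A](compute: () => Seq[A]) {
    // Transformations: wrap the existing recipe in a new one, without running it.
    def map[B](f: A => B): Plan[B] = new Plan(() => compute().map(f))
    def filter(p: A => Boolean): Plan[A] = new Plan(() => compute().filter(p))

    // Action: evaluates the whole chain of recorded steps.
    def collect(): Seq[A] = compute()
  }

  // Entry point (hypothetical): wraps in-memory data as the root of a plan.
  def source[A](data: => Seq[A]): Plan[A] = new Plan(() => data)
}
```

Calling `LazyPlan.source(Seq(1, 2, 3)).map(_ * 2)` only builds the chain; the function passed to `map` is applied when `collect()` is invoked.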

Now let's look at how Spark builds the DAG.

At a high level, there are two kinds of transformations that can be applied to RDDs: narrow transformations and wide transformations. Wide transformations result in stage boundaries.

Narrow transformation: doesn't require the data to be shuffled across partitions, for example map, filter, etc.

Wide transformation: requires the data to be shuffled, for example reduceByKey, etc.
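To make the distinction concrete, here is a minimal plain-Scala sketch (no Spark required) that models a dataset as a vector of partitions. The names `Partitions`, `narrowMap` and `wideReduceByKey` are hypothetical, and partitioning by key hash is an assumption made for illustration; this is not Spark's implementation.

```scala
object NarrowVsWide {
  // A dataset modeled as a list of partitions (hypothetical toy model).
  type Partitions[A] = Vector[Vector[A]]

  // Narrow: each output partition depends on exactly one input partition,
  // so no record ever leaves its partition.
  def narrowMap[A, B](data: Partitions[A])(f: A => B): Partitions[B] =
    data.map(partition => partition.map(f))

  // Wide: records with the same key may sit in different input partitions,
  // so they are first redistributed (shuffled) by key, then reduced.
  def wideReduceByKey[K, V](data: Partitions[(K, V)], numPartitions: Int)(
      f: (V, V) => V): Partitions[(K, V)] = {
    val shuffled = data.flatten.groupBy { case (k, _) =>
      math.floorMod(k.hashCode, numPartitions) // target partition for this key
    }
    Vector.tabulate(numPartitions) { i =>
      shuffled.getOrElse(i, Vector.empty)
        .groupBy(_._1)
        .map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }
        .toVector
    }
  }
}
```

In this model `narrowMap` never changes which partition a record lives in, while `wideReduceByKey` has to look at every input partition before it can produce any output partition, which is why such an operation forces a stage boundary.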

Let's take an example of counting how many log messages appear at each level of severity.

Following is the log file, in which each line starts with the severity level:

    INFO I'm Info message
    WARN I'm a Warn message
    INFO I'm another Info message

The following Scala code extracts these counts:

    val input = sc.textFile("log.txt")
    val splitedLines = input.map(line => line.split(" "))
                            .map(words => (words(0), 1))
                            .reduceByKey{(a,b) => a + b}
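For readers without a cluster at hand, the same three steps can be tried on an ordinary in-memory `Seq`. This is a sketch of what the job computes, not of how Spark runs it: `LogLevelCount` and `countBySeverity` are hypothetical names, and `groupBy` followed by a per-key sum is only a local stand-in for `reduceByKey`.

```scala
object LogLevelCount {
  // Same three steps as the Spark job, on an in-memory Seq instead of an RDD.
  def countBySeverity(lines: Seq[String]): Map[String, Int] =
    lines
      .map(line => line.split(" "))   // split each line into words
      .map(words => (words(0), 1))    // keep the severity level, paired with 1
      .groupBy(_._1)                  // local stand-in for the shuffle by key
      .map { case (level, pairs) => level -> pairs.map(_._2).sum } // sum per key
}
```

Applied to the three sample log lines above, `countBySeverity` yields `INFO -> 2` and `WARN -> 1`.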

This sequence of commands implicitly defines a DAG of RDD objects (the RDD lineage) that will be used later when an action is called. Each RDD maintains a pointer to one or more parents, along with metadata about the type of relationship it has with each parent. For example, when we call val b = a.map() on an RDD a, the RDD b keeps a reference to its parent a; that reference is the lineage.
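The parent-pointer idea can be sketched with a small plain-Scala data structure. Everything here is a hypothetical toy (`ToyLineage`, `Node`, `derive`, `lineage`), and it assumes a single parent per node for brevity, whereas a real RDD may have several.

```scala
object ToyLineage {
  // The kind of relationship a node has with its parent.
  sealed trait Dependency
  case object Narrow extends Dependency // e.g. map, filter
  case object Wide extends Dependency   // e.g. reduceByKey (needs a shuffle)

  // A node remembers its parent (if any) and the dependency type.
  final case class Node(name: String, parent: Option[(Node, Dependency)] = None) {
    // Creating a new node from this one only stores a pointer back to it.
    def derive(name: String, dep: Dependency): Node = Node(name, Some((this, dep)))
  }

  // Walk the parent pointers from the given node back to the input.
  def lineage(node: Node): List[String] = node.parent match {
    case Some((p, _)) => node.name :: lineage(p)
    case None         => List(node.name)
  }
}
```

Building `Node("textFile").derive("map", Narrow).derive("map", Narrow).derive("reduceByKey", Wide)` and calling `lineage` on the result lists the steps from the newest node back to the input, in the same spirit as the debug output of a real RDD.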

To display the lineage of an RDD, Spark provides a debug method toDebugString(). For example, executing toDebugString() on the splitedLines RDD outputs the following:

    (2) ShuffledRDD[6] at reduceByKey at <console>:25 []
     +-(2) MapPartitionsRDD[5] at map at <console>:24 []
        |  MapPartitionsRDD[4] at map at <console>:23 []
        |  log.txt MapPartitionsRDD[1] at textFile at <console>:21 []
        |  log.txt HadoopRDD[0] at textFile at <console>:21 []

The first line (from the bottom) shows the input RDD. We created this RDD by calling sc.textFile(). Below is a more diagrammatic view of the DAG created from the given RDD.

[Figure: RDD DAG graph (https://i.stack.imgur.com/Lb3pQ.png)]

Once the DAG is built, the Spark scheduler creates a physical execution plan. As mentioned above, the DAG scheduler splits the graph into multiple stages, which are created based on the transformations. The narrow transformations are grouped (pipelined) together into a single stage. So for our example, Spark will create a two-stage execution as follows:

[Figure: Stages (https://i.stack.imgur.com/K4gJU.png)]
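The rule that narrow transformations are pipelined and a wide transformation opens a new stage can be sketched for a simple linear pipeline. This is a toy under stated assumptions: `StageSplit`, `Op` and `stages` are hypothetical names, and the real DAG scheduler works on a graph of dependencies rather than a flat list.

```scala
object StageSplit {
  // One operator in a linear pipeline; `wide` marks operators that need a shuffle.
  final case class Op(name: String, wide: Boolean)

  // Group operators into stages: a wide operator starts a new stage,
  // consecutive narrow operators are pipelined into the current one.
  def stages(ops: List[Op]): List[List[String]] =
    ops.foldLeft(List.empty[List[String]]) {
      case (Nil, op)             => List(List(op.name))          // first operator opens stage 0
      case (done, op) if op.wide => List(op.name) :: done        // shuffle: open a new stage
      case (current :: done, op) => (op.name :: current) :: done // narrow: extend current stage
    }.map(_.reverse).reverse // stages were accumulated newest-first
}
```

For the pipeline in this answer (textFile, map, map, reduceByKey, with only reduceByKey marked as wide), the sketch produces two stages: the first holds textFile and both maps, the second holds reduceByKey.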

The DAG scheduler then submits the stages to the task scheduler. The number of tasks submitted depends on the number of partitions in the text file. For example, if we have 4 partitions, then 4 sets of tasks will be created and submitted in parallel, provided there are enough slaves/cores. The diagram below illustrates this in more detail:

[Figure: Task execution (https://i.stack.imgur.com/GoYQB.png)]
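The relationship between partitions and tasks can be written down as a tiny enumeration. This is a simplified sketch: `TaskCount`, `Task` and `tasksFor` are hypothetical names, and it assumes one task per partition per stage, with the partition count of each stage supplied by the caller.

```scala
object TaskCount {
  // A task is identified here by the stage it belongs to and the partition it processes.
  final case class Task(stage: Int, partition: Int)

  // One task per partition, for every stage.
  def tasksFor(partitionsPerStage: List[Int]): List[Task] =
    for {
      (numPartitions, stage) <- partitionsPerStage.zipWithIndex
      partition <- 0 until numPartitions
    } yield Task(stage, partition)
}
```

With two stages of 4 partitions each, `tasksFor(List(4, 4))` enumerates 8 tasks, 4 per stage; the tasks within one stage are the ones that can run in parallel when enough cores are available.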

For more detailed information, I suggest going through the following YouTube videos, where the Spark creators give in-depth details about the DAG, the execution plan, and lifetime.

Advanced Apache Spark - Sameer Farooqui (Databricks)
A Deeper Understanding of Spark Internals - Aaron Davidson (Databricks)
Introduction to AmpLab Spark Internals

170

answered Oct 02 '22 17:10

Sathish

Beginning with Spark 1.4, visualization has been added through the following three components, which also provide a clear graphical representation of the DAG.

Timeline view of Spark events
Execution DAG
Visualization of Spark Streaming statistics

Refer to link for more information.

answered Oct 02 '22 18:10

Prabhakar Reddy
