Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is ExternalRDDScan in the DAG?

What is the meaning of ExternalRDDScan in the DAG?

The whole internet doesn't have an explanation for it.

enter image description here

like image 718
Alon Avatar asked Oct 01 '19 15:10

Alon


People also ask

What is stages in DAG in Spark?

Vertical sequences in DAGs are known as "stages. Stages are implemented in DAGs using the range() function, and output is using the show() function. Further, it proceeds to submit the operator graph to DAG Scheduler by calling an Action on Spark RDD at a high level.

What are DAG stages?

A stage is comprised of tasks based on partitions of the input data. The DAG scheduler pipelines operators together to optimize the graph. For e.g. Many map operators can be scheduled in a single stage. This optimization is key to Sparks performance. The final result of a DAG scheduler is a set of stages.

What is Exchange in Spark DAG?

Exchanges (aka shuffles) are the operations that happen in-between stages. This is how Spark decomposes a job into stages.


1 Answers

Based on the source, ExternalRDDScan is a representation of converting existing RDD of arbitrary objects to a dataset of InternalRows, i.e. creating a DataFrame. Let's verify that our understanding is correct:

scala> import spark.implicits._
import spark.implicits._

scala> val rdd = sc.parallelize(Array(1, 2, 3, 4, 5))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:26

scala> rdd.toDF().explain()
== Physical Plan ==
*(1) SerializeFromObject [input[0, int, false] AS value#2]
+- Scan ExternalRDDScan[obj#1]
like image 167
mazaneicha Avatar answered Oct 13 '22 21:10

mazaneicha