What is the meaning of ExternalRDDScan in the DAG?
The whole internet doesn't have an explanation for it.
Vertical sequences in DAGs are known as "stages". At a high level, when an action is called on a Spark RDD, Spark submits the operator graph to the DAG Scheduler.
A stage comprises tasks based on partitions of the input data. The DAG scheduler pipelines operators together to optimize the graph; e.g., many map operators can be scheduled in a single stage. This optimization is key to Spark's performance. The final result of the DAG scheduler is a set of stages.
Exchanges (aka shuffles) are the operations that happen in between stages. This is how Spark decomposes a job into stages.
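As a sketch of this, here is a hypothetical example (column names and session setup are mine, not from the question) where a wide transformation forces an Exchange, and therefore a stage boundary, in the physical plan:

```scala
// Sketch: a groupBy is a wide transformation, so Spark must shuffle the data.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("stages").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")

// The physical plan contains an Exchange node (hash partitioning on "key")
// between the partial and final aggregations; each side of that Exchange
// runs as its own stage.
df.groupBy("key").sum("value").explain()
```

The exact plan text varies by Spark version, but the Exchange node is what separates one stage from the next.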
Based on the source, ExternalRDDScan is a representation of converting an existing RDD of arbitrary objects to a dataset of InternalRows, i.e. creating a DataFrame. Let's verify that our understanding is correct:
scala> import spark.implicits._
import spark.implicits._
scala> val rdd = sc.parallelize(Array(1, 2, 3, 4, 5))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:26
scala> rdd.toDF().explain()
== Physical Plan ==
*(1) SerializeFromObject [input[0, int, false] AS value#2]
+- Scan ExternalRDDScan[obj#1]
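For contrast, a sketch (exact plan text may differ across Spark versions): a DataFrame built directly from a local collection, rather than from an existing RDD, does not go through ExternalRDDScan at all, since there is no RDD of arbitrary objects to convert:

```scala
// Local collections become a LocalRelation in the logical plan,
// so the physical plan shows LocalTableScan, not ExternalRDDScan.
scala> Seq(1, 2, 3).toDF().explain()
// == Physical Plan ==
// LocalTableScan [...]
```

So ExternalRDDScan appears specifically when Spark has to bridge from the RDD world (arbitrary JVM objects) into the SQL engine's InternalRow representation.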