I ran the following job in the spark-shell:
val d = sc.parallelize(0 until 1000000).map(i => (i%100000, i)).persist
d.join(d.reduceByKey(_ + _)).collect
The Spark UI shows three stages. Stages 4 and 5 correspond to the computation of d, and stage 6 corresponds to the computation of the collect action. Since d is persisted, I would expect only two stages. However, stage 5 is present but not connected to any other stage.
So I tried running the same computation without persist, and the DAG looks identical, except without the green dots indicating the RDD has been persisted. I would expect the output of stage 11 to be connected to the input of stage 12, but it is not.
Looking at the stage descriptions, the stages seem to indicate that d is being persisted, because stage 5 has input, but I am still confused as to why stage 5 even exists.
Stage Skipped means that the data has been fetched from cache and re-execution of the given stage is not required. Basically, the stage has been evaluated before, and its result is available without re-execution. This is consistent with your DAG, which shows that the next stage requires a shuffle (reduceByKey).
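You can reproduce this with a minimal sketch (the variable names here are illustrative, not from the question): run the same shuffle-producing job twice in the spark-shell and compare the two jobs in the UI. Because the shuffle output of the map-side stage is reused, the second job shows that stage as skipped.
val pairs = sc.parallelize(0 until 1000).map(i => (i % 10, i))
val summed = pairs.reduceByKey(_ + _)
summed.collect() // first job: the map-side stage and the result stage both run
summed.collect() // second job: the map-side shuffle stage is shown as skipped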
When you click on a job on the summary page, you see the details page for that job, which shows the event timeline, the DAG visualization, and all stages of the job.
A stage in Spark represents a segment of the DAG computation that can be completed locally. A stage breaks on an operation that requires a shuffle of data, which is why you'll see it named by that operation in the Spark UI.
In the Spark directed acyclic graph (DAG), every edge points from an earlier operation to a later one; when an action is called, the DAG that has been built up is submitted to the DAG scheduler, which splits the graph into stages of tasks. The Spark DAG is a strict generalization of the MapReduce model.
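To see where those stage boundaries fall without the UI, here is a small sketch (illustrative names, not from the question) that prints an RDD's lineage; in the output of toDebugString, a new indentation level marks a shuffle dependency, i.e. a stage boundary:
// map is a narrow transformation and is chained into the same stage;
// reduceByKey is a wide transformation, forcing a shuffle and a new stage
val rdd = sc.parallelize(0 until 100)
  .map(i => (i % 10, i))
  .reduceByKey(_ + _)
println(rdd.toDebugString)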
The input RDD is cached and the cached part is not recomputed.
This can be validated with a simple test:
import org.apache.spark.SparkContext

def f(sc: SparkContext) = {
  // Counts how many times the map function is actually executed.
  val counter = sc.longAccumulator("counter")
  val rdd = sc.parallelize(0 until 100).map(i => {
    counter.add(1L)
    (i % 10, i)
  }).persist

  // The cached RDD is used twice, but each element is computed only once.
  rdd.join(rdd.reduceByKey(_ + _)).foreach(_ => ())
  counter.value
}

assert(f(spark.sparkContext) == 100)
Caching doesn't remove stages from the DAG.
If data is cached, the corresponding stages can be marked as skipped, but they are still part of the DAG. Lineage can be truncated with checkpoints, but that is not the same thing, and it doesn't remove stages from the visualization.
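For comparison, here is a hedged sketch of checkpointing (the checkpoint directory is a placeholder; adjust it for your environment), which, unlike caching, does truncate the lineage:
// Checkpointing writes the RDD to reliable storage and cuts off its lineage.
sc.setCheckpointDir("/tmp/spark-checkpoints") // placeholder path
val base = sc.parallelize(0 until 1000).map(i => (i % 10, i))
base.checkpoint()            // mark for checkpointing; happens at the next action
base.count()                 // action that materializes the checkpoint
println(base.toDebugString)  // the lineage now starts from the checkpointed data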
Input stages contain more than cached computations.
Spark stages group together operations which can be chained without performing a shuffle.
While part of the input stage is cached, it doesn't cover all the operations required to prepare the shuffle files. This is why you don't see skipped tasks.
The rest (the detachment) is just a limitation of the graph visualization.
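If you want to check this outside the visualization, one hedged way (assuming d from the question is still defined and has already been materialized by an action) is to print the lineage of the joined RDD; the cached map output appears once, while the shuffle dependencies for both branches are still listed on top of it:
// Both branches (the join input and the reduceByKey input) read the same cached d,
// but each still performs its own shuffle write on top of the cached data.
val joined = d.join(d.reduceByKey(_ + _))
println(joined.toDebugString)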
If you repartition the data first:
import org.apache.spark.HashPartitioner
val d = sc.parallelize(0 until 1000000)
.map(i => (i%100000, i))
.partitionBy(new HashPartitioner(20))
d.join(d.reduceByKey(_ + _)).collect
you'll get the DAG you're most likely looking for: the shuffle happens once up front in partitionBy, and since reduceByKey and join can reuse that partitioning, the stages are connected in the visualization.