In the Spark Web UI, there are two DAG visualizations, one for the job and one for the stage, as explained here. The blog post explains the green dots in the job's DAG, but it says nothing about the green-shaded boxes in the stage's DAG. Could someone please give a hint?
Update: If that also means the code indicated is where data is cached, what can we do to improve the performance?
A green dot in the DAG visualization apparently means that the referenced RDD is cached.
A skipped stage means that its output is already available, for example from a cached RDD or from shuffle files written by an earlier job, so the stage does not need to be re-executed. In other words, the stage has been evaluated before and its result can be reused. This is consistent with your DAG, which shows that the next stage requires a shuffle (reduceByKey).
A DAG, or Directed Acyclic Graph, is a set of vertices and edges in which the vertices represent Resilient Distributed Datasets (RDDs) and the edges represent the operations applied to those RDDs.
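To make that definition concrete, here is a minimal sketch in plain Python. It is not the Spark API: the `Node` class and the `lineage` helper are hypothetical names invented for illustration, and the three-step pipeline (`lines`, `pairs`, `counts`) is an assumed example, not the asker's actual job.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Toy stand-in for an RDD: one vertex in the lineage graph."""
    name: str
    op: Optional[str] = None  # operation (edge label) that produced this node
    parents: List["Node"] = field(default_factory=list)


def lineage(node: Node) -> List[str]:
    """Return the edges leading to `node`, parents first."""
    edges: List[str] = []
    seen = set()

    def visit(n: Node) -> None:
        if id(n) in seen:
            return
        seen.add(id(n))
        for p in n.parents:
            visit(p)
            edges.append(f"{p.name} --{n.op}--> {n.name}")

    visit(node)
    return edges


# Assumed example pipeline: each variable is a vertex, each op an edge.
lines = Node("lines")
pairs = Node("pairs", op="map", parents=[lines])
counts = Node("counts", op="reduceByKey", parents=[pairs])

for edge in lineage(counts):
    print(edge)
```

Each printed line is one edge of the graph: the operation that turns a parent dataset into its child. In real Spark the UI draws this same kind of graph for you, grouped into stages.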
The link you provided mentions this:
"Second, one of the RDDs is cached in the first stage (denoted by the green highlight)"
So the green boxes mark RDDs that are being cached, which means later references to those RDDs won't have to recompute them from scratch.
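The effect of caching can be sketched with a toy model in plain Python. This is only an illustration of the idea, not Spark itself: `ToyRDD` and its `evaluations` counter are hypothetical, and the method names `map`, `cache`, and `collect` are borrowed from Spark just to keep the analogy readable.

```python
class ToyRDD:
    """Toy model of a lazily evaluated dataset; NOT the Spark API."""

    def __init__(self, compute, parent=None):
        self._compute = compute    # function that builds this dataset
        self._parent = parent      # upstream dataset, if any
        self._use_cache = False
        self._cached = None
        self.evaluations = 0       # how many times this dataset was computed

    def map(self, fn):
        # Nothing runs yet; we only record how to derive the child dataset.
        return ToyRDD(lambda data: [fn(x) for x in data], parent=self)

    def cache(self):
        self._use_cache = True
        return self

    def collect(self):
        # A cached dataset that was already computed is returned as-is.
        if self._use_cache and self._cached is not None:
            return self._cached
        self.evaluations += 1
        source = self._parent.collect() if self._parent is not None else None
        result = self._compute(source)
        if self._use_cache:
            self._cached = result
        return result


base = ToyRDD(lambda _: [1, 2, 3])
squared = base.map(lambda x: x * x).cache()   # the "green" dataset

plus_one = squared.map(lambda x: x + 1)
doubled = squared.map(lambda x: x * 2)

print(plus_one.collect())    # first action computes and caches `squared`
print(doubled.collect())     # second action reuses the cached result
print(squared.evaluations)   # 1
```

Without the `cache()` call, `squared` (and everything upstream of it) would be computed once per action, so the counter would read 2 instead of 1.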