What do green-shaded boxes in Spark DAG Visualization mean?

In the Spark Web UI, there are two DAG visualizations, one for the Job (screenshot) and the other for the Stage (screenshot), as explained here. The blog post explains the green dots in the Job's DAG, but it says nothing about the green-shaded boxes in the Stage's DAG. Could someone please give a hint?

Update: If that also means the code indicated is where data is cached, what can we do to improve the performance?

FuzzY asked Jul 04 '17 18:07



1 Answer

It is mentioned in the link you provided that

Second, one of the RDDs is cached in the first stage (denoted by the green highlight)

So the green boxes indicate that the corresponding RDDs are cached, and future references to those RDDs won't have to regenerate them from scratch.
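To make "won't have to be generated from scratch" concrete, here is a minimal toy sketch in plain Python. It is not Spark's API or implementation: `ToyRDD`, `compute_calls`, and the other names are hypothetical, invented only for illustration. The sketch assumes a lazily evaluated dataset that is rebuilt on every action unless it has been marked as cached, which is roughly the behaviour that calling `cache()` or `persist()` on a real RDD is meant to avoid.

```python
from typing import Callable, List, Optional


class ToyRDD:
    """Toy stand-in for a lazily evaluated dataset (NOT Spark's RDD class)."""

    def __init__(self, compute: Callable[[], List[int]]):
        self._compute = compute
        self._cached: Optional[List[int]] = None
        self._use_cache = False
        self.compute_calls = 0  # how many times the data was rebuilt from scratch

    def cache(self) -> "ToyRDD":
        # Only marks the dataset as cacheable; nothing is computed yet.
        self._use_cache = True
        return self

    def map(self, f: Callable[[int], int]) -> "ToyRDD":
        # Lazy transformation: the parent is evaluated only when collect() runs.
        return ToyRDD(lambda: [f(x) for x in self.collect()])

    def collect(self) -> List[int]:
        # Action: reuse the stored result if there is one, otherwise recompute.
        if self._use_cache and self._cached is not None:
            return self._cached
        self.compute_calls += 1
        data = self._compute()
        if self._use_cache:
            self._cached = data
        return data


source = ToyRDD(lambda: list(range(5)))

# Without caching, every action recomputes the dataset.
squares = source.map(lambda x: x * x)
squares.collect()
squares.collect()
uncached_calls = squares.compute_calls

# With caching, the second action reuses the stored result.
cached_squares = source.map(lambda x: x * x).cache()
first = cached_squares.collect()
second = cached_squares.collect()
cached_calls = cached_squares.compute_calls

print(uncached_calls, cached_calls)
```

In this toy model the uncached dataset is recomputed on each `collect()`, while the cached one is computed once and then reused. The green box in the Stage DAG marks the RDD that plays the role of `cached_squares` here.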

Ramesh Maharjan answered Sep 22 '22 14:09