 

What do the blue blocks in spark stage DAG visualisation UI mean?

Tags:

apache-spark

In the following screenshot from the application UI, what do the blue blocks in each stage represent?

What do "Exchange" and "WholeStageCodeGen", etc mean?

Where can I find a resource to interpret what Spark is doing here?

Many thanks

What are the blue blocks? What do their names represent?

ThatDataGuy asked Nov 14 '16

People also ask

What is WholeStageCodegen in Spark DAG?

Whole-Stage Code Generation (aka WholeStageCodegen or WholeStageCodegenExec) fuses multiple operators (as a subtree of plans that support codegen) together into a single Java function that is aimed at improving execution performance.

Where is the DAG in Spark UI?

When you click on a job on the summary page, you see the details page for that job. The details page further shows the event timeline, DAG visualization, and all stages of the job. When you click on a specific job, you can see the detailed information of this job.

What are skipped stages in Spark?

Skipped stages are cached stages marked in grey, where computation values are stored in memory and not recomputed after accessing HDFS. A glance at the DAG visualization is enough to know if RDD computations are repeatedly performed or cached stages are used.


1 Answer

Each blue box represents a step in the physical execution plan of an Apache Spark job.

Regarding WholeStageCodegen, which you asked about:

Whole-Stage Code Generation (aka WholeStageCodegen or WholeStageCodegenExec) fuses multiple operators (as a subtree of plans that support codegen) together into a single Java function that is aimed at improving execution performance. It collapses a query into a single optimized function that eliminates virtual function calls and leverages CPU registers for intermediate data.
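The idea behind that fusion can be illustrated with a small, self-contained sketch in plain Python (this is not Spark's actual generated code, just an analogy): instead of each operator handing rows to the next one call-by-call, the whole chain is collapsed into a single loop.

```python
# Conceptual sketch only: whole-stage codegen fuses a chain of operators
# into one tight loop instead of each operator calling the next per row.

def unfused(rows):
    # Operator-at-a-time: each step is a separate iterator, so every row
    # travels through a chain of next() calls (analogous to the virtual
    # function calls Spark eliminates).
    filtered = (r for r in rows if r % 2 == 0)
    projected = (r * 10 for r in filtered)
    return list(projected)

def fused(rows):
    # "Generated" single function: filter and projection collapsed into
    # one loop body; intermediate values stay in local variables
    # (the CPU-register analogy from the quote above).
    out = []
    for r in rows:
        if r % 2 == 0:
            out.append(r * 10)
    return out

print(fused(range(6)))  # [0, 20, 40], same result as unfused(range(6))
```

Both functions compute the same result; the fused version simply avoids the per-row indirection, which is where the performance gain comes from.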

You can find more details in SPARK-12795.

Exchange indicates a shuffle, i.e. data being redistributed across partitions between stages. In more detail:

ShuffleExchange is a unary physical operator. It corresponds to Repartition (with shuffle enabled) and RepartitionByExpression logical operators (as translated in BasicOperators strategy).
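What an exchange does can be sketched in plain Python (a conceptual illustration, not Spark's implementation): rows are hash-partitioned by key so that all rows sharing a key land in the same partition, which is what a downstream groupBy or join needs.

```python
# Conceptual sketch of a shuffle exchange: redistribute (key, value) rows
# so that all rows with the same key end up in the same partition.

def hash_partition(rows, num_partitions):
    """Assign each (key, value) row to a partition by hashing its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in rows:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
parts = hash_partition(rows, 2)
# Every row with key "a" is now in a single partition, so a later
# per-key aggregation needs no further data movement.
```

Because this repartitioning moves data between executors over the network, exchanges are usually the expensive boundaries you see between stages in the DAG visualization.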

You can get all of this information in your own code by calling explain on a DataFrame.

Each step shows what your DataFrame computation is going to do, which is useful for checking that your logic is right. If you want more details about the Spark UI, I suggest watching this Spark Summit presentation and reading this article about execution planning.

This information should answer your question in much more depth.

Thiago Baldim answered Oct 16 '22