 

How can I obtain the DAG of an Apache Spark job without running it?

I have some Scala code that I can run with Spark using spark-submit. From what I understand, Spark creates a DAG in order to schedule the operations.

Is there a way to retrieve this DAG without actually performing the heavy operations, e.g. just by analyzing the code?

I would like a useful representation such as a data structure or at least a written representation, not the DAG visualization.

Quetzakol asked Sep 16 '17


1 Answer

If you are using DataFrames (Spark SQL), you can call df.explain(true) to print the query plans and all operations, both before and after optimization.
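A minimal sketch of what that might look like, assuming a local SparkSession and an illustrative in-memory DataFrame (neither comes from the original question):

import org.apache.spark.sql.SparkSession

object ExplainPlanExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ExplainPlanExample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Illustrative in-memory data; substitute your own source.
    val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")
    val transformed = df.filter($"id" > 1).groupBy($"value").count()

    // Prints the parsed, analyzed and optimized logical plans plus the
    // physical plan. explain is not an action, so no job is executed.
    transformed.explain(true)

    // The plans are also available programmatically as tree data structures.
    println(transformed.queryExecution.optimizedPlan.treeString)

    spark.stop()
  }
}

Running this prints the plan stages and exits without launching any Spark job, since explain and queryExecution only inspect the lazily built plan.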

If you are using RDDs, you can call rdd.toDebugString to get a string representation of the lineage, and rdd.dependencies to walk the dependency tree itself.
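A similar sketch for the RDD API, again with an illustrative local pipeline rather than anything from the original question:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object RddLineageExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RddLineageExample")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Illustrative pipeline: only transformations, no action is triggered.
    val rdd = sc.parallelize(1 to 100)
      .map(_ * 2)
      .filter(_ % 3 == 0)
      .groupBy(_ % 10)

    // Human-readable description of the lineage (the RDD DAG).
    println(rdd.toDebugString)

    // Walk the dependency tree itself as a data structure.
    def printDeps(r: RDD[_], indent: Int = 0): Unit = {
      println(" " * indent + r)
      r.dependencies.foreach(dep => printDeps(dep.rdd, indent + 2))
    }
    printDeps(rdd)

    spark.stop()
  }
}

Because no action (collect, count, save, etc.) is called, Spark only records the lineage; toDebugString and dependencies expose that recorded DAG.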

As long as you call these before triggering an action, you get a representation of what is going to happen without actually doing the heavy lifting.

Assaf Mendelson answered Oct 06 '22