 

How can I obtain the DAG of an Apache Spark job without running it?

I have some Scala code that I can run with Spark using spark-submit. From what I understand, Spark creates a DAG in order to schedule the operations.

Is there a way to retrieve this DAG without actually performing the heavy operations, e.g. just by analyzing the code?

I would like a useful representation such as a data structure or at least a written representation, not the DAG visualization.

Quetzakol asked Sep 16 '17


1 Answer

If you are using DataFrames (Spark SQL), you can call df.explain(true) to print the query plans and all operations, both before and after optimization.
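A minimal sketch of what that might look like, assuming a local SparkSession and an illustrative in-memory DataFrame (neither comes from the original question):

import org.apache.spark.sql.SparkSession

object ExplainPlanExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ExplainPlanExample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Illustrative in-memory data; substitute your own source.
    val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")
    val transformed = df.filter($"id" > 1).groupBy($"value").count()

    // Prints the parsed, analyzed and optimized logical plans plus the
    // physical plan. explain is not an action, so no job is executed.
    transformed.explain(true)

    // The plans are also available programmatically as tree data structures.
    println(transformed.queryExecution.optimizedPlan.treeString)

    spark.stop()
  }
}

Running this prints the plan stages and exits without launching any Spark job, since explain and queryExecution only inspect the lazily built plan.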

If you are using RDDs, you can call rdd.toDebugString to get a string representation of the lineage, and rdd.dependencies to walk the dependency tree itself.
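A similar sketch for the RDD API, again with an illustrative local pipeline rather than anything from the original question:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object RddLineageExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RddLineageExample")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Illustrative pipeline: only transformations, no action is triggered.
    val rdd = sc.parallelize(1 to 100)
      .map(_ * 2)
      .filter(_ % 3 == 0)
      .groupBy(_ % 10)

    // Human-readable description of the lineage (the RDD DAG).
    println(rdd.toDebugString)

    // Walk the dependency tree itself as a data structure.
    def printDeps(r: RDD[_], indent: Int = 0): Unit = {
      println(" " * indent + r)
      r.dependencies.foreach(dep => printDeps(dep.rdd, indent + 2))
    }
    printDeps(rdd)

    spark.stop()
  }
}

Because no action (collect, count, save, etc.) is called, Spark only records the lineage; toDebugString and dependencies expose that recorded DAG.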

As long as you call these before triggering an action, you get a representation of what is going to happen without actually doing the heavy lifting.

Assaf Mendelson answered Oct 06 '22