
Spark and profiling or execution plan

Is there any tool in Spark that helps to understand how the code is interpreted and executed, such as a profiling tool or the details of an execution plan, to help optimize the code?

For instance, I have seen that it is better to partition two DataFrames on the join key before joining them to avoid an extra shuffle. How can we figure that out?
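A rough sketch of the pattern I mean (the tables and column names below are just placeholders I made up):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("join-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical DataFrames; "id" is the join key.
    val orders    = Seq((1, "book"), (2, "pen")).toDF("id", "item")
    val customers = Seq((1, "Alice"), (2, "Bob")).toDF("id", "name")

    // Pre-partition both sides on the join key before joining,
    // so that the join itself does not have to shuffle again.
    val ordersByKey    = orders.repartition(col("id"))
    val customersByKey = customers.repartition(col("id"))

    val joined = ordersByKey.join(customersByKey, "id")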

asked Apr 02 '17 by GPif

People also ask

What is profiling in Spark?

The Profiling tool analyzes both CPU- and GPU-generated event logs and produces information that can be used for debugging and profiling Apache Spark applications. The output includes the Spark version, executor details, properties, etc.

Why should you examine your execution plan in Spark?

You can use it to see what execution plan Spark will use for your query without actually running it. Spark also provides the Spark UI, where you can view the execution plan and other details while the job is running.
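For example, a minimal sketch (made-up data, local SparkSession) that prints the physical plan of a query without executing it:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("plan-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")

    // Prints the physical plan only; no job is actually run.
    df.groupBy("key").sum("value").explain()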

What is Spark plan?

SparkPlan is a recursive data structure in Spark SQL's Catalyst tree-manipulation framework. It represents both a single physical operator in a physical execution plan and the physical execution plan itself, i.e. a tree of physical operators for a structured query.
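As a sketch (made-up data again), this SparkPlan tree can also be inspected programmatically through queryExecution:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("sparkplan-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")

    // executedPlan is the tree of physical operators that will actually run,
    // i.e. the same tree that explain() prints.
    val physicalPlan = df.groupBy("key").count().queryExecution.executedPlan
    println(physicalPlan.treeString)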

What is logical execution plan in Spark?

In layman's terms, a logical plan is a tree that represents both the schema and the data. These trees are manipulated and optimized by the Catalyst framework. The logical plan is divided into three parts: the parsed (unresolved) logical plan, the analyzed (resolved) logical plan, and the optimized logical plan.
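A minimal sketch, assuming a spark-shell session (so spark and its implicits are already in scope), that prints all of these stages using the extended mode of explain:

    import spark.implicits._

    val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")

    // true = extended: prints the parsed, analyzed and optimized logical plans,
    // followed by the physical plan.
    df.groupBy("key").sum("value").explain(true)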


1 Answer

As Pushkr said, with DataFrames and Datasets we can use the .explain() method to display the plan derivation, the partitioning and any eventual shuffle.
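For example, here is a sketch with made-up tables (auto-broadcast is disabled so the shuffle-based join plan shows up even on tiny data); in the output, Exchange nodes mark the shuffles:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("explain-join").master("local[*]").getOrCreate()
    import spark.implicits._
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    val left  = Seq((1, "a"), (2, "b")).toDF("id", "l")
    val right = Seq((1, "x"), (2, "y")).toDF("id", "r")

    // Plain join: the plan shows an Exchange (shuffle) on each side of the join.
    left.join(right, "id").explain()

    // Pre-partitioned join: both sides are already hash-partitioned on the key,
    // so the join itself adds no further Exchange nodes.
    left.repartition(col("id")).join(right.repartition(col("id")), "id").explain()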

With RDDs we can use toDebugString for roughly the same result. There is also the dependencies method, which indicates whether the new RDD derives from the previous one through a narrow or a wide dependency.
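A small sketch on a toy RDD (names made up) showing both:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("rdd-lineage").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val words  = sc.parallelize(Seq("a", "b", "a", "c"))
    val pairs  = words.map(w => (w, 1))       // narrow dependency
    val counts = pairs.reduceByKey(_ + _)     // wide (shuffle) dependency

    // Lineage as an indented string; a new indentation level marks a shuffle boundary.
    println(counts.toDebugString)

    // Dependency types of the last RDD on its parent(s),
    // e.g. ShuffleDependency for reduceByKey.
    counts.dependencies.foreach(d => println(d.getClass.getSimpleName))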

answered Nov 25 '22 by GPif