Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Capturing the result of explain() in pyspark

In pyspark, running:

sdf = sqlContext.sql("""SELECT * FROM t1 JOIN t2 on t1.c1 = t2.c1 """)

and then:

sdf.explain(extended=True)

it prints the logical and physical plans of the query execution.

My question is: How can I capture the output in a variable, instead of printing it?

v = sdf.explain(extended=True) naturally, does not work

like image 893
Borislav Aymaliev Avatar asked Jan 10 '19 08:01

Borislav Aymaliev


People also ask

What is explain () in PySpark?

The EXPLAIN statement is used to provide logical/physical plans for an input statement. By default, this clause provides information about a physical plan only.

How do you use the Collect function in PySpark?

After creating the Dataframe, for retrieving all the data from the dataframe we have used the collect() action by writing df. collect(), this will return the Array of row type, in the below output shows the schema of the dataframe and the actual created Dataframe.

How do you explain Spark?

Spark is an open source framework focused on interactive query, machine learning, and real-time workloads. It does not have its own storage system, but runs analytics on other storage systems like HDFS, or other popular stores like Amazon Redshift, Amazon S3, Couchbase, Cassandra, and others.

How do you read the Spark execution plan?

An execution plan is the set of operations executed to translate a query language statement (SQL, Spark SQL, Dataframe operations, etc.) to a set of optimized logical and physical operations.


1 Answers

If you take a look at the source code of explain (version 2.4 or older), you see that :

def explain(self, extended=False):
    if extended:
        print(self._jdf.queryExecution().toString())
    else:
        print(self._jdf.queryExecution().simpleString())

Therefore, if you want to retrieve the explain plan directly, just use the method _jdf.queryExecution() on your dataframe :

v = sdf._jdf.queryExecution().toString()  # or .simpleString()

From 3.0, the code is :

print(
    self._sc._jvm.PythonSQLUtils.explainString(self._jdf.queryExecution(), explain_mode)
)

Removing the print, you get the explain as a string.

like image 119
Steven Avatar answered Sep 29 '22 00:09

Steven