In pyspark, running:
sdf = sqlContext.sql("""SELECT * FROM t1 JOIN t2 on t1.c1 = t2.c1 """)
and then:
sdf.explain(extended=True)
it prints the logical and physical plans of the query execution.
My question is: How can I capture the output in a variable, instead of printing it?
v = sdf.explain(extended=True)
naturally, does not work
The EXPLAIN statement is used to provide logical/physical plans for an input statement. By default, this clause provides information about a physical plan only.
After creating the Dataframe, for retrieving all the data from the dataframe we have used the collect() action by writing df. collect(), this will return the Array of row type, in the below output shows the schema of the dataframe and the actual created Dataframe.
Spark is an open source framework focused on interactive query, machine learning, and real-time workloads. It does not have its own storage system, but runs analytics on other storage systems like HDFS, or other popular stores like Amazon Redshift, Amazon S3, Couchbase, Cassandra, and others.
An execution plan is the set of operations executed to translate a query language statement (SQL, Spark SQL, Dataframe operations, etc.) to a set of optimized logical and physical operations.
If you take a look at the source code of explain
(version 2.4 or older), you see that :
def explain(self, extended=False):
if extended:
print(self._jdf.queryExecution().toString())
else:
print(self._jdf.queryExecution().simpleString())
Therefore, if you want to retrieve the explain plan directly, just use the method _jdf.queryExecution()
on your dataframe :
v = sdf._jdf.queryExecution().toString() # or .simpleString()
From 3.0, the code is :
print(
self._sc._jvm.PythonSQLUtils.explainString(self._jdf.queryExecution(), explain_mode)
)
Removing the print, you get the explain
as a string.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With