In PySpark, running <code>sdf = sqlContext.sql("""SELECT * FROM t1 JOIN t2 on t1.c1 = t2.c1 """)</code> and then <code>sdf.explain(extended=True)</code> prints the logical and physical plans of the query execution. My question is: how can I capture that output in a variable instead of printing it? <code>v = sdf.explain(extended=True)</code>, naturally, does not work.

If you take a look at the source code of <code>explain</code> (version 2.4 or older), you see that it only prints a string and returns nothing:
<pre class="prettyprint lang-py prettyprint-override"><code>def explain(self, extended=False):
    if extended:
        print(self._jdf.queryExecution().toString())
    else:
        print(self._jdf.queryExecution().simpleString())
</code></pre>
Therefore, if you want to retrieve the explain plan directly, call <code>_jdf.queryExecution()</code> on your dataframe yourself (note that <code>_jdf</code> is a private attribute):
<pre class="prettyprint lang-py prettyprint-override"><code>v = sdf._jdf.queryExecution().toString()  # or .simpleString()
</code></pre>
<hr>
From 3.0, the code is:
<pre class="prettyprint lang-py prettyprint-override"><code>print(
    self._sc._jvm.PythonSQLUtils.explainString(self._jdf.queryExecution(), explain_mode)
)
</code></pre>
Evaluating the same expression without the <code>print</code> (with your dataframe in place of <code>self</code>) gives you the <code>explain</code> output as a string.

Capturing the result of explain() in pyspark



answered Sep 29 '22 00:09

Steven


Tags:

apache-spark

pyspark

Borislav Aymaliev
