Spark v2.4:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master('local[15]') \
    .appName('Notebook') \
    .config('spark.sql.debug.maxToStringFields', 2000) \
    .config('spark.sql.maxPlanStringLength', 2000) \
    .config('spark.debug.maxToStringFields', 2000) \
    .getOrCreate()

df = spark.createDataFrame(spark.range(1000).rdd.map(lambda x: list(range(100))))
df.repartition(1).write.mode('overwrite').parquet('test.parquet')
df = spark.read.parquet('test.parquet')
df.select('*').explain()
== Physical Plan ==
ReadSchema: struct<_1:bigint,_2:bigint,_3:bigint,_4:bigint,_5:bigint,_6:bigint,_7:bigint,_8:bigint,_9:bigint,...
Note: spark.debug.maxToStringFields helped a bit by expanding FileScan parquet [_1#302L,_2#303L,... 76 more fields], but not the schema part.
Note 2: I am not only interested in the ReadSchema, but also in PartitionFilters, PushedFilters, ... which are all truncated.
Spark 3.0 introduced explain('formatted'), which lays out the information differently and applies no truncation.
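A minimal sketch of the Spark 3.0+ call, using the df from the question (the exact plan text varies by query and version):

# Spark 3.0+ only: 'formatted' prints a compact operator tree followed by one
# detail block per operator, with ReadSchema and PushedFilters shown in full.
df.select('*').explain('formatted')

# equivalent keyword form
df.select('*').explain(mode='formatted')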
Run the explain command with a true argument to see both the logical and physical plans; without it, only the physical plan is shown. The physical plan describes the operators that will actually be executed as RDD operations, and the optimized logical plan is where Spark applies its own optimizations.
explain(mode="simple") displays only the physical plan; explain(mode="extended") displays both the logical and physical plans (equivalent to explain(True)); explain(mode="codegen") displays the generated Java code that will be executed.
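As a sketch, again on the df from the question (the mode argument requires Spark 3.0+, while explain(True) also works on 2.x):

# Spark 2.x and later: prints parsed, analyzed and optimized logical plans
# plus the physical plan.
df.select('*').explain(True)

# Spark 3.0+ keyword forms
df.select('*').explain(mode='simple')    # physical plan only
df.select('*').explain(mode='extended')  # logical + physical plans
df.select('*').explain(mode='codegen')   # generated Java code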
To get the schema of a Spark DataFrame, call printSchema() on the DataFrame object: printSchema() prints the schema to the console (stdout), while show() displays the content of the DataFrame.
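For example, on the parquet DataFrame from the question, the schema can be printed in full, without the 100-character truncation seen in the plan output:

# prints the full schema tree to stdout
df.printSchema()

# the schema is also available programmatically as a StructType
print(df.schema.simpleString())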
The logical plan just depicts what I expect as output after applying a series of transformations like join, filter, where, groupBy, etc. on a particular table. The physical plan is responsible for deciding the type of join, the order in which the filter, where, and groupBy clauses are executed, and so on.
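As an illustration (operator names vary between Spark versions), a small aggregation query on the test data makes the difference visible: the optimized logical plan still contains generic Aggregate and Filter nodes, while the physical plan commits to concrete operators such as HashAggregate, Exchange and a FileScan with PushedFilters.

from pyspark.sql import functions as F

q = (spark.read.parquet('test.parquet')
         .where(F.col('_1') > 10)
         .groupBy('_2')
         .count())

# extended output: compare the == Optimized Logical Plan == and
# == Physical Plan == sections
q.explain(True)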
I am afraid there is no easy way. In https://github.com/apache/spark/blob/v2.4.2/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L57 each metadata value is hard-coded to be abbreviated to no more than 100 characters:
override def simpleString: String = {
  val metadataEntries = metadata.toSeq.sorted.map {
    case (key, value) =>
      key + ": " + StringUtils.abbreviate(redact(value), 100)
  }
  // ...
In the end I have been using
import org.apache.spark.sql.execution.FileSourceScanExec

// Rebuild the metadata of a file scan node without the 100-character
// abbreviation, keeping only the entries we care about.
def full_file_meta(f: FileSourceScanExec) = {
  val metadataEntries = f.metadata.toSeq.sorted.flatMap {
    case (key, value) if Set(
      "Location", "PartitionCount",
      "PartitionFilters", "PushedFilters"
    ).contains(key) =>
      Some(key + ": " + value.toString)
    case other => None
  }

  val metadataStr = metadataEntries.mkString("[\n ", ",\n ", "\n]")
  s"${f.nodeNamePrefix}${f.nodeName}$metadataStr"
}

// `data` is the DataFrame whose plan we want to inspect
val ep = data.queryExecution.executedPlan

// collect the untruncated metadata of every file scan node in the plan
print(ep.flatMap {
  case f: FileSourceScanExec => full_file_meta(f) :: Nil
  case other => Nil
}.mkString(",\n"))
It is a hack, but better than nothing.