 

Efficiently reading a nested Parquet column in Spark

I have the following (simplified) schema:

root
 |-- event: struct (nullable = true)
 |    |-- spent: struct (nullable = true)
 |    |    |-- amount: decimal(34,3) (nullable = true)
 |    |    |-- currency: string (nullable = true)
 |    |
 |    | ... ~ 20 other struct fields on "event" level

I'm trying to sum over the nested field:

spark.sql("select sum(event.spent.amount) from event")

According to the Spark metrics, I'm reading 18 GB from disk and it takes 2.5 min.

However, when I select the top-level field:

 spark.sql("select sum(amount) from event")

I read only 2 GB in 4 seconds.

From the physical plan I can see that, in the nested case, the whole event struct with all its fields is read from Parquet, which is wasteful.

The Parquet format should be able to provide the desired column from a nested structure without reading all of it (that is the point of a columnar store). Is there some way to do this efficiently in Spark?
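
For reference, inspecting the plan with explain() shows the FileScan's ReadSchema, i.e. the columns actually requested from Parquet:

spark.sql("select sum(event.spent.amount) from event").explain()
// Without nested-schema pruning, ReadSchema lists the full event struct;
// ideally it would show only struct<event:struct<spent:struct<amount:decimal(34,3)>>>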

asked Aug 02 '19 by Tomas Bartalos


People also ask

Is Parquet good for nested data?

There is a negative consequence to keeping nested structures in Parquet: Spark's predicate pushdown doesn't work properly on nested fields. So even if you are working with only a few fields of your Parquet dataset, Spark may load and materialize the entire nested structure.
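
One common mitigation, sketched below (the extra column name and output path are hypothetical, and df stands for the DataFrame holding the events), is to flatten hot nested leaves into top-level columns when writing, since top-level columns prune and push down cleanly:

import org.apache.spark.sql.functions.col

// Materialize the nested leaf as a top-level column at write time
df.withColumn("spent_amount", col("event.spent.amount"))
  .write.parquet("<flattened-path>")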

How do I read a specific column in a parquet file?

Use spark.read.parquet("fs://path/file.parquet").select(...). This will read only the corresponding columns. Indeed, Parquet is a columnar storage format, and it is meant exactly for this type of use case.
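
Spelled out as runnable Scala (the path comes from the snippet above; selecting the nested leaf keeps the scan to that column, subject to the nested-pruning caveats discussed here):

import org.apache.spark.sql.functions.col

// Project only the nested leaf; the columnar reader fetches just this column
val amounts = spark.read.parquet("fs://path/file.parquet")
  .select(col("event.spent.amount"))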

How does Parquet store nested data?

To encode nested columns, Parquet uses the Dremel encoding with definition and repetition levels. Definition levels specify how many optional fields in the column's path are defined. Repetition levels specify at which repeated field in the path the value has repeated.
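
For example, with the schema above, the path event.spent.amount has three optional fields, so its definition level ranges from 0 to 3: level 3 means amount is present, level 2 means event.spent exists but amount is null, and level 0 means event itself is null. Since nothing in this path repeats, the repetition level is always 0.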


1 Answer

Solution:

spark.sql("set spark.sql.optimizer.nestedSchemaPruning.enabled=true")
spark.sql("select sum(amount) from (select event.spent.amount as amount from event_archive)")

The query must be written in a sub-select fashion; you can't wrap the selected column in an aggregate function directly. The following query will break schema pruning:

select sum(event.spent.amount) as amount from event

The overall schema-pruning work is covered in SPARK-4502.
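
The same pattern also works via the DataFrame API; a minimal sketch, assuming the data lives at a hypothetical <path> (the conf key and functions are standard Spark APIs):

import org.apache.spark.sql.functions.{col, sum}

spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")
// Project the nested leaf first, then aggregate, mirroring the sub-select
val amounts = spark.read.parquet("<path>")
  .select(col("event.spent.amount").as("amount"))
amounts.agg(sum(col("amount"))).show()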

A dirty workaround can also be specifying a "projected schema" at load time:

import org.apache.spark.sql.types._

// decimal(34,3) matches the amount field's type in the file schema above
val amountType = DataTypes.createDecimalType(34, 3)
val schema = StructType(
  StructField("event", StructType(
    StructField("spent", StructType(
      StructField("amount", amountType, true) :: Nil
    ), true) :: Nil
  ), true) :: Nil
)
val df = spark.read.format("parquet").schema(schema).load("<path>")
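
With the projected schema in place, only event.spent.amount exists in the read schema, so the aggregate materializes just that one column (a usage sketch):

import org.apache.spark.sql.functions.{col, sum}

df.agg(sum(col("event.spent.amount"))).show()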
answered Oct 21 '22 by Tomas Bartalos