I am having a lot of trouble finding the answer to this question. Let's say I write a DataFrame to Parquet, and I use repartition combined with partitionBy to get a nicely partitioned Parquet file. See below:
df.repartition(col("DATE")).write.partitionBy("DATE").parquet("/path/to/parquet/file")
Now later on I would like to read the parquet file so I do something like this:
val df = spark.read.parquet("/path/to/parquet/file")
Is the DataFrame partitioned by "DATE"? In other words, if a Parquet file is partitioned, does Spark maintain that partitioning when reading it into a DataFrame, or is it randomly partitioned?
An explanation of why or why not would be helpful as well.
As a reminder, a partitioned Parquet "file" is really a directory tree: when we say "Parquet file", we are actually referring to multiple physical files, one set per partition value. This directory structure makes it easy to add new data every day, but it only works well when you do time-based analysis.
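For example, the write in the question would produce a layout roughly like this (the dates shown are just placeholder values):

/path/to/parquet/file/
    _SUCCESS
    DATE=2021-01-01/
        part-00000-....snappy.parquet
    DATE=2021-01-02/
        part-00000-....snappy.parquet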
You can always check how many partitions a DataFrame or RDD currently has with getNumPartitions. Keep in mind that if an RDD has too many partitions, task scheduling may take more time than the actual execution.
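A quick way to inspect this after reading (df here is the DataFrame read in the question):

val df = spark.read.parquet("/path/to/parquet/file")
// number of in-memory partitions Spark chose for the read
println(df.rdd.getNumPartitions)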
Thanks for the guidance, but when Spark reads a data file it must surely store the data it has read somewhere. So where does it store this data? And if it doesn't store it, what actually happens when the file is read?
The number of partitions in Spark should be chosen thoughtfully, based on the cluster configuration and the requirements of the application. Increasing the number of partitions will make each partition hold less data, or possibly no data at all.
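A small sketch of that effect (the partition count of 1000 is just an example value):

val rowsPerPartition = df.repartition(1000)
  .rdd
  .mapPartitions(iter => Iterator(iter.size))
  .collect()
// if df has fewer rows than requested partitions, many of these counts will be 0
println(rowsPerPartition.count(_ == 0) + " empty partitions")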
The number of partitions acquired when reading data stored as Parquet follows many of the same rules as reading partitioned text: it is driven by settings such as spark.sql.files.maxPartitionBytes and the session's default parallelism, not by how the data was partitioned on disk.
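A sketch of how you might observe this (the 16 MB value is just an example; smaller splits generally yield more read partitions):

// lower the maximum bytes per partition before reading
spark.conf.set("spark.sql.files.maxPartitionBytes", "16777216")
val dfSmallSplits = spark.read.parquet("/path/to/parquet/file")
println(dfSmallSplits.rdd.getNumPartitions)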
Note that it is rare for a partitioned Parquet file to have full data locality for a partition. Even when the partition count in the data matches the read partition count, there is a strong likelihood that the dataset should be repartitioned in memory if you are trying to achieve partition data locality for performance.
Given your use case above, I'd recommend immediately repartitioning on the "DATE" column if you're planning to leverage partition-local operations on that basis. The above caveats regarding minPartitions and parallelism settings apply here as well.
import org.apache.spark.sql.functions.col

val df = spark.read.parquet("/path/to/parquet/file")
// repartition returns a new DataFrame; it does not modify df in place
val partitionedDf = df.repartition(col("DATE"))
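If you want to confirm that the repartition actually co-located each date, one way is to count how many partitions each date ends up in (spark_partition_id and countDistinct come from org.apache.spark.sql.functions; the output column name is just illustrative):

import org.apache.spark.sql.functions.{countDistinct, spark_partition_id}

partitionedDf
  .groupBy(col("DATE"))
  .agg(countDistinct(spark_partition_id()).as("partitions_per_date"))
  .show()
// after repartitioning by DATE, every date should live in exactly one partition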