
Does Spark maintain parquet partitioning on read?

I am having a lot of trouble finding the answer to this question. Let's say I write a dataframe to parquet, and I use repartition combined with partitionBy to get a nicely partitioned parquet file. See below:

df.repartition(col("DATE")).write.partitionBy("DATE").parquet("/path/to/parquet/file")

Now later on I would like to read the parquet file so I do something like this:

val df = spark.read.parquet("/path/to/parquet/file")

Is the dataframe partitioned by "DATE"? In other words, if a parquet file is partitioned, does Spark maintain that partitioning when reading it back into a Spark dataframe, or is it randomly partitioned?

An explanation of why or why not would be helpful as well.

Asked Jun 12 '18 by Adam

People also ask

Are Parquet files partitioned?

As a reminder, a partitioned Parquet dataset is not a single file. When we say "Parquet file", we are actually referring to a directory of multiple physical files, with each sub-directory holding the files for one partition value. This directory structure makes it easy to add new data (for example, a new date folder every day), but it only pays off when your queries filter on the partition column, such as in time-based analysis.
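For illustration, a write like the one in the question (path and DATE values are hypothetical) produces roughly this layout on disk:

import org.apache.spark.sql.functions.col

df.repartition(col("DATE")).write.partitionBy("DATE").parquet("/path/to/parquet/file")

// Resulting directory layout (illustrative):
// /path/to/parquet/file/DATE=2018-06-11/part-00000-....snappy.parquet
// /path/to/parquet/file/DATE=2018-06-12/part-00001-....snappy.parquet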

How many partitions we get when we create Spark Dataframe by reading Parquet file stored in HDFS location?

Calling getNumPartitions on the underlying RDD shows how many partitions Spark created for the data it read (see the sketch below). Task scheduling may take more time than the actual execution time if an RDD has too many partitions.
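A minimal check, assuming a hypothetical path:

val df = spark.read.parquet("/path/to/parquet/file")
println(df.rdd.getNumPartitions) // number of partitions Spark chose for the scan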

What happens when Spark reads a file?

Reading a file with spark.read is lazy: Spark only inspects the metadata (for Parquet, the file footers and the directory structure) and plans how to split the input. The actual row data is loaded into executor memory, partition by partition, only when an action such as count or write is triggered, as sketched below.
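A small sketch of that lazy behaviour (path hypothetical):

val df = spark.read.parquet("/path/to/parquet/file") // lazy: only metadata is touched here
val total = df.count()                               // action: the files are actually scanned now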

How the number of partitions is decided by Spark when a file is read?

The number of partitions in Spark should be chosen thoughtfully based on the cluster configuration and the requirements of the application. Increasing the number of partitions makes each partition hold less data, or possibly no data at all.
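A sketch of the main settings that influence the partition count for a file-based read (values illustrative, path hypothetical):

spark.conf.set("spark.sql.files.maxPartitionBytes", "134217728") // ~128 MB per input split
spark.conf.set("spark.sql.shuffle.partitions", "200")            // partition count after a shuffle
val df = spark.read.parquet("/path/to/parquet/file")
println(df.rdd.getNumPartitions)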


1 Answer

The number of partitions acquired when reading data stored as parquet follows many of the same rules as reading partitioned text (see the sketch after this list):

  1. If SparkContext.defaultMinPartitions >= the partition count in the data, SparkContext.defaultMinPartitions will be returned.
  2. If the partition count in the data >= SparkContext.defaultParallelism, SparkContext.defaultParallelism will be returned, though in some very small partition cases, #3 may be true instead.
  3. Finally, if the partition count in the data is somewhere between SparkContext.defaultMinPartitions and SparkContext.defaultParallelism, generally you'll see the data's partition count reflected in the dataset partitioning.
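A rough way to check those thresholds against what a read actually produces (path hypothetical; the property names follow the SparkContext API):

println(spark.sparkContext.defaultMinPartitions) // lower bound referenced in #1
println(spark.sparkContext.defaultParallelism)   // threshold referenced in #2
val df = spark.read.parquet("/path/to/parquet/file")
println(df.rdd.getNumPartitions)                 // what the parquet read actually produced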

Note that it's rare for a partitioned parquet file to have full data locality for a partition. Even when the partition count in the data matches the read partition count, rows belonging to one partition value are likely spread across several read partitions, so there is a strong likelihood that the dataset should be repartitioned in memory if you're trying to achieve partition data locality for performance.

Given your use case above, I'd recommend immediately repartitioning on the "DATE" column if you're planning to leverage partition-local operations on that basis. The above caveats regarding the defaultMinPartitions and defaultParallelism settings apply here as well.

import org.apache.spark.sql.functions.col

val df = spark.read.parquet("/path/to/parquet/file")
val dfByDate = df.repartition(col("DATE")) // repartition returns a new DataFrame; it does not modify df in place
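As an optional check, the repartition shows up as an exchange on DATE in the physical plan:

dfByDate.explain() // look for "Exchange hashpartitioning(DATE, ...)" in the printed plan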
Answered Oct 05 '22 by bsplosion