Spark DataFrames with Parquet and Partitioning

Question

I have not been able to find much information on this topic, but let's say we use a DataFrame to read in a Parquet file that is 10 blocks; Spark will naturally create 10 partitions. But when the DataFrame reads in the file to process it, won't it be processing a large data-to-partition ratio? If it were processing the file uncompressed, the data would have been much larger, meaning it would span more blocks and therefore more partitions.

So, to clarify with compressed Parquet (these numbers are not fully accurate): a 1 GB Parquet file = 5 blocks = 5 partitions, which might decompress to 5 GB, making it 25 blocks/25 partitions. But unless you repartition the 1 GB Parquet file, you will be stuck with just 5 partitions when optimally it would be 25 partitions? Or is my logic wrong?

Would it make sense to repartition to increase speed? Or am I thinking about this wrong? Can anyone shed some light on this?

Assumptions:

1 block = 1 partition for Spark
1 core operates on 1 partition

kostya · Accepted Answer

A Spark DataFrame doesn't load Parquet files into memory. It uses the Hadoop/HDFS API to read the data during each operation. So the optimal number of partitions depends on the HDFS block size (which is different from the Parquet block size!).

In Spark 1.5, a DataFrame partitions a Parquet file as follows:

1 partition per HDFS block
If the HDFS block size is smaller than the Parquet block size configured in Spark, one partition is created for multiple HDFS blocks, such that the total size of the partition is no less than the Parquet block size
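
The two rules above can be sketched as a small back-of-the-envelope calculation. This is only a rough model of the behaviour described in this answer, not Spark's actual implementation: the function name `estimate_partitions` and its parameters are hypothetical, the sizes are illustrative, and the last group of blocks is allowed to be smaller than the Parquet block size for simplicity.

```python
MB = 1024 * 1024


def ceil_div(a, b):
    """Integer ceiling division, avoiding float rounding."""
    return -(-a // b)


def estimate_partitions(file_size, hdfs_block_size, parquet_block_size):
    """Estimate the partition count for one Parquet file (hypothetical helper).

    All sizes are in bytes and refer to the file as stored on HDFS,
    i.e. the compressed size, not the decompressed size.
    """
    if file_size <= 0:
        return 0
    hdfs_blocks = ceil_div(file_size, hdfs_block_size)
    # Rule 1: one partition per HDFS block.
    if hdfs_block_size >= parquet_block_size:
        return hdfs_blocks
    # Rule 2: group several HDFS blocks so that a partition is at least
    # as large as the Parquet block size (assumption: the trailing
    # group may end up smaller).
    blocks_per_partition = ceil_div(parquet_block_size, hdfs_block_size)
    return ceil_div(hdfs_blocks, blocks_per_partition)


# A file stored as 10 HDFS blocks of 128 MB, Parquet block size 128 MB.
print(estimate_partitions(1280 * MB, 128 * MB, 128 * MB))
# The same idea with 64 MB HDFS blocks and a 128 MB Parquet block size.
print(estimate_partitions(1024 * MB, 64 * MB, 128 * MB))
```

Note that only the on-disk size enters the calculation, which is why the decompressed size from the question does not change the number of partitions in this model.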

Tags:

apache-spark

apache-spark-sql

parquet

theMadKing
