I have 160 GB of data, partitioned on a DATE column and stored in Parquet file format, running on Spark 1.6.0.
I need to write the output Parquet files with equally sized files in each partition, with a fixed size, say 100 MB each.
I tried the following code:
val blockSize = 1024 * 1024 * 100
sc.hadoopConfiguration.setInt("dfs.blocksize", blockSize)
sc.hadoopConfiguration.setInt("parquet.block.size", blockSize)
df1.write.partitionBy("DATE").parquet("output_file_path")
The above configuration does not work: it creates multiple files based on the default number of partitions, not 100 MB files.
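The number of output files per partition is driven by the number of Spark tasks, not by the block-size settings. A common workaround (a sketch, not from the original post) is to repartition the DataFrame so that each task writes roughly one 100 MB file; the 160 GB figure comes from the question, and the actual on-disk sizes will still vary with compression and data skew:

```scala
// Sketch: control file size indirectly through the partition count.
// Assumption: ~160 GB of data and a ~100 MB target per file; compression
// and skew mean this is an approximation, not a guarantee.
val totalSizeBytes  = 160L * 1024 * 1024 * 1024
val targetFileBytes = 100L * 1024 * 1024
val numPartitions   = math.ceil(totalSizeBytes.toDouble / targetFileBytes).toInt

df1.repartition(numPartitions, df1("DATE"))
   .write
   .partitionBy("DATE")
   .parquet("output_file_path")
```

Repartitioning by the DATE column keeps rows for the same date together, so each output directory receives data from as few tasks as possible.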
Coalesce hints allow Spark SQL users to control the number of output files, just like coalesce, repartition, and repartitionByRange in the Dataset API; they can be used for performance tuning and for reducing the number of output files. The "COALESCE" hint takes only a partition number as a parameter.
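Note that SQL hints like COALESCE were introduced in later Spark versions (2.4+), so they are not available on the Spark 1.6.0 mentioned in the question. A minimal sketch, assuming a newer Spark and a registered temp view named trades (both assumptions, not from the original post):

```scala
// Sketch (assumes Spark 2.4+ and a temp view "trades"):
// the COALESCE hint reduces the query's output to 10 partitions,
// so the subsequent write produces at most 10 files per directory.
val coalesced = spark.sql("SELECT /*+ COALESCE(10) */ * FROM trades")
coalesced.write.parquet("output_file_path")
```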
It's not possible to get exactly the same size for every file, but you can give Spark enough hints to keep them "within" a certain size. The general goal is to make each file equal to the HDFS block size, with each file holding one (or more) row group, and each row group fitting in one HDFS block. If a row group does not fit in one block, additional network calls need to be made to read another HDFS block in order to read the row group completely.
To achieve this, set the Parquet block (row group) size when writing:
tradesDF.write
  .option("parquet.block.size", (256 * 1024 * 1024).toString)
  .parquet("output_file_path")
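After writing, you can verify the resulting row-group sizes with the parquet-tools CLI (an assumption: parquet-tools must be installed separately, and the file path below is hypothetical):

```shell
# Inspect row-group metadata of one output file; the path is illustrative.
parquet-tools meta output_file_path/DATE=2020-01-01/part-00000.parquet
```

The "meta" output lists each row group with its compressed and uncompressed sizes, which tells you whether the row groups actually land near the configured block size.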