
Save the parquet output file with fixed size in spark

I have 160 GB of data, partitioned on the DATE column and stored in Parquet format, running on Spark 1.6.0. I need to write the output Parquet files in each partition as equally sized files with a fixed size, say 100 MB each.

I tried the following code:

val blockSize = 1024 * 1024 * 100  // 100 MB
sc.hadoopConfiguration.setInt("dfs.blocksize", blockSize)
sc.hadoopConfiguration.setInt("parquet.block.size", blockSize)

df1.write.partitionBy("DATE").parquet("output_file_path")

The above configuration is not working: it creates multiple files based on the default number of partitions, not 100 MB files.

asked Apr 13 '18 by warner
People also ask

How do I control the number of output files in Spark?

Coalesce hints allow Spark SQL users to control the number of output files, just like coalesce, repartition and repartitionByRange in the Dataset API; they can be used for performance tuning and for reducing the number of output files. The COALESCE hint only takes a partition number as a parameter.
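For illustration, here is a hedged sketch of that hint, assuming Spark 2.4+ (where the COALESCE/REPARTITION SQL hints are available); the "trades" table and the output paths are hypothetical:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("coalesce-hint-sketch").getOrCreate()

// SQL hint: collapse the result to 10 partitions, and hence roughly 10 output files.
val hinted = spark.sql("SELECT /*+ COALESCE(10) */ * FROM trades")
hinted.write.parquet("coalesced_output_path")

// Equivalent Dataset API call:
spark.table("trades").coalesce(10).write.parquet("coalesced_output_path_2")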


1 Answer

It's not possible to get exactly the same size for every file, but you can give Spark enough hints to keep them "within" a certain size. The general goal is to make each file equal to the HDFS block size, with each file holding one (or more) row group. You want each row group to fit in one HDFS block. If a row group does not fit in one block, additional network calls need to be made to read another HDFS block in order to read the row group completely.

To achieve this, do the following:

  • Set spark.sql.files.maxPartitionBytes in the Spark conf to 256 MB (equal to your HDFS block size).
  • Set parquet.block.size in the Parquet writer options in Spark to 256 MB.

tradesDF.write.option("parquet.block.size", 256 * 1024 * 1024).parquet("output_file_path")
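Putting the two hints together, a minimal end-to-end sketch, assuming Spark 2.x (SparkSession and spark.sql.files.maxPartitionBytes were introduced in 2.0, so this does not apply as-is to the question's 1.6); the input path is hypothetical, while df1, DATE and the output path come from the question:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("fixed-size-parquet-sketch").getOrCreate()

val blockSize = 256 * 1024 * 1024  // 256 MB, matching the assumed HDFS block size

// Hint 1: cap how many bytes Spark packs into each partition when reading.
spark.conf.set("spark.sql.files.maxPartitionBytes", blockSize.toString)

// df1 stands in for the question's DataFrame; the input path is hypothetical.
val df1 = spark.read.parquet("input_file_path")

// Hint 2: ask the Parquet writer to target row groups of roughly one HDFS block.
df1.write
  .option("parquet.block.size", blockSize.toString)
  .partitionBy("DATE")
  .parquet("output_file_path")

This will not make every file exactly 256 MB, but each output file should end up close to one HDFS block, with its row group(s) fitting inside that block.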

answered Sep 21 '22 by IceMan