Spark parquet partitioning : Large number of files

Tags:

apache-spark

rdd

spark-dataframe

bigdata

apache-spark-2.0

I am trying to leverage Spark partitioning. I was trying to do something like:

data.write.partitionBy("key").parquet("/location")

The issue here is that each partition ends up with a huge number of parquet files, which results in slow reads when I read from the root directory.

To avoid that I tried

data.coalesce(numPart).write.partitionBy("key").parquet("/location")

However, this creates numPart parquet files in each partition. My partitions differ in size, so ideally I would like a separate coalesce per partition. This doesn't look easy, though: I would need to visit every partition, coalesce it to a certain number of files, and store it at a separate location.

How should I use partitioning to avoid many files after write?

349

asked Jun 28 '17 16:06

Avishek Bhattacharya

2 Answers

First, I would really avoid using coalesce, as it is often pushed further up the chain of transformations and may destroy the parallelism of your job (I asked about this issue here: Coalesce reduces parallelism of entire stage (spark)).

Writing 1 file per parquet partition is relatively easy (see Spark dataframe write method writing many small files):

data.repartition($"key").write.partitionBy("key").parquet("/location")

If you want to set an arbitrary number of files (or files which have all the same size), you need to further repartition your data using another attribute which could be used (I cannot tell you what this might be in your case):

data.repartition($"key",$"another_key").write.partitionBy("key").parquet("/location")

another_key could be another attribute of your dataset, or a derived attribute using some modulo or rounding-operations on existing attributes. You could even use window-functions with row_number over key and then round this by something like

data.repartition($"key",floor($"row_number"/N)*N).write.partitionBy("key").parquet("/location")

This would put roughly N records into each parquet file.
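The bucketing idea above can be sketched end to end as follows. This is a minimal sketch rather than the answer's exact code: the sample data, the `id` ordering column, the `bucket` column name and the temp output directory are all assumptions, and the bucket is computed as floor((row_number - 1) / N) so that the first bucket also holds N rows.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, floor, row_number}

object BucketByRowNumber {
  // Tags each row with a per-key bucket id so that n consecutive rows
  // (ordered by the assumed column `orderCol`) share one bucket.
  def withBucket(df: DataFrame, n: Int, orderCol: String): DataFrame = {
    val perKey = Window.partitionBy("key").orderBy(orderCol)
    df.withColumn("row_number", row_number().over(perKey))
      .withColumn("bucket", floor((col("row_number") - 1) / n))
      .drop("row_number")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]").appName("bucket-by-row-number").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data: 1000 rows spread over 3 keys.
    val data = (1 to 1000).map(i => (s"k${i % 3}", i)).toDF("key", "id")
    val out = Files.createTempDirectory("bucketed").resolve("location").toString

    withBucket(data, n = 100, orderCol = "id")
      .repartition($"key", $"bucket") // rows of one (key, bucket) pair go to one task
      .drop("bucket")                 // helper column is not needed in the output
      .write.partitionBy("key").parquet(out)

    spark.stop()
  }
}
```

Because `repartition` hash-partitions the rows into a fixed number of tasks, two buckets of the same key can land in the same task and be written as one file, so N is better read as a target than as a hard limit.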

Using orderBy

You can also control the number of files without repartitioning by ordering your dataframe accordingly:

data.orderBy($"key").write.partitionBy("key").parquet("/location")

This will lead to a total of (at least, but not much more than) spark.sql.shuffle.partitions files across all partitions (by default 200). It's even beneficial to add a second ordering column after $"key", as parquet will remember the ordering of the dataframe and will write the statistics accordingly. For example, you can order by an ID:

data.orderBy($"key",$"id").write.partitionBy("key").parquet("/location")

This will not change the number of files, but it will improve the performance when you query your parquet file for a given key and id. See e.g. https://www.slideshare.net/RyanBlue3/parquet-performance-tuning-the-missing-guide and https://db-blog.web.cern.ch/blog/luca-canali/2017-06-diving-spark-and-parquet-workloads-example
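To see the effect of the sort on reads, a minimal sketch could write the sorted data and then query it by `key` and `id`. The sample data, the value chosen for `spark.sql.shuffle.partitions` and the temp output path are assumptions for illustration, and the exact plan text printed by `explain()` varies between Spark versions.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object SortedWrite {
  // Writes `data` sorted by (key, id), then reads back the rows for one key and id.
  def writeAndLookup(spark: SparkSession, data: DataFrame, out: String): DataFrame = {
    import spark.implicits._
    data.orderBy($"key", $"id")
      .write.partitionBy("key").parquet(out)

    // The key predicate prunes directories; the id predicate can be pushed
    // down to parquet, where sorted data gives tight min/max statistics.
    spark.read.parquet(out).filter($"key" === "k1" && $"id" === 400)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]").appName("sorted-write").getOrCreate()
    import spark.implicits._

    // orderBy range-partitions into at most this many tasks (assumed value).
    spark.conf.set("spark.sql.shuffle.partitions", "8")

    // Hypothetical sample data: 1000 rows spread over 3 keys.
    val data = (1 to 1000).map(i => (s"k${i % 3}", i)).toDF("key", "id")
    val out = Files.createTempDirectory("sorted").resolve("location").toString

    val hit = writeAndLookup(spark, data, out)
    hit.explain() // look for PartitionFilters and PushedFilters in the scan node
    hit.show()
    spark.stop()
  }
}
```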

Spark 2.2+

From Spark 2.2 on, you can also use the new option maxRecordsPerFile to limit the number of records per file if your files are too large. You will still get at least N files if you have N partitions, but you can split the file written by 1 partition (task) into smaller chunks:

df.write
  .option("maxRecordsPerFile", 10000)
  ...

See e.g. http://www.gatorsmile.io/anticipated-feature-in-spark-2-2-max-records-written-per-file/ and spark write to disk with N files less than N partitions
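A fuller version of the snippet above might look like the following minimal sketch. The sample data, the limit of 100 records and the temp output path are assumptions; `repartition(col("key"))` is added here so that, without the limit, each key would be written as a single file.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object MaxRecordsPerFile {
  // Writes one task per key, starting a new file whenever a task
  // has written `maxRecords` rows into the current one.
  def write(data: DataFrame, out: String, maxRecords: Long): Unit =
    data.repartition(col("key"))
      .write
      .option("maxRecordsPerFile", maxRecords)
      .partitionBy("key")
      .parquet(out)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]").appName("max-records-per-file").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data: 1000 rows spread over 3 keys.
    val data = (1 to 1000).map(i => (s"k${i % 3}", i)).toDF("key", "id")
    val out = Files.createTempDirectory("maxrecords").resolve("location").toString

    write(data, out, maxRecords = 100L)
    spark.stop()
  }
}
```

The same limit can also be set for the whole session through the `spark.sql.files.maxRecordsPerFile` configuration instead of per write.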

answered Oct 03 '22 14:10

Raphael Roth

Let's expand on Raphael Roth's answer with an additional approach that'll create an upper bound on the number of files each partition can contain, as discussed in this answer:

import org.apache.spark.sql.functions.rand

df.repartition(numPartitions, $"some_col", rand())
  .write.partitionBy("some_col")
  .parquet("partitioned_lake")
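One way to check the upper bound is to count the files per directory after the write. In this minimal sketch the sample data, `numPartitions = 4` and the temp output path are assumptions, and `filesPerDir` is a hypothetical helper written for this example, not part of Spark.

```scala
import java.io.File
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.rand

object BoundedFilesPerPartition {
  // Hypothetical helper: number of parquet files in each some_col=<value> directory.
  def filesPerDir(root: File): Map[String, Int] =
    root.listFiles()
      .filter(_.isDirectory)
      .map(d => d.getName -> d.listFiles().count(_.getName.endsWith(".parquet")))
      .toMap

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]").appName("bounded-files").getOrCreate()
    import spark.implicits._

    val numPartitions = 4 // assumed value; also the per-directory file bound
    // Hypothetical sample data: 1000 rows spread over 3 values of some_col.
    val df = (1 to 1000).map(i => (s"k${i % 3}", i)).toDF("some_col", "id")
    val out = Files.createTempDirectory("lake").resolve("partitioned_lake").toFile

    // rand() spreads the rows of each some_col value over the numPartitions
    // tasks, so a directory receives at most numPartitions files.
    df.repartition(numPartitions, $"some_col", rand())
      .write.partitionBy("some_col")
      .parquet(out.getPath)

    filesPerDir(out).foreach { case (dir, n) => println(s"$dir: $n file(s)") }
    spark.stop()
  }
}
```

Each task writes one file per `some_col` value it holds, which is why the number of tasks caps the number of files in any single directory.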

answered Oct 03 '22 14:10

Powers
