Spark: Is there any rule of thumb about the optimal number of partitions for an RDD given its number of elements?

Is there any relationship between the number of elements an RDD contains and its ideal number of partitions?

I have an RDD with thousands of partitions (because I load it from a source composed of multiple small files, a constraint I can't fix, so I have to deal with it). I would like to repartition it (or use the coalesce method), but I don't know in advance the exact number of events the RDD will contain.
So I would like to do it in an automated way. Something that will look like:

val numberOfElements = rdd.count()
val magicNumber = 100000
rdd.coalesce((numberOfElements / magicNumber).toInt)

Is there any rule of thumb about the optimal number of partitions for an RDD given its number of elements?

Thanks.

asked Mar 15 '16 by jmvllt


2 Answers

There isn't, because it is highly dependent on the application, resources and data. There are some hard limitations (like various 2GB limits), but the rest you have to tune on a task-by-task basis. Some factors to consider:

  • size of a single row / element
  • cost of a typical operation. If partitions are small and operations are cheap, then the scheduling cost can be much higher than the cost of data processing (see the sketch after this list for one way to check how elements are spread across partitions).
  • cost of processing a partition when performing partition-wise operations (a sort, for example).
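
For instance, a minimal sketch (not from the original answer) to inspect how many elements end up in each partition before choosing a target count, assuming rdd is the RDD in question:

// Count elements per partition; cheaper than a shuffle,
// but it still scans the data once.
val elementsPerPartition = rdd
  .mapPartitionsWithIndex { (idx, iter) => Iterator((idx, iter.size)) }
  .collect()

elementsPerPartition.foreach { case (idx, n) =>
  println(s"partition $idx: $n elements")
}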

If the core problem here is the number of initial files, then using some variant of CombineFileInputFormat could be a better idea than repartitioning / coalescing. For example:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.lib.CombineTextInputFormat

sc.hadoopFile(
  path,
  classOf[CombineTextInputFormat],
  classOf[LongWritable], classOf[Text]
).map(_._2.toString)
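
If needed, the maximum size of the combined splits can usually be tuned through the Hadoop configuration before reading; the property name and the 64MB target below are assumptions for illustration, not part of the original answer:

// Cap each combined split at roughly 64MB (assumed target size).
sc.hadoopConfiguration.setLong(
  "mapreduce.input.fileinputformat.split.maxsize", 64L * 1024 * 1024)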

See also How to calculate the best numberOfPartitions for coalesce?

answered by zero323


While I completely agree with zero323, you can still implement some kind of heuristic. Internally, we took the size of the data (stored as compressed Avro key-value files) and computed the number of partitions so that every partition would be no more than 64MB (totalVolume / 64MB ≈ number of partitions). Once in a while we run an automatic job to recompute the "optimal" number of partitions for each type of input, etc. In our case it's easy to do since the inputs are on HDFS (S3 would probably work too).
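
As a rough illustration of that heuristic (a sketch only; the inputPath variable, the FileSystem call and the 64MB target are assumptions, not details from the answer):

import org.apache.hadoop.fs.{FileSystem, Path}

// Total on-disk size of the (compressed) input, as reported by HDFS.
val fs = FileSystem.get(sc.hadoopConfiguration)
val totalBytes = fs.getContentSummary(new Path(inputPath)).getLength

// Aim for at most ~64MB per partition, and never fewer than one partition.
val targetPartitionBytes = 64L * 1024 * 1024
val numPartitions = math.max(1, (totalBytes / targetPartitionBytes).toInt)

val resized = rdd.coalesce(numPartitions)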

Once again it depends on your computation and your data, so your number might be completely different.

answered by Igor Berman