Spark: Is there any rule of thumb about the optimal number of partitions for an RDD given its number of elements?

Is there any relationship between the number of elements an RDD contains and its ideal number of partitions?

I have an RDD with thousands of partitions (because I load it from a source composed of multiple small files, a constraint I can't fix, so I have to deal with it). I would like to repartition it (or use the coalesce method), but I don't know in advance the exact number of events the RDD will contain.
So I would like to do it in an automated way. Something that will look like:

val numberOfElements = rdd.count()
val magicNumber = 100000
rdd.coalesce((numberOfElements / magicNumber).toInt)

Is there any rule of thumb about the optimal number of partitions for an RDD given its number of elements?

Thanks.

asked Mar 15 '16 by jmvllt


2 Answers

There isn't, because it is highly dependent on the application, resources and data. There are some hard limitations (like various 2GB limits), but the rest you have to tune on a task-by-task basis. Some factors to consider:

  • size of a single row / element
  • cost of a typical operation. If partitions are small and operations are cheap, then the scheduling cost can be much higher than the cost of data processing (see the sketch after this list for one way to check how elements are spread across partitions).
  • cost of processing a partition when performing partition-wise operations (a sort, for example).
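
For instance, a minimal sketch (not from the original answer) to inspect how many elements end up in each partition before choosing a target count, assuming rdd is the RDD in question:

// Count elements per partition; cheaper than a shuffle,
// but it still scans the data once.
val elementsPerPartition = rdd
  .mapPartitionsWithIndex { (idx, iter) => Iterator((idx, iter.size)) }
  .collect()

elementsPerPartition.foreach { case (idx, n) =>
  println(s"partition $idx: $n elements")
}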

If the core problem here is the number of initial files, then using some variant of CombineFileInputFormat could be a better idea than repartitioning / coalescing. For example:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.lib.CombineTextInputFormat

sc.hadoopFile(
  path,
  classOf[CombineTextInputFormat],
  classOf[LongWritable], classOf[Text]
).map(_._2.toString)
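
If needed, the maximum size of the combined splits can usually be tuned through the Hadoop configuration before reading; the property name and the 64MB target below are assumptions for illustration, not part of the original answer:

// Cap each combined split at roughly 64MB (assumed target size).
sc.hadoopConfiguration.setLong(
  "mapreduce.input.fileinputformat.split.maxsize", 64L * 1024 * 1024)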

See also How to calculate the best numberOfPartitions for coalesce?

answered by zero323


While I completely agree with zero323, you can still implement some kind of heuristic. Internally, we took the size of the data (stored as compressed Avro key-value files) and computed the number of partitions so that every partition would be no more than 64MB (totalVolume / 64MB ≈ number of partitions). Once in a while we run an automatic job to recompute the "optimal" number of partitions for each type of input, etc. In our case it's easy to do since the inputs are on HDFS (S3 would probably work too).
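
As a rough illustration of that heuristic (a sketch only; the inputPath variable, the FileSystem call and the 64MB target are assumptions, not details from the answer):

import org.apache.hadoop.fs.{FileSystem, Path}

// Total on-disk size of the (compressed) input, as reported by HDFS.
val fs = FileSystem.get(sc.hadoopConfiguration)
val totalBytes = fs.getContentSummary(new Path(inputPath)).getLength

// Aim for at most ~64MB per partition, and never fewer than one partition.
val targetPartitionBytes = 64L * 1024 * 1024
val numPartitions = math.max(1, (totalBytes / targetPartitionBytes).toInt)

val resized = rdd.coalesce(numPartitions)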

Once again it depends on your computation and your data, so your number might be completely different.

answered by Igor Berman