
Number of partitions in RDD and performance in Spark

In PySpark, I can create an RDD from a list and decide how many partitions it has:

from pyspark import SparkContext

sc = SparkContext()
sc.parallelize(xrange(0, 10), 4)  # xrange in Python 2; use range in Python 3

How does the number of partitions I decide to split my RDD into influence performance? And how does this depend on the number of cores my machine has?

asked Mar 04 '16 by mar tin


People also ask

How many partitions should a Spark RDD have?

In a Spark RDD, the number of partitions can always be checked with the RDD's partitions method (getNumPartitions() in PySpark). For an RDD created with six partitions, for example, it reports 6.
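A minimal PySpark sketch of checking the partition count (getOrCreate and the variable names are illustrative):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# create an RDD with an explicit partition count, then inspect it
rdd = sc.parallelize(range(0, 10), 6)
print(rdd.getNumPartitions())  # 6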

What is default number of partitions in RDD?

By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.
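A short sketch of requesting more partitions when reading a file (the HDFS path here is hypothetical):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# roughly one partition per 128MB HDFS block by default
rdd = sc.textFile("hdfs:///data/events.log")

# ask for at least 40 partitions instead
rdd_more = sc.textFile("hdfs:///data/events.log", minPartitions=40)
print(rdd_more.getNumPartitions())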

How many partitions are there in Spark?

Spark shuffle operations move data between partitions. By default, DataFrame shuffle operations create 200 partitions.
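A sketch of observing and overriding that default (note that recent Spark versions with adaptive query execution may coalesce the actual count):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(1000)
# a shuffle (groupBy + count) produces spark.sql.shuffle.partitions
# output partitions -- 200 unless the setting is overridden
grouped = df.groupBy((df.id % 10).alias("bucket")).count()
print(grouped.rdd.getNumPartitions())  # 200 by default

# the default can be lowered for small datasets
spark.conf.set("spark.sql.shuffle.partitions", "50")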

How do I know how many partitions I have in RDD?

(a) If the parent RDD has a partitioner on the aggregation key(s), then the number of partitions in the aggregated RDD equals the number of partitions in the parent RDD. (b) If the parent RDD does not have a partitioner, then the number of partitions in the aggregated RDD equals the value of spark.default.parallelism.
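A sketch of both cases using reduceByKey (the data and partition counts are illustrative):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)], 8)

# (b) no partitioner on the parent: reduceByKey uses
# spark.default.parallelism if set, else the parent's count (8 here)
print(pairs.reduceByKey(lambda x, y: x + y).getNumPartitions())

# (a) parent already hash-partitioned by key: the 4 partitions carry over
partitioned = pairs.partitionBy(4)
print(partitioned.reduceByKey(lambda x, y: x + y).getNumPartitions())  # 4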


1 Answer

The primary performance effects come from specifying too few partitions or far too many partitions.

Too few partitions: you will not utilize all of the cores available in the cluster.

Too many partitions: there will be excessive overhead in managing many small tasks.

Between the two, the first is far more impactful on performance. Scheduling too many small tasks has a relatively minor impact for partition counts below 1000. If you have on the order of tens of thousands of partitions, then Spark gets very slow.
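As a rough sizing sketch, assuming the common heuristic of two to four partitions per core (a guideline, not something stated in the answer above):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# defaultParallelism reflects the total cores available to the application
cores = sc.defaultParallelism

# aim for roughly 2-4 partitions per core, so every core stays busy
# without the scheduler drowning in tiny tasks (tune per workload)
rdd = sc.parallelize(range(1_000_000), numSlices=3 * cores)
print(rdd.getNumPartitions())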

answered Oct 12 '22 by WestCoastProjects