How to repartition evenly in Spark?

To test how .repartition() works, I ran the following code:

rdd = sc.parallelize(range(100))
rdd.getNumPartitions()

rdd.getNumPartitions() resulted in 4. Then I ran:

rdd = rdd.repartition(10)
rdd.getNumPartitions()

rdd.getNumPartitions() this time resulted in 10, so there were now 10 partitions.

However, I checked the partitions by:

rdd.glom().collect()

The result gave 4 non-empty lists and 6 empty lists. Why haven't any elements been distributed to the other 6 lists?

asked Jun 29 '16 by cshin9

People also ask

How do you distribute data evenly in Spark?

You can force a new partitioning by using the partitionBy command and providing a number of partitions. By default the partitioner is hash-based, but you can switch to a range-based partitioner for a better distribution.
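As a minimal sketch (assuming a standard PySpark shell where sc and spark are already defined; partitionBy only works on key-value RDDs, hence the pairing step):

    pairs = sc.parallelize(range(100)).map(lambda x: (x, None))   # make a pair RDD
    print(pairs.partitionBy(10).glom().map(len).collect())        # hash-based, roughly even buckets

    df = spark.range(100)                                         # DataFrame route (Spark 2.3+)
    print(df.repartitionByRange(10, "id").rdd.glom().map(len).collect())  # range-based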

How do you effectively repartition a Spark data frame?

In order to achieve high parallelism, Spark splits data into smaller chunks called partitions, which are distributed across the nodes in the Spark cluster. Every node can have more than one executor, each of which can execute a task.
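As a sketch (again assuming a SparkSession named spark), a DataFrame can be repartitioned by a target count alone, or by a count plus a column to hash on:

    df = spark.range(1000000)
    print(df.rdd.getNumPartitions())     # current partition count
    df_even = df.repartition(8)          # full shuffle, round-robin into 8 partitions
    df_by_id = df.repartition(8, "id")   # full shuffle, hash-partitioned on the id column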

What is a good number of partitions in Spark?

The general recommendation for Spark is to have 4x as many partitions as the number of cores available to the application; as an upper bound on the partition count, each task should still take 100 ms or more to execute (shorter tasks mean the data is split too finely).
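A rule-of-thumb sketch of that guideline; sc.defaultParallelism is used here as a stand-in for the core count available to the application, which is an assumption that depends on the cluster manager:

    target = 4 * sc.defaultParallelism            # ~4 partitions per available core
    rdd = sc.parallelize(range(1000000), target)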

How do I reduce the number of partitions in Spark?

Spark RDD coalesce() is used only to reduce the number of partitions. It is an optimized version of repartition(): because coalesce merges existing partitions instead of performing a full shuffle, less data moves across the cluster.
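A quick sketch of the difference:

    rdd = sc.parallelize(range(100), 10)
    print(rdd.coalesce(4).getNumPartitions())    # 4: merges partitions without a full shuffle
    print(rdd.coalesce(20).getNumPartitions())   # still 10: coalesce never increases the count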


1 Answer

The algorithm behind repartition() uses logic to optimize the most effective way to redistribute data across partitions. In this case, your range is very small and Spark doesn't find it optimal to actually break the data down further. If you were to use a much bigger range, like 100000, you would find that it does in fact redistribute the data.
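For instance (hedged: the exact partition sizes depend on the Spark version and PySpark's internal batching):

    big = sc.parallelize(range(100000)).repartition(10)
    print(big.glom().map(len).collect())   # expect all 10 partitions to be non-empty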

If you want to force a certain number of partitions, you can specify it upon the initial load of the data. At that point Spark will try to distribute the data evenly across partitions, even if that's not necessarily optimal. The parallelize function takes a second argument for the number of partitions:

    rdd = sc.parallelize(range(100), 10)
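Checking with glom() as before should now show an even split, ten elements per partition:

    print(rdd.glom().map(len).collect())   # [10, 10, 10, 10, 10, 10, 10, 10, 10, 10]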

The same thing works if you are, say, reading from a text file; textFile() also accepts a partition count as its second argument (there it is a minimum, minPartitions, rather than an exact count):

    rdd = sc.textFile('path/to/file', minPartitions)
answered Sep 27 '22 by user3124181