To test how .repartition()
works, I ran the following code:
rdd = sc.parallelize(range(100))
rdd.getNumPartitions()
rdd.getNumPartitions() returned 4. Then I ran:
rdd = rdd.repartition(10)
rdd.getNumPartitions()
rdd.getNumPartitions() this time returned 10, so there were now 10 partitions.
However, when I inspected the partitions with:
rdd.glom().collect()
the result showed 4 non-empty lists and 6 empty lists. Why weren't any elements distributed to the other 6 partitions?
You can force a new partitioning by using the partitionBy command and providing a number of partitions. By default the partitioner is hash-based, but you can switch to a range-based partitioner for a better distribution.
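Here is a minimal sketch of that approach, assuming the sc from the question. Note that partitionBy is only defined on key-value RDDs, so this sketch pairs each element with itself first:
rdd = sc.parallelize(range(100)).map(lambda x: (x, x))  # partitionBy needs (key, value) pairs
partitioned = rdd.partitionBy(10)  # uses a hash partitioner by default; a custom partitionFunc can be passed
print(partitioned.getNumPartitions())                  # 10
print([len(p) for p in partitioned.glom().collect()])  # element count per partition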
In order to achieve high parallelism, Spark splits the data into smaller chunks called partitions, which are distributed across the different nodes of the Spark cluster. Every node can have more than one executor, each of which can execute tasks.
The general recommendation for Spark is to have about 4x as many partitions as there are cores available to the application; as an upper bound on the partition count, each task should still take at least 100 ms to execute, otherwise scheduling overhead dominates.
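As a rough illustration of that rule of thumb (sc.defaultParallelism is only an approximation of the core count available to the application):
cores = sc.defaultParallelism   # parallelism hint Spark derives from the available cores
target = 4 * cores              # ~4x partitions per core, per the guideline above
rdd = sc.parallelize(range(100000), target)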
Spark's RDD coalesce() is used only to reduce the number of partitions. It is an optimized version of repartition() for that case: coalesce merges existing partitions rather than performing a full shuffle, so less data moves across the cluster.
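A minimal sketch, again assuming the question's sc:
rdd = sc.parallelize(range(100), 10)
print(rdd.getNumPartitions())      # 10
smaller = rdd.coalesce(4)          # merges partitions; avoids a full shuffle by default
print(smaller.getNumPartitions())  # 4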
The algorithm behind repartition() uses logic to find an effective way to redistribute data across partitions. In this case, your range is very small, and it does not find it worthwhile to break the data down further. If you use a much bigger range, like 100000, you will find that it does in fact redistribute the data.
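A quick way to check this for yourself (exact counts may vary by Spark version and configuration):
big = sc.parallelize(range(100000)).repartition(10)
print([len(p) for p in big.glom().collect()])  # expect all 10 partitions to be non-empty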
If you want to force a certain number of partitions, you can specify it upon the initial load of the data. At that point, Spark will try to distribute the data evenly across partitions even if it is not necessarily optimal. The parallelize function takes the number of partitions as a second argument:
rdd = sc.parallelize(range(100), 10)
The same thing works if you, say, read from a text file:
rdd = sc.textFile('path/to/file', minPartitions)
(note that for textFile the second argument is minPartitions, a minimum partition count rather than an exact one)