
repartition() is not affecting RDD partition size

I am trying to change the number of partitions of an RDD using the repartition() method. The call succeeds, but when I check the count with the RDD's partitions.size property, I get back the same number of partitions it originally had:

scala> rdd.partitions.size
res56: Int = 50

scala> rdd.repartition(10)
res57: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[19] at repartition at <console>:27

At this point I run an action such as rdd.take(1) to force evaluation, in case that matters, and then check the number of partitions again:

scala> rdd.partitions.size
res58: Int = 50

As you can see, it does not change. Can someone explain why?

Dhiraj asked Jul 20 '15 03:07


1 Answer

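First, repartition is a lazy transformation, so no data moves until you run an action; the partition count of the returned RDD, however, can be checked right away. Second, and this is the actual problem here, repartition does not modify the RDD in place: it returns a new RDD with the changed partitioning, so you must use the returned RDD (for example, by assigning it to a val) or you are still working with the old partitioning. Finally, when reducing the number of partitions, prefer coalesce, which by default avoids a full shuffle: it merges existing partitions into fewer ones instead of redistributing every record across the cluster.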
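To make this concrete, here is a minimal sketch in Scala. It assumes a local SparkContext and a made-up 50-partition RDD of strings standing in for the one in the question; the names RepartitionDemo, partitionCounts, repartitioned and coalesced are illustrative only and do not come from the original post.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RepartitionDemo {

  // Returns the partition counts of (original, repartitioned, coalesced) RDDs.
  // The input data is hypothetical: 1000 strings spread over 50 partitions.
  def partitionCounts(sc: SparkContext): (Int, Int, Int) = {
    val rdd = sc.parallelize(1 to 1000, 50).map(_.toString)

    // repartition returns a NEW RDD; the original `rdd` is left untouched.
    val repartitioned = rdd.repartition(10)

    // coalesce also returns a new RDD, but by default it avoids a full shuffle.
    val coalesced = rdd.coalesce(10)

    (rdd.partitions.size, repartitioned.partitions.size, coalesced.partitions.size)
  }

  def main(args: Array[String]): Unit = {
    // Local master chosen only so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("repartition-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (original, afterRepartition, afterCoalesce) = partitionCounts(sc)
      println(s"original:      $original")
      println(s"repartitioned: $afterRepartition")
      println(s"coalesced:     $afterCoalesce")
    } finally {
      sc.stop()
    }
  }
}
```

In the spark-shell session from the question, the equivalent fix would be to write val rdd2 = rdd.repartition(10) and then check rdd2.partitions.size rather than rdd.partitions.size.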

Justin Pihony answered Oct 11 '22 17:10