I am trying to change partition size of an RDD using repartition() method. The method call on the RDD succeeds, but when I explicitly check the partition size using partition.size property of the RDD, I get back the same number of partitions that it originally had:-
scala> rdd.partitions.size
res56: Int = 50
scala> rdd.repartition(10)
res57: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[19] at repartition at <console>:27
At this stage I perform some action like rdd.take(1) just to force evaluation, just in case if that matters. And then I again check the partition size:-
scala> rdd.partitions.size
res58: Int = 50
As one can see, it's not changing. Can someone answer why?
First, it does matter that you run an action as repartition
is indeed lazy. Second, repartition
returns a new RDD
with the partitioning changed, so you must use the returned RDD
or else you are still working off of the old partitioning. Finally, when shrinking your partitions, you should use coalesce
, as that will not reshuffle the data. It will instead keep data on the number of nodes and pull in the remaining orphans.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With