I was going through the Spark documentation and got a bit confused about the difference between the rdd.repartition() function and the number of partitions we pass when creating an RDD with sc.parallelize().
I have 4 cores on my machine. If I call sc.parallelize(data, 4) everything works fine, but when I call rdd.repartition(4) and then apply rdd.mapPartitions(fun), sometimes a partition has no data and my function fails in those cases.
So I just wanted to understand the difference between these two ways of partitioning.
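Roughly what I am doing, as a minimal spark-shell style sketch (assuming a local SparkContext sc and a hypothetical fun that expects at least one element per partition):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("partition-test"))

val data = 1 to 10

// Hypothetical per-partition function that assumes the iterator is non-empty.
def fun(it: Iterator[Int]): Iterator[Int] = Iterator(it.max)

// Passing the partition count at creation time: works fine for me.
sc.parallelize(data, 4).mapPartitions(fun).collect()

// Repartitioning afterwards: a partition can end up empty, in which case
// fun fails (it.max throws on an empty iterator).
sc.parallelize(data).repartition(4).mapPartitions(fun).collect()
```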
By calling repartition(N), Spark will do a shuffle to change the number of partitions (and will by default result in a HashPartitioner with that number of partitions). When you call sc.parallelize with a desired number of partitions, it splits your data (more or less) equally amongst the slices (effectively similar to a range partitioner); you can see this in ParallelCollectionRDD inside of the slice function.
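One way to see the difference is to print how many elements land in each partition under the two approaches. This is just a sketch assuming the same spark-shell style sc as in the question; the exact counts depend on your data and number of cores:

```scala
import org.apache.spark.rdd.RDD

// Count the elements in each partition without pulling the data to the driver.
def partitionSizes[T](rdd: RDD[T]): Seq[(Int, Int)] =
  rdd.mapPartitionsWithIndex { (idx, it) => Iterator((idx, it.size)) }.collect().toSeq

// Sliced at creation time: contiguous, near-equal chunks.
println(partitionSizes(sc.parallelize(1 to 10, 4)))

// Reshuffled after the fact: sizes depend on the shuffle and can include zeros.
println(partitionSizes(sc.parallelize(1 to 10, 2).repartition(4)))
```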
That being said, both sc.parallelize(data, N) and rdd.repartition(N) (and really almost any form of reading in data) can result in RDDs with empty partitions. This is a pretty common source of errors with mapPartitions code, which is why I biased the RDD generator in spark-testing-base to create RDDs with empty partitions. A really simple fix for most functions is to check whether you've been passed an empty iterator and, in that case, just return an empty iterator.
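A minimal sketch of that guard, assuming a hypothetical per-partition computation that needs at least one element:

```scala
// Return an empty iterator for empty partitions instead of assuming there is data.
def safeFun(it: Iterator[Int]): Iterator[Int] =
  if (it.hasNext) Iterator(it.max)   // hypothetical per-partition computation
  else Iterator.empty

sc.parallelize(1 to 10).repartition(4).mapPartitions(safeFun).collect()
```

With this guard, empty partitions simply contribute nothing to the result instead of crashing the task.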