 

What is the difference between rdd.repartition() and the partition count in sc.parallelize(data, partitions)?

I was going through the Spark documentation and got a bit confused about the rdd.repartition() function versus the number of partitions we pass when creating an RDD with sc.parallelize().

I have 4 cores on my machine. If I call sc.parallelize(data, 4), everything works fine, but when I call rdd.repartition(4) and then apply rdd.mapPartitions(fun), sometimes a partition has no data and my function fails in those cases.
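For reference, here is a minimal sketch of the failure I mean (fun is a stand-in for a function that assumes each partition is non-empty):

    from pyspark import SparkContext

    sc = SparkContext("local[4]", "empty-partition-demo")

    def fun(it):
        first = next(it)  # raises on an empty partition (StopIteration, surfaced as an error)
        yield first

    # fewer elements than partitions guarantees some partitions end up empty
    rdd = sc.parallelize([1, 2]).repartition(4)
    print(rdd.mapPartitions(fun).collect())  # fails on the empty partitions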

So I just wanted to understand the difference between these two ways of partitioning.

asked Oct 20 '22 by ansrivas

1 Answer

By calling repartition(N), Spark will do a shuffle to change the number of partitions (by default this results in a HashPartitioner with that number of partitions). When you call sc.parallelize with a desired number of partitions, it splits your data (more or less) equally among the slices (effectively similar to a range partitioner); you can see this in the slice function inside ParallelCollectionRDD.
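A quick way to see the difference is to print each partition's contents with glom(); a small sketch, assuming a local 4-core context like yours:

    from pyspark import SparkContext

    sc = SparkContext("local[4]", "partitioning-demo")
    data = list(range(8))

    # parallelize slices the collection (more or less) evenly by position:
    print(sc.parallelize(data, 4).glom().collect())
    # e.g. [[0, 1], [2, 3], [4, 5], [6, 7]]

    # repartition shuffles into N new partitions; the sizes depend on the
    # shuffle and can be uneven, including empty:
    print(sc.parallelize(data, 2).repartition(4).glom().collect())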

That being said, it is possible that both sc.parallelize(data, N) and rdd.repartition(N) (and really almost any way of reading in data) can result in RDDs with empty partitions (it's a pretty common source of errors with mapPartitions code, so I biased the RDD generator in spark-testing-base to create RDDs with empty partitions). A really simple fix for most functions is to check whether you've been passed an empty iterator and, in that case, just return an empty iterator.
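A sketch of that guard for a mapPartitions function in PySpark (safe_fun, process, and rdd are hypothetical names here, with process standing in for your real per-partition logic):

    from itertools import chain

    def safe_fun(it):
        try:
            first = next(it)
        except StopIteration:
            return iter([])  # empty partition: return an empty iterator
        # put the peeked element back in front and process as usual
        return process(chain([first], it))

    result = rdd.mapPartitions(safe_fun)

Note that safe_fun returns an iterator rather than using yield, so the StopIteration from the peek can be caught cleanly instead of escaping out of a generator.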

answered Oct 22 '22 by Holden