I have two RDD[K, V]s, where K = Long and V = Object. Let's call them rdd1 and rdd2. I have a common custom Partitioner. I am trying to find a way to take the union or join while avoiding or minimizing data movement.
val kafkaRdd1 = /* from kafka sources */
val kafkaRdd2 = /* from kafka sources */
val rdd1 = kafkaRdd1.partitionBy(new MyCustomPartitioner(24))
val rdd2 = kafkaRdd2.partitionBy(new MyCustomPartitioner(24))
val rdd3 = rdd1.union(rdd2)         // without shuffle
val rdd4 = rdd1.leftOuterJoin(rdd2) // without shuffle
Is it safe to assume (or is there a way to enforce) that the nth partition of both rdd1 and rdd2 ends up on the same worker node?
It is not possible to enforce* colocation in Spark, but the approach you are using will minimize data movement. When the inputs share the same partitioner, union returns a PartitionerAwareUnionRDD; when that RDD is created, the input RDDs are analyzed and optimal output locations are chosen based on the number of records per location. See its getPreferredLocations method for details.
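As a sanity check, here is a minimal, self-contained sketch that confirms the union stays partitioner-aware and the join adds no third shuffle. The body of MyCustomPartitioner is an assumption, since the question does not show it. One detail that is easy to miss: the custom partitioner should override equals and hashCode, because Spark only treats two RDDs as co-partitioned when their partitioners compare equal, and the default reference equality fails for two separately constructed instances.

import org.apache.spark.{Partitioner, SparkConf, SparkContext}

// Hypothetical stand-in for the question's MyCustomPartitioner:
// buckets Long keys by modulo into a fixed number of partitions.
class MyCustomPartitioner(parts: Int) extends Partitioner {
  override def numPartitions: Int = parts
  override def getPartition(key: Any): Int =
    math.abs(key.asInstanceOf[Long] % parts).toInt
  // Without these overrides, two `new MyCustomPartitioner(24)` instances
  // never compare equal and Spark re-shuffles anyway.
  override def equals(other: Any): Boolean = other match {
    case p: MyCustomPartitioner => p.numPartitions == numPartitions
    case _                      => false
  }
  override def hashCode: Int = numPartitions
}

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("copartition"))

val rdd1 = sc.parallelize(Seq(1L -> "a", 2L -> "b")).partitionBy(new MyCustomPartitioner(24))
val rdd2 = sc.parallelize(Seq(1L -> "x", 3L -> "y")).partitionBy(new MyCustomPartitioner(24))

// Equal partitioners => union yields a PartitionerAwareUnionRDD and the
// partitioner survives; with differing partitioners it would be None.
val unioned = rdd1.union(rdd2)
println(unioned.partitioner.isDefined) // true

// A co-partitioned join is a narrow dependency: the lineage below should
// show no ShuffledRDD beyond the two partitionBy shuffles.
println(rdd1.leftOuterJoin(rdd2).toDebugString)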
* According to High Performance Spark: "Two RDDs will be colocated if they have the same partitioner and were shuffled as part of the same action."
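In practical terms, that means you get the best chance of colocation by letting a single action drive both shuffles, rather than materializing rdd1 and rdd2 in separate jobs first. A rough sketch under that reading, reusing kafkaRdd1 and kafkaRdd2 from the question:

// Both partitionBy shuffles and the join are scheduled within one action,
// so the scheduler can use each stage's preferred locations to keep
// matching partitions together where possible.
val rdd1 = kafkaRdd1.partitionBy(new MyCustomPartitioner(24))
val rdd2 = kafkaRdd2.partitionBy(new MyCustomPartitioner(24))
val joined = rdd1.leftOuterJoin(rdd2) // narrow dependency, no third shuffle
joined.count() // a single action materializes all three RDDs together

// By contrast, materializing the inputs in separate actions first
// (rdd1.cache(); rdd1.count(); then rdd2.cache(); rdd2.count()) may leave
// the nth partitions of rdd1 and rdd2 on different executors, since each
// shuffle's output is placed independently.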