I have two RDD[(K, V)], where K = Long and V = Object. Let's call them rdd1 and rdd2. Both share a common custom Partitioner. I am trying to find a way to take a union or join while avoiding or minimizing data movement.
val kafkaRdd1 = /* from kafka sources */
val kafkaRdd2 = /* from kafka sources */
val rdd1 = kafkaRdd1.partitionBy(new MyCustomPartitioner(24))
val rdd2 = kafkaRdd2.partitionBy(new MyCustomPartitioner(24))
val rdd3 = rdd1.union(rdd2)          // without a shuffle?
val rdd4 = rdd1.leftOuterJoin(rdd2)  // without a shuffle?
Is it safe to assume (or is there a way to enforce) that the nth partition of both rdd1 and rdd2 lives on the same worker node?
It is not possible to enforce* colocation in Spark, but the approach you use will minimize data movement. When a PartitionerAwareUnionRDD is created, the input RDDs are analyzed to choose optimal output locations based on the number of records per location. See its getPreferredLocations method for details.
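To see this in practice, here is a minimal sketch, using a HashPartitioner as a stand-in for MyCustomPartitioner, that checks the partitioner survives the union and that the join lineage contains no shuffle stage:

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Minimal sketch; HashPartitioner stands in for MyCustomPartitioner.
val sc = new SparkContext(new SparkConf().setAppName("copartition").setMaster("local[*]"))

val part = new HashPartitioner(24)
val rdd1 = sc.parallelize(Seq(1L -> "a", 2L -> "b")).partitionBy(part)
val rdd2 = sc.parallelize(Seq(2L -> "c", 3L -> "d")).partitionBy(part)

// Same partitioner on both inputs => union returns a PartitionerAwareUnionRDD:
// the partitioner is preserved and the partition count stays at 24.
val unioned = rdd1.union(rdd2)
println(unioned.partitioner)        // Some(HashPartitioner@...)
println(unioned.partitions.length)  // 24

// A join on co-partitioned RDDs uses narrow dependencies; the lineage
// printed below contains no ShuffledRDD.
val joined = rdd1.leftOuterJoin(rdd2)
println(joined.toDebugString)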
* According to High Performance Spark:
Two RDDs will be colocated if they have the same partitioner and were shuffled as part of the same action.
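Note also that Spark decides whether two RDDs are co-partitioned by comparing their partitioners with equals, so a custom partitioner must implement equals and hashCode correctly. A hypothetical sketch of what MyCustomPartitioner (not shown in the question) might look like for Long keys:

import org.apache.spark.Partitioner

// Hypothetical sketch; the real MyCustomPartitioner is not shown.
// Implementing equals/hashCode is what lets Spark recognize rdd1 and rdd2
// as co-partitioned and skip the shuffle.
class MyCustomPartitioner(override val numPartitions: Int) extends Partitioner {

  def getPartition(key: Any): Int = key match {
    // Assumed placement rule for Long keys; replace with your own logic.
    case k: Long => (((k % numPartitions) + numPartitions) % numPartitions).toInt
    case null    => 0
    case other   => throw new IllegalArgumentException(s"Unexpected key: $other")
  }

  override def equals(other: Any): Boolean = other match {
    case p: MyCustomPartitioner => p.numPartitions == numPartitions
    case _                      => false
  }

  override def hashCode: Int = numPartitions
}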