Will rdd1.join(rdd2) cause a shuffle to happen if rdd1 and rdd2 have the same partitioner?
Transformations which can cause a shuffle include repartition operations like repartition and coalesce, 'ByKey operations (except for counting) like groupByKey and reduceByKey, and join operations like cogroup and join.
repartition, join, cogroup, and any of the *By or *ByKey transformations can result in shuffles.
coalesce doesn't involve a full shuffle by default (that is, unless it is called with shuffle = true). If the number of partitions is reduced from 5 to 2, coalesce leaves the data in 2 of the partitions where it is and moves the data from the remaining 3 into those 2, thereby avoiding a full shuffle.
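To make the difference between coalesce and repartition concrete, here is a minimal sketch that inspects the lineage of both. It assumes Spark is on the classpath and runs in local mode; the object name, the helper hasShuffle, and the sample data are made up for illustration.

```scala
import org.apache.spark.{ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical demo object, not part of Spark.
object CoalesceVsRepartitionSketch {
  // True if any RDD in the lineage is reached through a ShuffleDependency.
  def hasShuffle(rdd: RDD[_]): Boolean =
    rdd.dependencies.exists { d =>
      d.isInstanceOf[ShuffleDependency[_, _, _]] || hasShuffle(d.rdd)
    }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("coalesce-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val five = sc.parallelize(1 to 100, numSlices = 5)

      val narrowed = five.coalesce(2)      // shuffle = false by default
      val reshuffled = five.repartition(2) // same as coalesce(2, shuffle = true)

      println(s"coalesce(2) has a shuffle in its lineage: ${hasShuffle(narrowed)}")
      println(s"repartition(2) has a shuffle in its lineage: ${hasShuffle(reshuffled)}")

      // toDebugString prints the lineage; a shuffle shows up as a ShuffledRDD.
      println(narrowed.toDebugString)
      println(reshuffled.toDebugString)
    } finally {
      sc.stop()
    }
  }
}
```

The check only looks at the dependency graph, so no job has to run for it to work.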
No. If two RDDs have the same partitioner, the join will not cause a shuffle. You can see this in CoGroupedRDD.scala:
    override def getDependencies: Seq[Dependency[_]] = {
      rdds.map { rdd: RDD[_ <: Product2[K, _]] =>
        if (rdd.partitioner == Some(part)) {
          logDebug("Adding one-to-one dependency with " + rdd)
          new OneToOneDependency(rdd)
        } else {
          logDebug("Adding shuffle dependency with " + rdd)
          new ShuffleDependency[K, Any, CoGroupCombiner](rdd, part, serializer)
        }
      }
    }
Note however, that the lack of a shuffle does not mean that no data will have to be moved between nodes. It's possible for two RDDs to have the same partitioner (be co-partitioned) yet have the corresponding partitions located on different nodes (not be co-located).
This situation is still better than doing a shuffle, but it's something to keep in mind. Co-location can improve performance, but is hard to guarantee.
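One way to check this yourself is to look at the dependencies of the CoGroupedRDD that sits behind a join. The sketch below assumes Spark is on the classpath and runs in local mode; the object name, the helpers findCoGrouped and shuffledParents, and the sample data are made up for illustration. It relies on join being built on top of cogroup, as the excerpt above suggests.

```scala
import org.apache.spark.{HashPartitioner, ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.{CoGroupedRDD, RDD}

// Hypothetical demo object, not part of Spark.
object CoPartitionedJoinSketch {
  // Walk up the lineage of a join result until the underlying CoGroupedRDD is found.
  def findCoGrouped(rdd: RDD[_]): Option[CoGroupedRDD[_]] = rdd match {
    case cg: CoGroupedRDD[_] => Some(cg)
    case other => other.dependencies.headOption.flatMap(d => findCoGrouped(d.rdd))
  }

  // Count how many direct parents of the cogroup are reached through a shuffle.
  def shuffledParents(joined: RDD[_]): Int =
    findCoGrouped(joined)
      .map(_.dependencies.count(_.isInstanceOf[ShuffleDependency[_, _, _]]))
      .getOrElse(0)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("co-partitioned-join").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val part = new HashPartitioner(4)
      // partitionBy itself shuffles once; the question is whether the join adds another.
      val rdd1 = sc.parallelize(Seq(1 -> "a", 2 -> "b", 3 -> "c")).partitionBy(part)
      val rdd2 = sc.parallelize(Seq(1 -> "x", 2 -> "y", 4 -> "z")).partitionBy(part)
      // rdd3 has no partitioner, so joining against it needs a shuffle on that side.
      val rdd3 = sc.parallelize(Seq(1 -> "p", 3 -> "q"))

      println(s"rdd1.join(rdd2) shuffled parents: ${shuffledParents(rdd1.join(rdd2))}")
      println(s"rdd1.join(rdd3) shuffled parents: ${shuffledParents(rdd1.join(rdd3))}")
      println(rdd1.join(rdd2).toDebugString)
    } finally {
      sc.stop()
    }
  }
}
```

If you join the same pre-partitioned RDD repeatedly, persisting it after partitionBy keeps that initial shuffle from being recomputed on every action.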