Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Does a join of co-partitioned RDDs cause a shuffle in Apache Spark?

Will rdd1.join(rdd2) cause a shuffle to happen if rdd1 and rdd2 have the same partitioner?

like image 835
zwb Avatar asked Feb 08 '15 14:02

zwb


People also ask

What triggers a shuffle in spark?

Transformations which can cause a shuffle include repartition operations like repartition and coalesce , 'ByKey operations (except for counting) like groupByKey and reduceByKey , and join operations like cogroup and join .

Does Join Cause shuffle?

1. repartition, join, cogroup, and any of the *By or *ByKey transformations can result in shuffles.

What will avoid full shuffle in spark if partitions are set to be decreased?

Coalesce doesn't involve a full shuffle. If the number of partitions is reduced from 5 to 2. Coalesce will not move data in 2 executors and move the data from the remaining 3 executors to the 2 executors. Thereby avoiding a full shuffle.


1 Answers

No. If two RDDs have the same partitioner, the join will not cause a shuffle. You can see this in CoGroupedRDD.scala:

override def getDependencies: Seq[Dependency[_]] = {
  rdds.map { rdd: RDD[_ <: Product2[K, _]] =>
    if (rdd.partitioner == Some(part)) {
      logDebug("Adding one-to-one dependency with " + rdd)
      new OneToOneDependency(rdd)
    } else {
      logDebug("Adding shuffle dependency with " + rdd)
      new ShuffleDependency[K, Any, CoGroupCombiner](rdd, part, serializer)
    }
  }
}

Note however, that the lack of a shuffle does not mean that no data will have to be moved between nodes. It's possible for two RDDs to have the same partitioner (be co-partitioned) yet have the corresponding partitions located on different nodes (not be co-located).

This situation is still better than doing a shuffle, but it's something to keep in mind. Co-location can improve performance, but is hard to guarantee.

like image 95
Daniel Darabos Avatar answered Oct 06 '22 00:10

Daniel Darabos