Does a join of co-partitioned RDDs cause a shuffle in Apache Spark?

1 Answer

No. If two RDDs have the same partitioner, the join will not cause a shuffle. You can see this in CoGroupedRDD.scala:

override def getDependencies: Seq[Dependency[_]] = {
  rdds.map { rdd: RDD[_ <: Product2[K, _]] =>
    if (rdd.partitioner == Some(part)) {
      logDebug("Adding one-to-one dependency with " + rdd)
      new OneToOneDependency(rdd)
    } else {
      logDebug("Adding shuffle dependency with " + rdd)
      new ShuffleDependency[K, Any, CoGroupCombiner](rdd, part, serializer)
    }
  }
}

Note, however, that the lack of a shuffle does not mean that no data will have to be moved between nodes. It's possible for two RDDs to have the same partitioner (be co-partitioned) yet have the corresponding partitions located on different nodes (not be co-located).

This situation is still better than doing a shuffle, but it's something to keep in mind. Co-location can improve performance, but is hard to guarantee.
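To check the dependency behavior described above on your own setup, here is a minimal sketch. It assumes a local Spark installation on the classpath; the object name `CoPartitionedJoinCheck`, the helper `shuffleDepsAbove`, and the sample data are hypothetical and only there for illustration. The helper walks the lineage of the joined RDD and counts shuffle dependencies, stopping at the input RDDs so that the shuffles done by `partitionBy` itself are not counted.

```scala
import org.apache.spark.{HashPartitioner, ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CoPartitionedJoinCheck {

  // Hypothetical helper: counts shuffle dependencies in the lineage of `rdd`,
  // without descending into the RDDs whose ids are listed in `inputs`.
  def shuffleDepsAbove(rdd: RDD[_], inputs: Set[Int]): Int =
    if (inputs.contains(rdd.id)) 0
    else rdd.dependencies.map {
      case _: ShuffleDependency[_, _, _] => 1
      case dep                           => shuffleDepsAbove(dep.rdd, inputs)
    }.sum

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("co-partitioned-join-check").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val part = new HashPartitioner(4)

      // Both RDDs are explicitly given the same partitioner, so they are co-partitioned.
      val left  = sc.parallelize(Seq(1 -> "a", 2 -> "b", 3 -> "c")).partitionBy(part)
      val right = sc.parallelize(Seq(1 -> "x", 2 -> "y", 4 -> "z")).partitionBy(part)

      // This RDD has no partitioner, so joining it should need a shuffle.
      val plain = sc.parallelize(Seq(1 -> "p", 2 -> "q"))

      val coPartitionedJoin = left.join(right, part)
      val mixedJoin         = left.join(plain, part)

      // Per the CoGroupedRDD logic quoted above, no shuffle dependency is
      // expected for the first join and one is expected for the second.
      println("co-partitioned join: " + shuffleDepsAbove(coPartitionedJoin, Set(left.id, right.id)))
      println("mixed join: " + shuffleDepsAbove(mixedJoin, Set(left.id, plain.id)))

      // The full lineage is also available for manual inspection.
      println(coPartitionedJoin.toDebugString)
    } finally {
      sc.stop()
    }
  }
}
```

The partitioner is passed to `join` explicitly so that the result does not depend on how Spark chooses a default partitioner. This only tells you about shuffle dependencies; it says nothing about whether the matching partitions happen to sit on the same node.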

answered Oct 06 '22 00:10

Daniel Darabos


Tags:

apache-spark

rdd

spark-streaming

zwb
