Are there any implementations of Spark SQL DataSources that offer co-partitioned joins - most likely via CoGroupedRDD? I did not see any uses within the existing Spark codebase.
The motivation is to greatly reduce shuffle traffic in the case where two tables have the same number and the same ranges of partitioning keys: in that case there would be an Mx1 instead of an MxN shuffle fanout.
The only large-scale join implementation presently in Spark SQL appears to be ShuffledHashJoin, which requires the full MxN shuffle fanout and is therefore expensive.
RDDs in Spark are partitioned, using a HashPartitioner by default. Co-partitioned RDDs use the same partitioner and therefore have their data distributed across partitions in the same way.
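As a sketch of what co-partitioning buys you at the RDD level (the partition count of 8 and the toy data here are purely illustrative), pre-partitioning both sides with the same HashPartitioner lets the subsequent join run as a narrow dependency, with no shuffle at join time:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object CoPartitionedJoin {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("co-partitioned-join").setMaster("local[*]"))

    val partitioner = new HashPartitioner(8)

    // Partition both RDDs with the same partitioner and cache them,
    // so matching keys already live in the same partition on each side.
    val left  = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(partitioner).cache()
    val right = sc.parallelize(Seq((1, "x"), (2, "y"))).partitionBy(partitioner).cache()

    // Because both sides share a partitioner, this join is a narrow
    // dependency: Spark joins partition i of left with partition i of right.
    val joined = left.join(right)
    joined.collect().foreach(println)

    sc.stop()
  }
}
```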
Spark SQL supports several join types: inner, cross, left outer, right outer, full outer, left semi, and left anti. Which join to use in a given scenario depends on the business use case.
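For reference, the join type is selected via the third argument of the DataFrame join API; the orders/customers tables below are hypothetical sample data:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("join-types").master("local[*]").getOrCreate()
import spark.implicits._

val orders    = Seq((1, 100.0), (2, 250.0)).toDF("customer_id", "amount")
val customers = Seq((1, "Alice"), (3, "Carol")).toDF("customer_id", "name")

// The third argument names the join type; other accepted values include
// "inner", "cross", "right_outer", "full_outer", "left_semi", "left_anti".
val leftJoined = orders.join(customers, Seq("customer_id"), "left_outer")
leftJoined.show()
```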
One way to avoid shuffles when joining two datasets is to take advantage of broadcast variables. When one of the datasets is small enough to fit in memory on a single executor, it can be loaded into a hash table on the driver and then broadcast to every executor.
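A minimal sketch of this using the broadcast hint from org.apache.spark.sql.functions (the table names and data are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("broadcast-join").master("local[*]").getOrCreate()
import spark.implicits._

val large = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")
val small = Seq((1, "x"), (2, "y")).toDF("id", "label")

// The broadcast() hint asks the planner to ship the small table to every
// executor, so the large table is joined map-side without a shuffle.
val joined = large.join(broadcast(small), "id")
joined.explain()   // the plan should show a broadcast hash join
joined.show()
```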
I think you are looking for the Bucket Join optimization that should be coming in Spark 2.0.
In Spark 1.6 you can accomplish something similar, but only by caching the data; see SPARK-4849.
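For the Spark 2.0 feature, here is a minimal sketch of bucketed tables via the DataFrameWriter.bucketBy API (the table names t1/t2, the bucket count, and the data are illustrative; bucketing metadata lives in the catalog, so saveAsTable is required):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bucketed-join").master("local[*]").getOrCreate()
import spark.implicits._

// Write both tables bucketed by the join key into the same number of buckets.
Seq((1, "a"), (2, "b")).toDF("key", "v1")
  .write.bucketBy(8, "key").sortBy("key").saveAsTable("t1")
Seq((1, "x"), (2, "y")).toDF("key", "v2")
  .write.bucketBy(8, "key").sortBy("key").saveAsTable("t2")

// With matching bucket counts on the join key, the planner can perform a
// sort-merge join without shuffling either side. For tables above the
// broadcast threshold, the plan shows no Exchange before the join.
val joined = spark.table("t1").join(spark.table("t2"), "key")
joined.explain()
```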