Co-partitioned joins in spark SQL

Are there any implementations of Spark SQL DataSources that offer co-partitioned joins, most likely via a CoGroupedRDD? I did not see any uses within the existing Spark codebase.

The motivation is to greatly reduce shuffle traffic in the case that two tables have the same number and the same ranges of partitioning keys: there would then be an Mx1 instead of an MxN shuffle fanout.

The only large-scale join implementation presently in Spark SQL seems to be ShuffledHashJoin, which requires the MxN shuffle fanout and is thus expensive.

WestCoastProjects, asked Mar 04 '15

People also ask

What is co partitioning Spark?

RDDs in Spark are partitioned, using a HashPartitioner by default. Co-partitioned RDDs use the same partitioner and thus have their data distributed across partitions in the same manner.
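A minimal sketch of this in Scala with the RDD API (the data and partition count are illustrative): two HashPartitioners with the same number of partitions compare equal, so partitioning both pair RDDs with the same partitioner makes them co-partitioned, and the subsequent join is planned with narrow dependencies rather than a full shuffle.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object CoPartitionedJoin {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("copartition").setMaster("local[*]"))

    // HashPartitioners with the same partition count are equal, so both
    // RDDs below end up co-partitioned.
    val partitioner = new HashPartitioner(8)

    // Caching avoids re-shuffling when the partitioned RDDs are reused.
    val left  = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(partitioner).cache()
    val right = sc.parallelize(Seq((1, "x"), (2, "y"))).partitionBy(partitioner).cache()

    // Because left.partitioner == right.partitioner, each output partition
    // reads exactly one partition of each parent (the Mx1 case) instead of
    // M x N shuffle blocks.
    left.join(right).collect().foreach(println)

    sc.stop()
  }
}
```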

What type of joins are present in Spark?

Spark SQL supports several types of joins: inner join, cross join, left outer join, right outer join, full outer join, left semi join, and left anti join. Join scenarios are implemented in Spark SQL based upon the business use case.
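A short sketch of each join type using the Spark 2.x DataFrame API; the table and column names are illustrative, and `crossJoin` assumes Spark 2.1 or later.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("join-types").master("local[*]").getOrCreate()
import spark.implicits._

val customers = Seq((1, "Alice"), (3, "Carol")).toDF("id", "name")
val orders    = Seq((1, "widget"), (2, "gadget")).toDF("customer_id", "item")
val cond      = customers("id") === orders("customer_id")

customers.join(orders, cond, "inner").show()
customers.join(orders, cond, "left_outer").show()
customers.join(orders, cond, "right_outer").show()
customers.join(orders, cond, "full_outer").show()
customers.join(orders, cond, "left_semi").show() // customers with at least one order
customers.join(orders, cond, "left_anti").show() // customers with no orders
customers.crossJoin(orders).show()               // Cartesian product
```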

How do you reduce shuffling in performing a join of two Spark datasets?

One way to avoid shuffles when joining two datasets is to take advantage of broadcast variables. When one of the datasets is small enough to fit in memory on a single executor, it can be loaded into a hash table on the driver and then broadcast to every executor.
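In the DataFrame API this is expressed with the `broadcast` hint; a minimal sketch, with illustrative table and column names:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("broadcast-join").master("local[*]").getOrCreate()
import spark.implicits._

val large = Seq((1, "a"), (2, "b"), (3, "c")).toDF("key", "payload")
val small = Seq((1, "US"), (2, "DE")).toDF("key", "country")

// The broadcast() hint tells the planner the right side fits in memory;
// it is shipped whole to every executor and joined locally, so the large
// side is never shuffled.
val result = large.join(broadcast(small), Seq("key"))
result.explain() // the physical plan should show a BroadcastHashJoin
```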


1 Answer

I think you are looking for the Bucket Join optimization that should be coming in Spark 2.0.

In 1.6 you can accomplish something similar, but only by caching the data; see SPARK-4849.
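A sketch of the bucketed-table approach as it landed in Spark 2.0: writing both tables bucketed by the join key into the same number of buckets records the partitioning on disk, so the planner can join them without re-shuffling either side. Table names, bucket count, and data here are illustrative.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bucket-join").master("local[*]").getOrCreate()
import spark.implicits._

val left  = Seq((1, "a"), (2, "b")).toDF("key", "l")
val right = Seq((1, "x"), (2, "y")).toDF("key", "r")

// bucketBy requires saveAsTable; both tables use the same bucket count
// and bucketing column.
left.write.bucketBy(8, "key").sortBy("key").saveAsTable("left_bucketed")
right.write.bucketBy(8, "key").sortBy("key").saveAsTable("right_bucketed")

// Reading the bucketed tables back, the join can proceed without an
// Exchange on either side.
val joined = spark.table("left_bucketed").join(spark.table("right_bucketed"), "key")
joined.explain() // no shuffle stage feeding the join
```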

Michael Armbrust