Are there any implementations of Spark SQL DataSources that offer co-partitioned joins - most likely via CoGroupedRDD? I did not see any uses within the existing Spark codebase.
The motivation is to greatly reduce shuffle traffic in the case where two tables have the same number and the same ranges of partitioning keys: in that case there would be an Mx1 instead of an MxN shuffle fanout.
The only large-scale join implementation presently in Spark SQL appears to be ShuffledHashJoin, which requires the full MxN shuffle fanout and is therefore expensive.
RDDs in Spark are partitioned, using a HashPartitioner by default. Co-partitioned RDDs use the same partitioner and therefore have their data distributed across partitions in the same way.
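As a sketch of what co-partitioning buys you at the RDD level (the partition count of 8 and the toy data here are purely illustrative), pre-partitioning both sides with the same HashPartitioner lets the subsequent join run as a narrow dependency, with no shuffle at join time:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object CoPartitionedJoin {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("co-partitioned-join").setMaster("local[*]"))

    val partitioner = new HashPartitioner(8)

    // Partition both RDDs with the same partitioner and cache them,
    // so matching keys already live in the same partition on each side.
    val left  = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(partitioner).cache()
    val right = sc.parallelize(Seq((1, "x"), (2, "y"))).partitionBy(partitioner).cache()

    // Because both sides share a partitioner, this join is a narrow
    // dependency: Spark joins partition i of left with partition i of right.
    val joined = left.join(right)
    joined.collect().foreach(println)

    sc.stop()
  }
}
```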
Spark SQL supports several join types: inner, cross, left outer, right outer, full outer, left semi, and left anti. Which join to use in a given scenario depends on the business use case.
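For reference, the join type is selected via the third argument of the DataFrame join API; the orders/customers tables below are hypothetical sample data:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("join-types").master("local[*]").getOrCreate()
import spark.implicits._

val orders    = Seq((1, 100.0), (2, 250.0)).toDF("customer_id", "amount")
val customers = Seq((1, "Alice"), (3, "Carol")).toDF("customer_id", "name")

// The third argument names the join type; other accepted values include
// "inner", "cross", "right_outer", "full_outer", "left_semi", "left_anti".
val leftJoined = orders.join(customers, Seq("customer_id"), "left_outer")
leftJoined.show()
```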
One way to avoid shuffles when joining two datasets is to take advantage of broadcast variables. When one of the datasets is small enough to fit in memory on a single executor, it can be loaded into a hash table on the driver and then broadcast to every executor.
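A minimal sketch of this using the broadcast hint from org.apache.spark.sql.functions (the table names and data are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("broadcast-join").master("local[*]").getOrCreate()
import spark.implicits._

val large = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")
val small = Seq((1, "x"), (2, "y")).toDF("id", "label")

// The broadcast() hint asks the planner to ship the small table to every
// executor, so the large table is joined map-side without a shuffle.
val joined = large.join(broadcast(small), "id")
joined.explain()   // the plan should show a broadcast hash join
joined.show()
```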
I think you are looking for the Bucket Join optimization that should be coming in Spark 2.0.
In Spark 1.6 you can accomplish something similar, but only by caching the data; see SPARK-4849.
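For the Spark 2.0 feature, here is a minimal sketch of bucketed tables via the DataFrameWriter.bucketBy API (the table names t1/t2, the bucket count, and the data are illustrative; bucketing metadata lives in the catalog, so saveAsTable is required):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bucketed-join").master("local[*]").getOrCreate()
import spark.implicits._

// Write both tables bucketed by the join key into the same number of buckets.
Seq((1, "a"), (2, "b")).toDF("key", "v1")
  .write.bucketBy(8, "key").sortBy("key").saveAsTable("t1")
Seq((1, "x"), (2, "y")).toDF("key", "v2")
  .write.bucketBy(8, "key").sortBy("key").saveAsTable("t2")

// With matching bucket counts on the join key, the planner can perform a
// sort-merge join without shuffling either side. For tables above the
// broadcast threshold, the plan shows no Exchange before the join.
val joined = spark.table("t1").join(spark.table("t2"), "key")
joined.explain()
```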