In a Spark join, does table order matter like in Pig?

Related to Spark - Joining 2 PairRDD elements

When doing a regular join in Pig, the last table in the join is not brought into memory but streamed through instead. So if A has small cardinality per key and B has large cardinality, it is significantly better to do join A, B than join B, A from a performance perspective (avoiding spills and OOM).

Is there a similar concept in Spark? I didn't see any such recommendation, and I wonder how that is possible. The implementation looks pretty much the same to me as in Pig: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/CoGroupedRDD.scala

Or am I missing something?

asked Feb 24 '15 11:02

ihadanny

1 Answer

It does not make a difference. In Spark, an RDD is only brought into memory if it is cached, so to achieve the same effect you can cache the smaller RDD. Another thing you can do in Spark, which I'm not sure Pig does: if all the RDDs being joined have the same partitioner, no shuffle needs to be done.
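To illustrate the partitioner point, here is a minimal plain-Python sketch. It is not Spark code; it only simulates the idea. The names `partition_by_key`, `join_copartitioned`, and `NUM_PARTITIONS` are hypothetical, and the modulo-hash partitioner is an assumption made for the example. When both sides are partitioned with the same function, matching keys already sit in the same partition index, so each partition can be joined locally and nothing has to move between partitions. In Spark itself the corresponding steps would be `partitionBy` on both pair RDDs with the same partitioner, plus `cache` on the smaller one.

```python
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical partition count, chosen only for illustration


def partition_by_key(pairs, num_partitions=NUM_PARTITIONS):
    """Place each (key, value) pair in the bucket hash(key) % num_partitions."""
    buckets = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in pairs:
        buckets[hash(key) % num_partitions][key].append(value)
    return buckets


def join_copartitioned(left_buckets, right_buckets):
    """Join bucket i of the left side only with bucket i of the right side.

    Both sides used the same partitioner, so matching keys share a bucket
    index and no record has to move to another bucket (no "shuffle").
    """
    if len(left_buckets) != len(right_buckets):
        raise ValueError("both sides must use the same number of partitions")
    joined = []
    for left, right in zip(left_buckets, right_buckets):
        for key, left_values in left.items():
            for left_value in left_values:
                for right_value in right.get(key, []):
                    joined.append((key, (left_value, right_value)))
    return joined


# Toy data: "small" has few values per key, "large" has more.
small = [(1, "a"), (2, "b"), (3, "c")]
large = [(1, "x"), (1, "y"), (2, "z"), (4, "w")]

result = join_copartitioned(partition_by_key(small), partition_by_key(large))
print(sorted(result))
```

Swapping the arguments produces the same matches with the value pairs reversed, which mirrors the point that join order does not change the outcome here.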

answered Sep 24 '22 11:09

aaronman

Tags:

apache-spark

hadoop

apache-pig

bigdata
