I have two RDDs that I want to join, and they look like this:
val rdd1: RDD[(T, U)]
val rdd2: RDD[((T, W), V)]
It happens to be the case that the key values of rdd1 are unique, and also that the tuple-key values of rdd2 are unique. I'd like to join the two data sets so that I get the following RDD:

val rdd_joined: RDD[((T, W), (U, V))]
What's the most efficient way to achieve this? Here are a few ideas I've thought of.
Option 1:
val m = rdd1.collectAsMap
val rdd_joined = rdd2.map({ case ((t, w), u) => ((t, w), (u, m.get(t))) })
Option 2:
val distinct_w = rdd2.map({ case ((t, w), u) => w }).distinct
val rdd_joined = rdd1.cartesian(distinct_w)
  .map({ case ((t, u), w) => ((t, w), u) })
  .join(rdd2)
Option 1 will collect all of the data to the master, right? So that doesn't seem like a good option if rdd1 is large (it's relatively large in my case, although an order of magnitude smaller than rdd2). Option 2 does an ugly distinct and cartesian product, which also seems very inefficient. Another possibility that crossed my mind (but that I haven't tried yet) is to do Option 1 and broadcast the map, although it would be better to broadcast in a "smart" way so that the keys of the map are co-located with the keys of rdd2.
Has anyone come across this sort of situation before? I'd be happy to have your thoughts.
Thanks!
cogroup() can be used for much more than just implementing joins. We can also use it to implement intersect by key. Additionally, cogroup() can work on three or more RDDs at once.
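For example, here's a rough sketch of an intersect-by-key built on cogroup(); rddA and rddB are made-up sample inputs, not the RDDs from the question:

val rddA = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
val rddB = sc.parallelize(Seq((2, 2.0), (3, 3.0), (4, 4.0)))

// cogroup() yields (key, (values from rddA, values from rddB)); keeping only
// the keys where both sides are non-empty intersects the two RDDs by key.
val intersectedByKey = rddA.cogroup(rddB)
  .filter { case (_, (vs, ws)) => vs.nonEmpty && ws.nonEmpty }
  .keys

intersectedByKey.collect()  // contains 2 and 3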
The textFile method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value.
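As a small illustration (the path below is just a placeholder):

// Ask Spark for at least 8 partitions when reading the file; the default is
// one partition per HDFS block.
val lines = sc.textFile("hdfs:///path/to/data.txt", 8)
lines.partitions.length  // >= 8, depending on the input splits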
join combines two datasets: when called on datasets of type (K, V) and (K, W), it returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
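A quick sketch with made-up data:

val kv = sc.parallelize(Seq((1, "A"), (2, "B")))
val kw = sc.parallelize(Seq((1, 1.0), (3, 3.0)))

kv.join(kw).collect()           // (1,(A,1.0))
kv.leftOuterJoin(kw).collect()  // (1,(A,Some(1.0))) and (2,(B,None)), in some order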
One option is to perform a broadcast join by collecting rdd1 to the driver and broadcasting it to all mappers; done correctly, this will let us avoid an expensive shuffle of the large rdd2 RDD:
val rdd1 = sc.parallelize(Seq((1, "A"), (2, "B"), (3, "C")))
val rdd2 = sc.parallelize(Seq(((1, "Z"), 111), ((1, "ZZ"), 111), ((2, "Y"), 222), ((3, "X"), 333)))

val rdd1Broadcast = sc.broadcast(rdd1.collectAsMap())

val joined = rdd2.mapPartitions({ iter =>
  val m = rdd1Broadcast.value
  for {
    ((t, w), u) <- iter
    if m.contains(t)
  } yield ((t, w), (u, m.get(t).get))
}, preservesPartitioning = true)
The preservesPartitioning = true tells Spark that this map function doesn't modify the keys of rdd2; this will allow Spark to avoid re-partitioning rdd2 for any subsequent operations that join based on the (t, w) key.
This broadcast could be inefficient since it involves a communications bottleneck at the driver. In principle, it's possible to broadcast one RDD to another without involving the driver; I have a prototype of this that I'd like to generalize and add to Spark.
Another option is to re-map the keys of rdd2 and use the Spark join method; this will involve a full shuffle of rdd2 (and possibly rdd1):
rdd1.join(rdd2.map {
  case ((t, w), u) => (t, (w, u))
}).map {
  case (t, (v, (w, u))) => ((t, w), (u, v))
}.collect()
On my sample input, both of these methods produce the same result:
res1: Array[((Int, java.lang.String), (Int, java.lang.String))] = Array(((1,Z),(111,A)), ((1,ZZ),(111,A)), ((2,Y),(222,B)), ((3,X),(333,C)))
A third option would be to restructure rdd2 so that t is its key, then perform the above join.
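Here's a minimal sketch of that third option, reusing the sample rdd1 and rdd2 from above: rdd2 is re-keyed by t up front (with w folded into the value), the join runs on t, and the (t, w) key is rebuilt afterwards.

// Keep rdd2 keyed by t so the join on t can reuse that key structure.
val rdd2ByT = rdd2.map { case ((t, w), u) => (t, (w, u)) }
val joined = rdd2ByT.join(rdd1)                        // RDD[(Int, ((String, Int), String))]
  .map { case (t, ((w, u), v)) => ((t, w), (u, v)) }   // back to ((t, w), (u, v))
joined.collect()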