Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Which function in spark is used to combine two RDDs by keys

Tags:

Let us say I have the following two RDDs, with the following key-pair values.

rdd1 = [ (key1, [value1, value2]), (key2, [value3, value4]) ]

and

rdd2 = [ (key1, [value5, value6]), (key2, [value7]) ]

Now, I want to join them by key values, so for example I want to return the following

ret = [ (key1, [value1, value2, value5, value6]), (key2, [value3, value4, value7]) ] 

How I can I do this, in spark using Python or Scala? One way is to use join, but join would create a tuple inside the tuple. But I want to only have one tuple per key value pair.

like image 589
MetallicPriest Avatar asked Nov 13 '14 11:11

MetallicPriest


People also ask

Which of the Spark function merges the RDD values for each key?

Like, in spark paired RDDs reduceByKey() method aggregate data separately for each key. Whereas join() method, merges two RDDs together by grouping elements with the same key.

What RDD transformation can be used to combine two RDDs?

For example, pair RDDs have a reduceByKey() method that can aggregate data separately for each key, and a join() method that can merge two RDDs together by grouping elements with the same key.

When you call join operation on two pair RDDs eg K V and K w What is the result?

Explain join() operation Performs a hash join across the cluster. It is joining two datasets. When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.


1 Answers

I would union the two RDDs and to a reduceByKey to merge the values.

(rdd1 union rdd2).reduceByKey(_ ++ _)
like image 102
maasg Avatar answered Oct 19 '22 00:10

maasg