Let us say I have the following two RDDs, with the following key-value pairs.
rdd1 = [ (key1, [value1, value2]), (key2, [value3, value4]) ]
and
rdd2 = [ (key1, [value5, value6]), (key2, [value7]) ]
Now, I want to join them by key, so that, for example, the following is returned:
ret = [ (key1, [value1, value2, value5, value6]), (key2, [value3, value4, value7]) ]
How can I do this in Spark, using Python or Scala? One way is to use join, but join would nest the two value lists inside a tuple, whereas I want only one flat list of values per key.
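For illustration, here is roughly what a plain join produces on these RDDs (a Scala sketch; the SparkContext sc and the string keys/values are stand-ins for the example above):

val rdd1 = sc.parallelize(Seq(("key1", List("value1", "value2")), ("key2", List("value3", "value4"))))
val rdd2 = sc.parallelize(Seq(("key1", List("value5", "value6")), ("key2", List("value7"))))

rdd1.join(rdd2).collect()
// => Array((key1, (List(value1, value2), List(value5, value6))),
//          (key2, (List(value3, value4), List(value7))))

Each value is a tuple of two lists rather than one merged list, which is exactly the shape the question wants to avoid.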
For context: pair RDDs in Spark have a reduceByKey() method that can aggregate data separately for each key, and a join() method that can merge two RDDs together by grouping elements with the same key.
To explain the join() operation: it performs a hash join across the cluster, joining two datasets. When called on datasets of type (K, V) and (K, W), it returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin(), rightOuterJoin(), and fullOuterJoin().
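As a quick illustration of the outer variants (same hypothetical RDDs as in the sketch above): leftOuterJoin wraps the right-hand value in an Option, so keys missing from the second RDD still appear.

rdd1.leftOuterJoin(rdd2).collect()
// => Array((key1, (List(value1, value2), Some(List(value5, value6)))),
//          (key2, (List(value3, value4), Some(List(value7)))))
// A key present only in rdd1 would show up as (key, (leftValue, None)).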
I would union the two RDDs and then do a reduceByKey to merge the values.
(rdd1 union rdd2).reduceByKey(_ ++ _)
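Spelled out end to end (a sketch under the same assumptions as above, with the values held as Scala Lists so that ++ concatenates them):

(rdd1 union rdd2).reduceByKey(_ ++ _).collect()
// => Array((key1, List(value1, value2, value5, value6)),
//          (key2, List(value3, value4, value7)))

Note the design trade-off: union itself is cheap (no shuffle), and the single shuffle happens in reduceByKey, which combines each key's lists before the final result. The same union-then-reduceByKey pattern carries over directly to PySpark with a list-concatenating lambda.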