I have two RDDs, both the result of a groupBy, that look like this:
[(u'1', [u'0']), (u'3', [u'1']), (u'2', [u'0']), (u'4', [u'1'])]
and
[(u'1', [u'3', u'4']), (u'0', [u'1', u'2'])]
How can I merge the two and get the following:
[(u'1', [u'0', u'3', u'4']), (u'3', [u'1']), (u'2', [u'0']), (u'4', [u'1']), (u'0', [u'1', u'2'])]
I tried the join command, but that did not give me the result I was looking for. Any help is much appreciated.
I solved it using:
rdd2.union(rdd1).reduceByKey(lambda x, y: x + y)
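To illustrate why this works, here is a minimal pure-Python simulation of what `union` followed by `reduceByKey` does on the example data (no Spark required; note that Spark does not guarantee the order of keys or of the concatenated values):

```python
from collections import defaultdict

rdd1 = [(u'1', [u'0']), (u'3', [u'1']), (u'2', [u'0']), (u'4', [u'1'])]
rdd2 = [(u'1', [u'3', u'4']), (u'0', [u'1', u'2'])]

# union: simply concatenates the two collections of (key, value) pairs
unioned = rdd2 + rdd1

# reduceByKey(lambda x, y: x + y): concatenates the value lists per key
merged = defaultdict(list)
for key, values in unioned:
    merged[key] += values

result = list(merged.items())
```

Every key from either RDD survives, and the key `u'1'`, which appears in both, ends up with all three values merged into one list.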
None of the following Scala variants worked for me:
(rdd1 union rdd2).reduceByKey(_ ++ _)
or
rdd1.join(rdd2).map { case (k, (ls, rs)) => (k, ls ++ rs) }
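The join-based attempt cannot produce the desired output, because `join` is an inner join: it keeps only keys present in both RDDs. A small pure-Python sketch of inner-join semantics on the example data makes this visible:

```python
# The two example RDDs, viewed as key -> value-list maps
rdd1 = {u'1': [u'0'], u'3': [u'1'], u'2': [u'0'], u'4': [u'1']}
rdd2 = {u'1': [u'3', u'4'], u'0': [u'1', u'2']}

# An inner join keeps only keys found on BOTH sides
joined = {k: (rdd1[k], rdd2[k]) for k in rdd1 if k in rdd2}
# only key u'1' survives; u'0', u'2', u'3', and u'4' are all dropped
```

That is why `union` followed by `reduceByKey` is the right tool here: it keeps every key and merges values only where keys collide.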
Best of luck to everyone.