
pyspark merge two rdd together

I have two RDDs, both the result of a groupby, which look like this:

[(u'1', [u'0']), (u'3', [u'1']), (u'2', [u'0']), (u'4', [u'1'])]

and

[(u'1', [u'3', u'4']), (u'0', [u'1', u'2'])]

How can I merge the two and get the following:

[(u'1', [u'0', u'3', u'4']), (u'3', [u'1']), (u'2', [u'0']), (u'4', [u'1']), (u'0', [u'1', u'2'])]

I tried the join command, but that did not give me the result I was looking for. Any help is much appreciated.

asked Mar 15 '17 02:03 by ahajib




1 Answer

I solved it using:

rdd2.union(rdd1).reduceByKey(lambda x, y: x + y)
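To see what the union plus reduceByKey combination does to the sample data without starting Spark, here is a plain-Python sketch that mimics the two steps on ordinary lists. The helpers `union` and `reduce_by_key` are hypothetical stand-ins written for illustration, not PySpark functions. Note that on a real RDD the order of keys, and of the elements inside each merged list, is not guaranteed.

```python
# Sample data from the question, as plain Python lists instead of RDDs.
rdd1_items = [(u'1', [u'0']), (u'3', [u'1']), (u'2', [u'0']), (u'4', [u'1'])]
rdd2_items = [(u'1', [u'3', u'4']), (u'0', [u'1', u'2'])]


def union(left, right):
    """Stand-in for RDD.union: concatenate both collections, keeping duplicate keys."""
    return list(left) + list(right)


def reduce_by_key(pairs, func):
    """Stand-in for RDD.reduceByKey: combine all values that share a key with func."""
    merged = {}
    for key, value in pairs:
        if key in merged:
            merged[key] = func(merged[key], value)
        else:
            merged[key] = value
    return list(merged.items())


# Same shape as rdd1.union(rdd2).reduceByKey(lambda x, y: x + y)
merged = reduce_by_key(union(rdd1_items, rdd2_items), lambda x, y: x + y)

for key, values in merged:
    print(key, values)
```

Keys that appear in only one input pass through unchanged, and the lists for a shared key (here u'1') are concatenated, which is the result asked for in the question.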

None of the following worked for me, since they are written in Scala rather than Python:

(rdd1 union rdd2).reduceByKey(_ ++ _)

or

rdd1.join(rdd2).map(case (k, (ls, rs)) => (k, ls ++ rs))
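Even translated to Python, the join variant would not give the desired output, because join on a pair RDD is an inner join and keeps only keys present in both RDDs. The plain-Python sketch below mimics that behaviour on the sample data. The helper `inner_join` is a hypothetical stand-in, and it assumes each key occurs at most once per input, which holds after a groupby.

```python
# Sample data from the question, as plain Python lists instead of RDDs.
left_items = [(u'1', [u'0']), (u'3', [u'1']), (u'2', [u'0']), (u'4', [u'1'])]
right_items = [(u'1', [u'3', u'4']), (u'0', [u'1', u'2'])]


def inner_join(left, right):
    """Stand-in for RDD.join, assuming keys are unique within each input."""
    right_by_key = dict(right)
    return [(key, (value, right_by_key[key]))
            for key, value in left
            if key in right_by_key]


# Python shape of the Scala line:
#   rdd1.join(rdd2).map(lambda kv: (kv[0], kv[1][0] + kv[1][1]))
joined = inner_join(left_items, right_items)
join_result = [(key, ls + rs) for key, (ls, rs) in joined]

print(join_result)  # only the shared key u'1' survives
```

Keys u'3', u'2', u'4' and u'0' are dropped because each exists in only one input, which is why union followed by reduceByKey is the better fit here.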

Best of luck to everyone.

answered Nov 09 '22 14:11 by ahajib