PySpark: merge two RDDs together

Tags:

python

apache-spark

rdd

pyspark

I have two RDDs, both the result of a groupBy, which look like this:

[(u'1', [u'0']), (u'3', [u'1']), (u'2', [u'0']), (u'4', [u'1'])]

and

[(u'1', [u'3', u'4']), (u'0', [u'1', u'2'])]

How can I merge the two and get the following:

[(u'1', [u'0', u'3', u'4']), (u'3', [u'1']), (u'2', [u'0']), (u'4', [u'1']), (u'0', [u'1', u'2'])]

I tried the join command, but it did not give me the result I was looking for. Any help is much appreciated.

asked Mar 15 '17 02:03

ahajib

1 Answer

I solved it using:

rdd2.union(rdd1).reduceByKey(lambda x,y : x+y)
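To see what union followed by reduceByKey does to the sample data, here is a minimal sketch in plain Python that runs without a Spark cluster. The helper `union_reduce_by_key` is hypothetical and only mimics the semantics locally; it is not part of PySpark. Note that list concatenation is not commutative, so in real Spark the order of elements within each merged list is not guaranteed. The sketch sorts the values to make the result deterministic.

```python
from itertools import chain


def union_reduce_by_key(left, right, combine):
    """Hypothetical local stand-in for left.union(right).reduceByKey(combine)."""
    merged = {}
    # union: treat both datasets as one stream of (key, value) pairs
    for key, value in chain(left, right):
        # reduceByKey: combine values that share a key, keep lone values as-is
        merged[key] = combine(merged[key], value) if key in merged else value
    return merged


# Sample data from the question, written as plain lists instead of RDDs
rdd1 = [("1", ["0"]), ("3", ["1"]), ("2", ["0"]), ("4", ["1"])]
rdd2 = [("1", ["3", "4"]), ("0", ["1", "2"])]

result = union_reduce_by_key(rdd1, rdd2, lambda x, y: x + y)

# Sort each list so the outcome does not depend on concatenation order
result = {key: sorted(values) for key, values in result.items()}
print(sorted(result.items()))
```

Key '1' appears in both inputs, so its lists are concatenated; every other key passes through unchanged.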

None of the following worked for me, since they are written in Scala syntax rather than Python:

(rdd1 union rdd2).reduceByKey(_ ++ _)

rdd1.join(rdd2).map(case (k, (ls, rs)) => (k, ls ++ rs))
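Even translated to Python, the `join` approach would not produce the desired output, because `join` is an inner join: it keeps only keys present in both RDDs. The sketch below emulates that behaviour in plain Python; `inner_join` is a hypothetical helper, not a PySpark function, and it assumes each key occurs at most once per side (which holds after a groupBy).

```python
def inner_join(left, right):
    """Hypothetical local stand-in for left.join(right) with unique keys per side."""
    right_by_key = dict(right)
    # Keep only keys found on both sides, pairing the two values in a tuple
    return [(key, (value, right_by_key[key]))
            for key, value in left if key in right_by_key]


# Sample data from the question, written as plain lists instead of RDDs
rdd1 = [("1", ["0"]), ("3", ["1"]), ("2", ["0"]), ("4", ["1"])]
rdd2 = [("1", ["3", "4"]), ("0", ["1", "2"])]

joined = inner_join(rdd1, rdd2)
print(joined)  # only key '1' survives; keys '0', '2', '3' and '4' are dropped
```

PySpark's `fullOuterJoin` would keep every key, but it yields `None` for the missing side, so the union plus reduceByKey route needs less handling.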

Best of luck to everyone.

answered Nov 09 '22 14:11

ahajib
