pyspark join rdds by a specific key

Tags: join, rdd, pyspark

I have two RDDs that I need to join together. They look like the following:

RDD1

[(u'2', u'100', 2),
 (u'1', u'300', 1),
 (u'1', u'200', 1)]

RDD2

[(u'1', u'2'), (u'1', u'3')]

My desired output is:

[(u'1', u'2', u'100', 2)]

So I would like to select the tuples from RDD2 whose second value matches the first value of a tuple in RDD1, and combine them. I have tried join and also cartesian, and neither is working; I'm not getting even close to what I am looking for. I am new to Spark and would appreciate any help from you guys.

Thanks

asked Mar 15 '17 by dagg3r

People also ask

Can we perform join on RDD?

The join() operation on RDDs uses a (key, value) paradigm to find the intersection between sets. Therefore, before performing the join, we should format our datasets so that they conform to the (key, value) format required by Spark (following the MapReduce paradigm).
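For instance, a minimal sketch (the data here is illustrative, assuming an existing SparkContext sc):

left = sc.parallelize([('a', 1), ('b', 2)])
right = sc.parallelize([('a', 'x'), ('c', 'y')])
# join() matches pair RDDs on the first tuple element (the key)
print(left.join(right).collect())  # [('a', (1, 'x'))]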

How do you sort RDD by value?

sortBy() is used to sort data by value efficiently in PySpark. It is a method available on RDDs, and it takes a lambda expression that extracts the field to sort by.
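For instance, a minimal sketch (the data here is illustrative, assuming an existing SparkContext sc):

rdd = sc.parallelize([('a', 3), ('b', 1), ('c', 2)])
# sort by the second field of each tuple, descending
print(rdd.sortBy(lambda x: x[1], ascending=False).collect())
# [('a', 3), ('c', 2), ('b', 1)]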


1 Answer

DataFrame: if you allow using Spark DataFrames in the solution, you can turn the given RDDs into DataFrames and join them on the corresponding column.

df1 = spark.createDataFrame(rdd1, schema=['a', 'b', 'c'])  # RDD1: join key is column 'a'
df2 = spark.createDataFrame(rdd2, schema=['d', 'a'])       # RDD2: join key is its second element
df_join = df1.join(df2, on='a')
# reorder columns to match the desired output and convert Rows back to tuples
out = df_join.select('d', 'a', 'b', 'c').rdd.map(tuple).collect()
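Note that joining with on='a' keeps a single copy of the join column, so the select above only reorders the remaining columns; with the sample data this should yield [(u'1', u'2', u'100', 2)].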

RDD: just move the key you want to join on into the first tuple position, then simply use join() to do the joining:

rdd1_zip = rdd1.map(lambda x: (x[0], (x[1], x[2])))  # key RDD1 by its first element
rdd2_zip = rdd2.map(lambda x: (x[1], x[0]))          # key RDD2 by its second element
rdd_join = rdd1_zip.join(rdd2_zip)
# flatten and reorder the joined tuples to match the desired output
rdd_out = rdd_join.map(lambda x: (x[1][1], x[0], x[1][0][0], x[1][0][1])).collect()
print(rdd_out)  # [(u'1', u'2', u'100', 2)]
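Either approach gives the same result; the DataFrame version avoids the manual tuple restructuring, while the pure RDD version keeps everything in the (key, value) paradigm described above.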
answered Oct 13 '22 by titipata