How to join two RDDs in spark with python?

Suppose

rdd1 = ( (a, 1), (a, 2), (b, 1) ),
rdd2 = ( (a, ?), (a, *), (c, .) ).

Want to generate

( (a, (1, ?)), (a, (1, *)), (a, (2, ?)), (a, (2, *)) ).

Any easy methods? I think it is different from the cross join but can't find a good solution. My solution is

(rdd1
 .cartesian(rdd2)
 .filter(lambda kv: kv[0][0] == kv[1][0])
 .map(lambda kv: (kv[0][0], (kv[0][1], kv[1][1]))))

(Tuple unpacking in lambdas, e.g. lambda (k, v): ..., is Python 2 only; the version above also works in Python 3.)
Peng Sun asked Jun 22 '15 20:06

People also ask

How do I join RDD in Pyspark?

join(other, numPartitions=None) returns an RDD of pairs with matching keys, along with all the values for each such key. In the following example, there are two pairs of elements in two different RDDs. After joining these two RDDs, we get an RDD whose elements have matching keys and their values.

Which method is used to perform a right outer join between 2 pair RDDs?

rightOuterJoin(): Perform a right outer join of this and other. For each element (k, w) in other, the resulting RDD will either contain all pairs (k, (Some(v), w)) for v in this, or the pair (k, (None, w)) if no elements in this have key k.

How do you make a RDD pair?

Creating a pair RDD using the first word as the key in Java:

PairFunction<String, String, String> keyData =
    new PairFunction<String, String, String>() {
        public Tuple2<String, String> call(String x) {
            return new Tuple2(x.split(" ")[0], x);
        }
    };
JavaPairRDD<String, String> pairs = lines.mapToPair(keyData);


1 Answer

You are just looking for a simple join, e.g.

rdd = sc.parallelize([("red",20),("red",30),("blue", 100)])
rdd2 = sc.parallelize([("red",40),("red",50),("yellow", 10000)])
rdd.join(rdd2).collect()
# Gives [('red', (20, 40)), ('red', (20, 50)), ('red', (30, 40)), ('red', (30, 50))]
dpeacock answered Oct 04 '22 13:10