Suppose
rdd1 = ( (a, 1), (a, 2), (b, 1) ),
rdd2 = ( (a, ?), (a, *), (c, .) ).
I want to generate
( (a, (1, ?)), (a, (1, *)), (a, (2, ?)), (a, (2, *)) ).
Is there an easy way to do this? I think it is different from a cross join, but I can't find a good solution. My current solution is:
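# Tuple-unpacking lambdas such as "lambda (k, v): ..." are Python 2 only,
# so each (left, right) pair is indexed explicitly here.
(rdd1
    .cartesian(rdd2)
    # keep only the pairs whose keys match
    .filter(lambda kv: kv[0][0] == kv[1][0])
    # reshape ((k, v), (k, w)) into (k, (v, w))
    .map(lambda kv: (kv[0][0], (kv[0][1], kv[1][1]))))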
join(other, numPartitions=None): Returns an RDD containing a pair of values for every matching key, covering all the values for that key. Given two RDDs of key-value pairs, joining them yields an RDD whose elements are the matching keys together with one value from each RDD; keys that appear in only one of the two RDDs are dropped.
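To make the join semantics concrete without a Spark cluster, here is a minimal plain-Python sketch applied to the data from the question. The helper name inner_join is hypothetical and is not part of the Spark API. It only models what the result contains, not how Spark partitions or orders it.

```python
from collections import defaultdict

def inner_join(left, right):
    """Plain-Python model of a pair-RDD inner join (not Spark's implementation)."""
    # Group the right-hand values by key so each left pair can look them up.
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    # Emit one (key, (v, w)) pair for every combination of values sharing a key.
    return [(key, (v, w)) for key, v in left for w in right_by_key.get(key, [])]

# Lists of pairs standing in for rdd1 and rdd2 from the question.
pairs1 = [("a", 1), ("a", 2), ("b", 1)]
pairs2 = [("a", "?"), ("a", "*"), ("c", ".")]

joined = inner_join(pairs1, pairs2)
print(joined)
```

The keys b and c each appear on one side only, so they produce no output. On a real RDD the order of the joined elements is not guaranteed, unlike in this list-based sketch.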
rightOuterJoin(): Perform a right outer join of this and other. For each element (k, w) in other, the resulting RDD will either contain all pairs (k, (Some(v), w)) for v in this, or the pair (k, (None, w)) if no elements in this have key k.
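The right outer join rule above can be sketched the same way in plain Python. The helper name right_outer_join is hypothetical and is not the Spark API. The sketch assumes that a matched value appears as the bare value v rather than Some(v), with None marking a missing match.

```python
from collections import defaultdict

def right_outer_join(this, other):
    """Plain-Python model of a right outer join on lists of (key, value) pairs."""
    # Index the values of `this` by key.
    this_by_key = defaultdict(list)
    for key, v in this:
        this_by_key[key].append(v)

    result = []
    for key, w in other:
        matches = this_by_key.get(key)
        if matches:
            # Every value v in `this` with the same key is paired with w.
            result.extend((key, (v, w)) for v in matches)
        else:
            # No element of `this` has this key: keep w and pad with None.
            result.append((key, (None, w)))
    return result

this_pairs = [("red", 20), ("red", 30), ("blue", 100)]
other_pairs = [("red", 40), ("red", 50), ("yellow", 10000)]

result = right_outer_join(this_pairs, other_pairs)
print(result)
```

Every element of other_pairs survives, including yellow, which has no partner. The key blue exists only on the left, so it is dropped.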
Creating a pair RDD using the first word as the key in Java:
PairFunction<String, String, String> keyData =
  new PairFunction<String, String, String>() {
    public Tuple2<String, String> call(String x) {
      return new Tuple2(x.split(" ")[0], x);
    }
  };
JavaPairRDD<String, String> pairs = lines .
You are just looking for a simple join, e.g.
rdd = sc.parallelize([("red",20),("red",30),("blue", 100)])
rdd2 = sc.parallelize([("red",40),("red",50),("yellow", 10000)])
rdd.join(rdd2).collect()
# Gives [('red', (20, 40)), ('red', (20, 50)), ('red', (30, 40)), ('red', (30, 50))]