I'm new to Spark. I have two RDDs and want to generate a resulting RDD from them as below.
val rdd1 = sc.parallelize(Array(1, 2))
val rdd2 = sc.parallelize(Array("a", "b", "c"))
// desired: resultRDD = [(1,a), (1,b), (1,c), (2,a), (2,b), (2,c)]
Can anyone tell me which transformations or actions I need to use to generate a resultRDD like the one above? FYI, I am writing in Scala.
EDIT
Thanks. Spark's cartesian works for me, as below.
val data = Array('a', 'b')
val rdd1 = sc.parallelize(data)
val data2 = Array(1, 2, 3)
val rdd2 = sc.parallelize(data2)
rdd1.cartesian(rdd2).foreach(println)
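One caveat worth noting: on a cluster, foreach(println) runs on the executors, so the output lands in executor logs rather than the driver console. A minimal sketch of the more common pattern, collecting a small result back to the driver first (the local[*] SparkSession setup here is an assumption so the snippet runs standalone; in a Spark shell, sc already exists):

```scala
import org.apache.spark.sql.SparkSession

// Assumed local setup so the snippet is self-contained
val spark = SparkSession.builder().master("local[*]").appName("cartesian-demo").getOrCreate()
val sc = spark.sparkContext

val rdd1 = sc.parallelize(Array('a', 'b'))
val rdd2 = sc.parallelize(Array(1, 2, 3))

// collect() brings all 2 * 3 = 6 pairs back to the driver
val pairs: Array[(Char, Int)] = rdd1.cartesian(rdd2).collect()
pairs.foreach(println)
```

Only collect when the product is small: the cartesian of an m-element RDD and an n-element RDD has m * n records.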
The input RDDs are not changed, because RDDs are immutable; applying an operation instead produces one or more new RDDs.
Creating from another RDD: you can use transformations like map, flatMap, and filter to create a new RDD from an existing one. For example, adding 100 to each record of an RDD with map produces a new RDD.
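A short sketch of those transformations (the variable names and sample data are illustrative, assuming a SparkContext sc is in scope):

```scala
val rdd = sc.parallelize(Array(1, 2, 3, 4))

// map: one output record per input record; here, add 100 to each
val rdd3 = rdd.map(_ + 100)                      // 101, 102, 103, 104

// flatMap: each input record may produce zero or more output records
val expanded = rdd.flatMap(x => Seq(x, x * 10))  // 1, 10, 2, 20, 3, 30, 4, 40

// filter: keep only the records matching a predicate
val evens = rdd.filter(_ % 2 == 0)               // 2, 4
```

Each call returns a new RDD; the original rdd is left untouched.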
RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations.
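A sketch of both creation routes plus in-memory persistence (the file path is a placeholder, not a real location; parallelize, textFile, and persist are the standard SparkContext/RDD API):

```scala
import org.apache.spark.storage.StorageLevel

// From an existing Scala collection in the driver program:
val fromCollection = sc.parallelize(Seq(1, 2, 3))

// From a file on any Hadoop-supported file system (path is a placeholder):
val fromFile = sc.textFile("hdfs:///path/to/input.txt")

// Ask Spark to keep an RDD in memory so it can be reused efficiently
// across parallel operations; the first action materializes the cache:
val cached = fromCollection.persist(StorageLevel.MEMORY_ONLY)
val doubled = cached.map(_ * 2)
```

Persistence only pays off when the RDD is used by more than one action; otherwise Spark recomputes the lineage each time.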
def cartesian[U](other: RDD[U])(implicit arg0: ClassTag[U]): RDD[(T, U)]
Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of elements (a, b) where a is in this and b is in other.
See the Spark RDD API documentation for details.