
Spark: produce RDD[(X, X)] of all possible combinations from RDD[X]


Is it possible in Spark to implement the '.combinations' function from the Scala collections?

    /** Iterates over combinations.
     *
     *  @return   An Iterator which traverses the possible n-element combinations of this $coll.
     *  @example  `"abbbc".combinations(2) = Iterator(ab, ac, bb, bc)`
     */

For example, how can I get from RDD[X] to RDD[List[X]] or RDD[(X, X)] for combinations of size 2? Let's assume that all values in the RDD are unique.

Eugene Zhulenev asked Oct 24 '14 23:10


1 Answer

The Cartesian product and combinations are two different things: the Cartesian product creates an RDD of size rdd.count()^2, whereas combinations create an RDD of size rdd.count() choose 2.
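To make the size difference concrete, here is a small sketch that uses plain Scala collections as a stand-in for the RDD, so it runs without Spark. The object name `CombinationSizes` and the sample data are only illustrative assumptions.

```scala
// Illustrative only: a local List stands in for the RDD, so no Spark is needed.
object CombinationSizes {
  val xs: List[Int] = (1 to 5).toList

  // Cartesian product: every ordered pair, including (a, a) and both (a, b) and (b, a).
  val cartesian: List[(Int, Int)] = for (a <- xs; b <- xs) yield (a, b)

  // 2-element combinations: each unordered pair appears exactly once.
  val pairs: List[(Int, Int)] =
    xs.combinations(2).map { case List(a, b) => (a, b) }.toList

  def main(args: Array[String]): Unit = {
    println(cartesian.size) // 5 * 5 = 25
    println(pairs.size)     // 5 choose 2 = 10
  }
}
```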

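In Spark, combinations of two can be obtained by filtering the Cartesian product:

    val rdd = sc.parallelize(1 to 5)
    // Keep only pairs with a < b, so each unordered pair appears exactly once
    val combinations = rdd.cartesian(rdd).filter { case (a, b) => a < b }
    combinations.collect()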

Note that this only works if an ordering is defined on the elements of the RDD, since we use <. This version only handles choosing two, but it can easily be extended to larger sizes by making sure the relationship a < b holds for every adjacent pair a, b in the sequence.
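The two points in the note above could be sketched roughly as follows. This is not part of the original answer: the object `RddCombinations` and the helpers `pairs` and `triples` are hypothetical names, and the `local[*]` master is assumed only so the example runs on one machine. `triples` chains a second `cartesian` and requires a < b < c. `pairs` swaps in a different technique, comparing the indices assigned by `zipWithIndex` instead of the values, which avoids needing an ordering on X; it assumes the RDD yields its elements in a stable order when recomputed, as a parallelized collection does.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Hypothetical helper object; the names are not from the original answer.
object RddCombinations {

  // Pairs without an Ordering on X: compare the positions assigned by
  // zipWithIndex rather than the values themselves.
  def pairs[X: ClassTag](rdd: RDD[X]): RDD[(X, X)] = {
    val indexed = rdd.zipWithIndex() // RDD[(X, Long)]
    indexed.cartesian(indexed)
      .filter { case ((_, i), (_, j)) => i < j }
      .map { case ((a, _), (b, _)) => (a, b) }
  }

  // Triples: chain a second cartesian and keep only a < b < c.
  def triples[X: ClassTag: Ordering](rdd: RDD[X]): RDD[(X, X, X)] = {
    val ord = implicitly[Ordering[X]]
    rdd.cartesian(rdd).cartesian(rdd)
      .filter { case ((a, b), c) => ord.lt(a, b) && ord.lt(b, c) }
      .map { case ((a, b), c) => (a, b, c) }
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, just for trying the sketch out.
    val conf = new SparkConf().setAppName("combinations").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(Seq("a", "b", "c", "d", "e"))
      println(pairs(rdd).count())   // 5 choose 2
      println(triples(rdd).count()) // 5 choose 3
    } finally {
      sc.stop()
    }
  }
}
```

Both helpers still materialize the full Cartesian product before filtering, so the work grows with the square (or cube) of the RDD size even though the result is smaller.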

aaronman answered Sep 20 '22 06:09
