Is it possible in Spark to implement the `.combinations` function from the Scala collections library?
```scala
/** Iterates over combinations.
 *
 *  @return   An Iterator which traverses the possible n-element combinations of this $coll.
 *  @example  `"abbbc".combinations(2) = Iterator(ab, ac, bb, bc)`
 */
```
For example, how can I get from RDD[X] to RDD[List[X]] or RDD[(X, X)] for combinations of size 2? And let's assume that all values in the RDD are unique.
There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
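For illustration, a minimal sketch of both creation paths (the HDFS path here is a placeholder, not from the question):

```scala
// From an existing collection in the driver program
val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))

// From an external dataset; the path is hypothetical
val fromFile = sc.textFile("hdfs:///path/to/data.txt")
```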
You can also create an RDD from another RDD, using transformations like map, flatMap, and filter. For example, the sketch below creates a new RDD "rdd3" by adding 100 to each record of an existing RDD.
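A minimal sketch of that transformation (the names rdd and rdd3 are illustrative):

```scala
val rdd = sc.parallelize(Seq(1, 2, 3))

// map is a transformation: it returns a new RDD, leaving the original unchanged
val rdd3 = rdd.map(_ + 100)

rdd3.collect()   // Array(101, 102, 103)
```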
RDD has been the primary user-facing API in Spark since its inception. At its core, an RDD is an immutable distributed collection of elements of your data, partitioned across the nodes in your cluster, that can be operated on in parallel with a low-level API offering transformations and actions.
The action count() returns the number of elements in an RDD. For example, if an RDD has the values {1, 2, 2, 3, 4, 5, 5, 6}, then rdd.count() will return 8.
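As a quick sketch:

```scala
val rdd = sc.parallelize(Seq(1, 2, 2, 3, 4, 5, 5, 6))

// count() is an action, so it triggers execution and returns a result to the driver
rdd.count()   // res: Long = 8
```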
Cartesian product and combinations are two different things: the cartesian product will create an RDD of size rdd.count()^2, while combinations will create an RDD of size "rdd.count() choose 2". For example, with 5 unique elements the cartesian product has 25 pairs, but only 10 of them are distinct 2-element combinations.
```scala
val rdd = sc.parallelize(1 to 5)

// Keep exactly one ordering of each pair: (a, b) with a < b
val combinations = rdd.cartesian(rdd).filter { case (a, b) => a < b }

combinations.collect()
// 10 pairs: (1,2), (1,3), (1,4), (1,5), (2,3), (2,4), (2,5), (3,4), (3,5), (4,5)
```
Note that this will only work if an ordering is defined on the elements, since we use <. This version only chooses pairs, but it can easily be extended to larger combinations by requiring a < b for every adjacent pair of elements a and b in each tuple.
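For example, a minimal sketch of the same idea for combinations of size 3, assuming the same rdd as above:

```scala
// Two cartesian products give ((a, b), c); the a < b && b < c filter keeps
// exactly one ordering of each 3-element subset
val triples = rdd.cartesian(rdd).cartesian(rdd)
  .filter { case ((a, b), c) => a < b && b < c }
  .map { case ((a, b), c) => (a, b, c) }

triples.collect()   // 10 triples for 1 to 5, e.g. (1,2,3), (1,2,4), ..., (3,4,5)
```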