
Ungrouping a (key, list(values)) pair in Spark/Scala

I have data formatted in the following way:

DataRDD = [(String, List[String])]

The first string indicates the key and the list houses the values. Note that the number of values is different for each key (but is never zero). I am looking to map the RDD in such a way that there will be a key, value pair for each element in the list. To clarify this, imagine the whole RDD as the following list:

DataRDD = [(1, [a, b, c]), 
           (2, [d, e]),
           (3, [a, e, f])]

Then I would like the result to be:

DataKV  = [(1, a),
           (1, b),
           (1, c),
           (2, d),
           (2, e),
           (3, a),
           (3, e),
           (3, f)]

Next, I would like to find, for each key, all other keys that share at least one value with it. This may be returned as a list per key, even when a key shares no values with any other:

DataID  = [(1, [3]),
           (2, [3]),
           (3, [1, 2])]

Since I'm fairly new to Spark and Scala, I haven't fully grasped their concepts yet, so I hope some of you can help me, even if only with part of this.

Remy Kabel asked Nov 18 '14 06:11

1 Answer

This is a common beginner question. The solution is to use flatMapValues:

val DataRDD = sc.parallelize(Array((1, Array("a", "b", "c")), (2, Array("d", "e")),(3, Array("a", "e", "f"))))

DataRDD.flatMapValues(x => x).collect

which gives the desired result:

Array((1,a), (1,b), (1,c), (2,d), (2,e), (3,a), (3,e), (3,f))
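The second step from the question (finding which keys share at least one value) can be built on top of this result. Below is one possible sketch, not the only approach: invert each pair to (value, key), group by value, emit every pair of distinct keys that co-occur under some value, then group by key. Note that as written it drops keys that share no values with any other key; a leftOuterJoin against the full key set would restore those with empty lists. Variable names are illustrative.

```scala
// Sketch: for each key, collect the other keys sharing at least one value.
val dataKV = DataRDD.flatMapValues(x => x)   // (key, value) pairs

val dataID = dataKV
  .map { case (k, v) => (v, k) }             // invert to (value, key)
  .groupByKey()                              // value -> keys containing it
  .flatMap { case (_, keys) =>
    val ks = keys.toSeq.distinct
    // every ordered pair of distinct keys that share this value
    for (k <- ks; other <- ks if other != k) yield (k, other)
  }
  .distinct()                                // de-duplicate key pairs
  .groupByKey()                              // key -> all keys sharing a value

dataID.collect()
// For the example data this matches DataID from the question
// (up to ordering): 1 -> [3], 2 -> [3], 3 -> [1, 2]
```

groupByKey is acceptable here because the per-value key lists are expected to be small; for heavily skewed data, aggregateByKey with a set accumulator would bound memory better.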
Oscar answered Oct 26 '22 09:10