I have data formatted in the following way:
DataRDD = [(String, List[String])]
The first string is the key and the list holds the values. The number of values varies by key (but is never zero). I would like to map the RDD so that there is one key-value pair for each element in the list. To clarify, imagine the whole RDD as the following list:
DataRDD = [(1, [a, b, c]),
(2, [d, e]),
(3, [a, e, f])]
Then I would like the result to be:
DataKV = [(1, a),
(1, b),
(1, c),
(2, d),
(2, e),
(3, a),
(3, e),
(3, f)]
Next, I would like to find, for each key, all other keys that share at least one value with it. The result can be a list per key, which stays empty when no other key shares a value:
DataID = [(1, [3]),
(2, [3]),
(3, [1, 2])]
Since I'm fairly new to Spark and Scala, I have yet to fully grasp their concepts, so I hope someone can help, even if only with part of this.
This is a common question from people new to Spark. The solution for the first part is to use flatMapValues, which applies a function to each value and emits one (key, element) pair per element of the result, leaving the key unchanged:
val DataRDD = sc.parallelize(Array((1, Array("a", "b", "c")), (2, Array("d", "e")),(3, Array("a", "e", "f"))))
DataRDD.flatMapValues(x => x).collect
This gives the desired result:
Array((1,a), (1,b), (1,c), (2,d), (2,e), (3,a), (3,e), (3,f))
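For the second part of the question (the keys that share a value), one possible approach is to swap each pair to (value, key), join that RDD with itself, and then cogroup the matches back onto the original keys so that keys without a match keep an empty list. This is only a sketch under a few assumptions: the names dataKV, byValue, matches and dataID are made up for illustration, keys in the input are unique, and a local SparkContext is created (or reused) via SparkContext.getOrCreate. The self-join grows quadratically with the number of keys per value, so it may need rethinking for very common values.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context if one exists, otherwise start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("shared-values").setMaster("local[*]"))

val dataRDD = sc.parallelize(Seq(
  (1, Seq("a", "b", "c")),
  (2, Seq("d", "e")),
  (3, Seq("a", "e", "f"))))

// One (key, value) pair per list element, as in the answer above.
val dataKV = dataRDD.flatMapValues(x => x)

// Swap to (value, key) so that a self-join pairs up keys sharing a value.
val byValue = dataKV.map { case (k, v) => (v, k) }

val matches = byValue.join(byValue)               // (value, (k1, k2))
  .map { case (_, (k1, k2)) => (k1, k2) }
  .filter { case (k1, k2) => k1 != k2 }           // drop self-matches
  .distinct()                                     // keys may share several values

// cogroup keeps every original key, even one with no matching keys.
val dataID = dataRDD.cogroup(matches)
  .mapValues { case (_, others) => others.toList.sorted }

val result = dataID.collect().sortBy(_._1)
result.foreach(println)
```

With the sample data this should print (1,List(3)), (2,List(3)) and (3,List(1, 2)), which matches the DataID layout in the question.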