I have an RDD of Strings (org.apache.spark.rdd.RDD[String] = MappedRDD[18]) and I want to convert it to a map with unique ids. I did

val vertexMap = vertices.zipWithUniqueId

but this gave me another RDD of type org.apache.spark.rdd.RDD[(String, Long)], whereas I want a Map[String, Long]. How can I convert my org.apache.spark.rdd.RDD[(String, Long)] to a Map[String, Long]?

Thanks
There's a built-in collectAsMap function in PairRDDFunctions that returns a Map built from the key/value pairs in the RDD.
val vertexMap = vertices.zipWithUniqueId.collectAsMap
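As a quick illustration (the key "vertexY" below is just a placeholder), the result is an ordinary local map on the driver, so you can do lookups without touching the cluster:

// vertexMap is a local scala.collection.Map[String, Long] on the driver
// get returns None if the vertex name is not present
val idForVertexY: Option[Long] = vertexMap.get("vertexY")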
It's important to remember that an RDD is a distributed data structure. You can visualize it as 'pieces' of your data spread over the cluster. When you collect, you force all those pieces to be sent to the driver, and for that to work they need to fit in the driver's memory.
From the comments, it looks like in your case you need to deal with a large dataset. Making a Map out of it is not going to work, as it won't fit in the driver's memory and will cause OOM exceptions if you try.
You probably need to keep the dataset as an RDD. If you are creating a Map in order to look up elements, you could use lookup on a PairRDD instead, like this:
import org.apache.spark.SparkContext._ // import implicit conversions that provide PairRDDFunctions
val vertexMap = vertices.zipWithUniqueId
val vertexYId = vertexMap.lookup("vertexY")
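Note that lookup returns a Seq[Long] containing every value associated with the key, so (as a small sketch building on the snippet above) you may want just the first match:

// lookup returns all matching values; take the first one, if any
val firstId: Option[Long] = vertexYId.headOption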
Collect to the "local" machine and then convert the Array[(String, Long)] to a Map:
val rdd: RDD[String] = ???
val map: Map[String, Long] = rdd.zipWithUniqueId().collect().toMap
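For completeness, here is a minimal self-contained sketch of this approach, assuming a local SparkContext and made-up vertex names (only safe for small datasets, since collect pulls everything to the driver):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("zip-example").setMaster("local[*]"))
val vertices = sc.parallelize(Seq("vertexX", "vertexY", "vertexZ"))
// collect the (vertex, id) pairs to the driver and build a local Map
val map: Map[String, Long] = vertices.zipWithUniqueId().collect().toMap
// e.g. Map(vertexX -> 0, vertexY -> 1, vertexZ -> 2) -- the exact ids depend on partitioning
sc.stop()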