
How to convert Scala RDD to Map

I have an RDD of strings, org.apache.spark.rdd.RDD[String] = MappedRDD[18], and I want to convert it to a map with unique IDs. I ran 'val vertexMAp = vertices.zipWithUniqueId', but this gave me another RDD, of type 'org.apache.spark.rdd.RDD[(String, Long)]', whereas I want a 'Map[String, Long]'. How can I convert my 'org.apache.spark.rdd.RDD[(String, Long)]' to a 'Map[String, Long]'?

Thanks

Soumitra asked Oct 14 '14 01:10


2 Answers

There's a built-in collectAsMap function in PairRDDFunctions that returns the key-value pairs of the RDD as a map.

val vertexMap = vertices.zipWithUniqueId().collectAsMap()

It's important to remember that an RDD is a distributed data structure. You can visualize it as 'pieces' of your data spread over the cluster. When you collect, you force all those pieces to go to the driver, and for that to work they need to fit in the driver's memory.

From the comments, it looks like you need to deal with a large dataset in your case. Making a Map out of it is not going to work, because it won't fit in the driver's memory; you will get out-of-memory (OOM) errors if you try.

You probably need to keep the dataset as an RDD. If you are creating a Map in order to look up elements, you could use lookup on a PairRDD instead, like this:

import org.apache.spark.SparkContext._  // import implicit conversions to support PairRDDFunctions

val vertexMap = vertices.zipWithUniqueId()
val vertexYId = vertexMap.lookup("vertexY")  // Seq[Long]: every id stored under this key
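
To make the two options concrete, here is a minimal, self-contained sketch that runs both on a tiny RDD. It rests on a few assumptions that are not in the question: Spark 1.3 or later is on the classpath (so the SparkContext._ import is no longer needed), the job runs with a local[*] master, and the object name VertexIds, the helper withIds, and the sample vertex names are made up for illustration. Note that collectAsMap returns a scala.collection.Map rather than the default immutable Map.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical names throughout; a sketch, not the asker's actual job.
object VertexIds {
  // Pair every vertex name with a unique Long id; the result stays distributed.
  def withIds(vertices: RDD[String]): RDD[(String, Long)] =
    vertices.zipWithUniqueId()

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption for this sketch; a real job normally gets its master from spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("vertex-ids").setMaster("local[*]"))
    try {
      val vertexIds = withIds(sc.parallelize(Seq("vertexX", "vertexY", "vertexZ"))).cache()

      // Small data: pull every pair to the driver as a map.
      val asMap: scala.collection.Map[String, Long] = vertexIds.collectAsMap()

      // Large data: keep the RDD and fetch only the ids stored under one key.
      val idsForY: Seq[Long] = vertexIds.lookup("vertexY")

      println(s"${asMap.size} vertices, ids for vertexY: $idsForY")
    } finally {
      sc.stop()
    }
  }
}
```

The collectAsMap call is only safe when the whole result fits in driver memory; lookup brings back only the values for the requested key.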
maasg answered Oct 13 '22 02:10


Collect the RDD to the local (driver) machine and then convert the resulting Array[(String, Long)] to a Map:

import org.apache.spark.rdd.RDD

val rdd: RDD[String] = ???  // placeholder: your RDD of strings

val map: Map[String, Long] = rdd.zipWithUniqueId().collect().toMap
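
One thing to watch with this approach: zipWithUniqueId gives every element its own id, so a string that occurs twice in the RDD ends up with two pairs, and toMap keeps only the last pair it sees for a key. The sketch below shows that behaviour with plain Scala collections standing in for the collected array, so it runs without Spark; the object name DuplicateKeys and the sample data are hypothetical. On a real RDD, calling rdd.distinct() before zipWithUniqueId() should avoid the problem.

```scala
// Hypothetical stand-in data; no Spark required.
object DuplicateKeys {
  // Plays the role of rdd.zipWithUniqueId().collect(): "a" occurs twice with different ids.
  val collected: Array[(String, Long)] = Array(("a", 0L), ("b", 1L), ("a", 2L))

  // toMap keeps the last pair seen for each key, so the id 0 for "a" is silently dropped.
  val asMap: Map[String, Long] = collected.toMap

  // De-duplicating the strings first yields exactly one id per distinct string.
  val distinctIds: Map[String, Long] =
    collected.map(_._1).distinct.zipWithIndex.map { case (name, i) => (name, i.toLong) }.toMap

  def main(args: Array[String]): Unit = {
    println(asMap)       // two entries; "a" maps to 2
    println(distinctIds) // two entries; "a" maps to 0, "b" maps to 1
  }
}
```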
Eugene Zhulenev answered Oct 13 '22 02:10