
Difference between rdd.collect().toMap and rdd.collectAsMap()?

Is there any performance impact if I use collectAsMap on my RDD instead of rdd.collect().toMap?

I have a key-value RDD that I want to convert to a HashMap. As far as I know, collect() is not efficient on large data sets because it brings all the data to the driver. Can I use collectAsMap instead, and is there any performance impact?

Original:

val QuoteHashMap=QuoteRDD.collect().toMap 
val QuoteRDDData=QuoteHashMap.values.toSeq 
val QuoteRDDSet=sc.parallelize(QuoteRDDData.map(x => x.toString.replace("(","").replace(")",""))) 
QuoteRDDSet.saveAsTextFile(Quotepath) 

Change:

val QuoteHashMap=QuoteRDD.collectAsMap() 
val QuoteRDDData=QuoteHashMap.values.toSeq 
val QuoteRDDSet=sc.parallelize(QuoteRDDData.map(x => x.toString.replace("(","").replace(")",""))) 
QuoteRDDSet.saveAsTextFile(Quotepath)
sri hari kali charan Tummala asked Oct 20 '15 09:10





2 Answers

The implementation of collectAsMap is the following:

def collectAsMap(): Map[K, V] = self.withScope {
  val data = self.collect()
  val map = new mutable.HashMap[K, V]
  map.sizeHint(data.length)
  data.foreach { pair => map.put(pair._1, pair._2) }
  map
}

Thus there is no performance difference between collect().toMap and collectAsMap(): collectAsMap also calls collect under the hood and then copies the pairs into a hash map on the driver.
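To make the equivalence concrete, here is a minimal plain-Scala sketch that needs no Spark. It applies the same steps as the quoted method body to an array that stands in for the result of collect(). The object name CollectAsMapSketch, the helper asMapLikeSpark, and the sample data are assumptions made for illustration, not part of Spark.

```scala
import scala.collection.mutable

object CollectAsMapSketch {
  // Hypothetical helper: repeats the steps of the collectAsMap body quoted
  // above, on an array that has already been collected to the driver.
  def asMapLikeSpark[K, V](data: Array[(K, V)]): scala.collection.Map[K, V] = {
    val map = new mutable.HashMap[K, V]
    map.sizeHint(data.length)
    data.foreach { pair => map.put(pair._1, pair._2) }
    map
  }

  def main(args: Array[String]): Unit = {
    // Stand-in for rdd.collect(); note the duplicate key "a".
    val collected = Array("a" -> 1, "b" -> 2, "a" -> 3)

    val viaToMap = collected.toMap                 // immutable Map
    val viaCollectAsMap = asMapLikeSpark(collected) // backed by a mutable HashMap

    // Both keep the last value seen for a duplicate key, so the contents match.
    println(viaToMap == viaCollectAsMap)
  }
}
```

The only visible difference is the type of the result: toMap yields an immutable Map, while the collectAsMap approach hands back a scala.collection.Map backed by a mutable HashMap. In both cases every pair has already been pulled to the driver.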

Till Rohrmann answered Oct 28 '22 08:10



No difference. Avoid collect() wherever you can: it pulls the entire data set onto the driver, which defeats the parallelism Spark provides.
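For the pipeline in the question, one way to follow this advice is to keep the work on the executors and never build the map on the driver. The sketch below is an assumption about what the asker wants, not a tested drop-in replacement: quoteRDD is a hypothetical stand-in for QuoteRDD, the output path is made up, and reduceByKey is used only to imitate the "one value per key" effect of toMap (which value survives for a duplicate key is arbitrary here). It runs Spark in local mode and assumes Spark is on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SaveWithoutCollect {
  // Same string clean-up as in the question: drop the tuple parentheses.
  def formatValue(value: Any): String =
    value.toString.replace("(", "").replace(")", "")

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("save-without-collect").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical sample data standing in for QuoteRDD.
      val quoteRDD = sc.parallelize(Seq(
        "q1" -> ("AAPL", 101.5),
        "q2" -> ("MSFT", 47.2),
        "q1" -> ("AAPL", 102.0)
      ))
      // Hypothetical output directory; saveAsTextFile fails if it already exists.
      val outputPath = args.headOption.getOrElse("/tmp/quotes-out")

      quoteRDD
        .reduceByKey((a, b) => b) // keep one value per key, without using the driver
        .values                   // drop the keys, as QuoteHashMap.values did
        .map(formatValue)
        .saveAsTextFile(outputPath)
    } finally {
      sc.stop()
    }
  }
}
```

If duplicate keys cannot occur, the reduceByKey step can be dropped, leaving only values, map, and saveAsTextFile.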

Meet Vadera answered Oct 28 '22 09:10
