Is there any performance impact when I use collectAsMap on my RDD instead of rdd.collect().toMap?
I have a key-value RDD and I want to convert it to a HashMap. As far as I know, collect() is not efficient on large data sets because it brings all the data to the driver. Can I use collectAsMap instead? Is there any performance impact?
Original:
val QuoteHashMap = QuoteRDD.collect().toMap
val QuoteRDDData = QuoteHashMap.values.toSeq
val QuoteRDDSet = sc.parallelize(QuoteRDDData.map(x => x.toString.replace("(", "").replace(")", "")))
QuoteRDDSet.saveAsTextFile(Quotepath)
Change:
val QuoteHashMap = QuoteRDD.collectAsMap()
val QuoteRDDData = QuoteHashMap.values.toSeq
val QuoteRDDSet = sc.parallelize(QuoteRDDData.map(x => x.toString.replace("(", "").replace(")", "")))
QuoteRDDSet.saveAsTextFile(Quotepath)
collect() is an action on an RDD or DataFrame that retrieves the data to the driver. It gathers all the elements from every partition and brings them over to the driver node/program.
The collect() action retrieves all elements of the dataset (RDD/DataFrame/Dataset) to the driver program as an array: Array[T] for an RDD[T], Array[Row] for a DataFrame. The collectAsList() action on a DataFrame/Dataset is similar to collect() but returns a java.util.List.
collect without parameters fetches all data stored in an RDD to the driver. It returns an array that contains all of the elements in this RDD. This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.
The implementation of collectAsMap is the following:
def collectAsMap(): Map[K, V] = self.withScope {
  val data = self.collect()
  val map = new mutable.HashMap[K, V]
  map.sizeHint(data.length)
  data.foreach { pair => map.put(pair._1, pair._2) }
  map
}
Thus, there is no performance difference between collect and collectAsMap, because collectAsMap also calls collect under the hood.
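To make the comparison concrete, here is a minimal sketch in plain Scala (no Spark needed) that applies the same steps as the quoted collectAsMap body to a local array and compares the result with toMap. The object name, the method names, and the sample data are made up for illustration; the array stands in for whatever rdd.collect() would return on the driver.

```scala
import scala.collection.mutable

object CollectAsMapSketch {
  // Stand-in for the array that rdd.collect() would return on the driver
  // (hypothetical sample data, including a duplicate key).
  val collected: Array[(String, Int)] = Array("a" -> 1, "b" -> 2, "a" -> 3)

  // Same steps as the collectAsMap body quoted above, applied to a local array.
  def viaCollectAsMapLogic[K, V](data: Array[(K, V)]): scala.collection.Map[K, V] = {
    val map = new mutable.HashMap[K, V]
    map.sizeHint(data.length)
    data.foreach { pair => map.put(pair._1, pair._2) }
    map
  }

  // What rdd.collect().toMap does once the array is already on the driver.
  def viaToMap[K, V](data: Array[(K, V)]): Map[K, V] = data.toMap

  def main(args: Array[String]): Unit = {
    val a = viaCollectAsMapLogic(collected)
    val b = viaToMap(collected)
    // Both are a single pass over the collected array; for duplicate keys
    // the last pair wins in both cases.
    println(a == b)
  }
}
```

In both variants the expensive part is moving every element to the driver; building the map afterwards is one linear pass either way. The visible difference is the result type: toMap yields an immutable Map, while the collectAsMap logic yields a map backed by a mutable HashMap.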
No difference. Avoid collect() as much as you can: it pulls all of the data onto the driver, so you lose the parallelism Spark provides.
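For the code in the question, one way to follow this advice might be to skip the driver round trip and write the values straight from the executors. The sketch below assumes the keys in the RDD are already unique; the map built in the question silently keeps one value per key, and this version does not deduplicate. If that matters, a reduceByKey step before the map would be one option. The object name SaveWithoutCollect and the helper saveValues are hypothetical, and the output order of the lines may differ from the original.

```scala
import org.apache.spark.rdd.RDD

object SaveWithoutCollect {
  // Hypothetical helper: formats each value on the executors and writes it
  // out directly, without collect(), collectAsMap() or sc.parallelize.
  // Assumes keys are unique; duplicate keys are NOT collapsed here.
  def saveValues[K, V](quoteRDD: RDD[(K, V)], quotePath: String): Unit = {
    quoteRDD
      .map { case (_, value) => value.toString.replace("(", "").replace(")", "") }
      .saveAsTextFile(quotePath)
  }
}
```

With the names from the question, the call would look like SaveWithoutCollect.saveValues(QuoteRDD, Quotepath).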