I would like to know how collectAsMap works in Spark. More specifically I would like to know where the aggregation of the data of all partitions will take place? The aggregation either takes place in master or in workers. In the first case each worker send its data on master and when the master collects the data from each one worker, then master will aggregate the results. In the second case the workers are responsible to aggregate the results(after they exchange the data among them) and after that the results will be sent to the master.
It is critical for me to find a way so as the master to be able collect the data from each partition separately, without workers exchange data.
collectAsMap. Returns the pair RDD as a Map to the Spark Master. countByKey. Returns the count of each key elements. This returns the final result to local Map which is your driver.
Spark Paired RDDs are defined as the RDD containing a key-value pair. There is two linked data item in a key-value pair (KVP). We can say the key is the identifier, while the value is the data corresponding to the key value. In addition, most of the Spark operations work on RDDs containing any type of objects.
Return an RDD containing all pairs of elements with matching keys in self and other . Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in self and (k, v2) is in other . Performs a hash join across the cluster.
For example, pair RDDs have a reduceByKey() method that can aggregate data separately for each key, and a join() method that can merge two RDDs together by grouping elements with the same key.
First of all in both the operations, all of your data which is present in RDD will travel from different executors/workers to Master/Driver. Both collect and collectAsMap will just collate the data from various executors/workers. SO this is why it is always recommended Not to use collect until and unless you don't have any other option.
I must say, this is the last collection one must consider from performance point of view.
Regards,
Neeraj
Supporting to the above answers:
collectAsMap()
- returns the key-value pairs as dictionary (countByKey()
is another function which return dictionary.)
collectAsMap()
, Collect()
, take(n)
, takeOrdered(n)
, takeSample(False,..)
These methods brings all the data to the driver. Programmer need to take precaustion while using them in production.
You can see how they are doing collectAsMap here. Since the RDD type is a tuple it looks like they just use the normal RDD collect and then translate the tuples into a map of key,value pairs. But they do mention in the comment that multi-map isn't supported, so you need a 1-to-1 key/value mapping across your data.
collectAsMap function
What collect does is execute a Spark job and get back the results from each partition from the workers and aggregates them with a reduce/concat phase on the driver.
collect function
So given that, it should be the case that the driver collects the data from each partition separately without workers exchanging data to perform collectAsMap
.
Note, if you are doing transformations on your RDD prior to using collectAsMap
that cause a shuffle to occur, there may be an intermediate step that causes workers to exchange data amongst themselves. Check out your cluster master's application UI to see more information regarding how spark is executing your application.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With