Spark CollectAsMap

I would like to know how collectAsMap works in Spark. More specifically, where does the aggregation of the data from all partitions take place: on the master or on the workers? In the first case, each worker sends its data to the master, and once the master has collected the data from every worker, it aggregates the results. In the second case, the workers are responsible for aggregating the results (after exchanging data among themselves), and only then are the results sent to the master.

It is critical for me to find a way for the master to collect the data from each partition separately, without the workers exchanging data.

asked Apr 22 '15 18:04

Χρήστος Μάλλιος

3 Answers

First of all, in both operations all of the data in your RDD travels from the executors/workers to the master/driver. Both collect and collectAsMap simply collate the data from the various executors/workers. This is why it is always recommended not to use collect unless you have no other option.

I must say, from a performance point of view this is the last operation one should consider.

collect will return the results as an Array.
collectAsMap will return the results of a paired RDD as a Map collection. Since it returns a Map, you will only get pairs with unique keys; when several pairs share a key, only one of them is kept.
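The duplicate-key behavior can be illustrated without Spark. The following is a minimal plain-Python sketch, assuming the map is built by inserting the collected pairs in order (as Python's dict() does); `collected_pairs` is a hypothetical stand-in for what collect would return on a paired RDD, not real Spark output.

```python
# Plain-Python sketch; no Spark involved.
# `collected_pairs` is a hypothetical stand-in for the result of collect on a paired RDD.
collected_pairs = [("a", 1), ("b", 2), ("a", 3)]

# Building a map from the pairs: a later value overwrites an earlier one for the same key.
as_map = dict(collected_pairs)

print(collected_pairs)  # the list keeps all three pairs, including both "a" entries
print(as_map)           # {'a': 3, 'b': 2}
```

If every value for a key is needed, the pairs would have to be grouped by key before being turned into a map.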

Regards,

Neeraj

answered Oct 18 '22 20:10

Neeraj Bhadani

Adding to the above answers:

collectAsMap() - returns the key-value pairs as a dictionary (countByKey() is another function which returns a dictionary).

collectAsMap(), collect(), take(n), takeOrdered(n), takeSample(False,..)

These methods bring their results to the driver. Programmers need to take precautions when using them in production.

answered Oct 18 '22 22:10

Krish

You can see how they are doing collectAsMap here. Since the RDD type is a tuple, it looks like they just use the normal RDD collect and then translate the tuples into a map of key/value pairs. But they do mention in the comment that multi-map isn't supported, so you need a 1-to-1 key/value mapping across your data.

collectAsMap function

What collect does is execute a Spark job, get back the results of each partition from the workers, and aggregate them with a reduce/concat phase on the driver.

collect function

So, given that, the driver should collect the data from each partition separately, without workers exchanging data, when performing collectAsMap.
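The flow described above can be modeled in a few lines. This is only a plain-Python sketch of the idea, not Spark's actual implementation: `partitions` and `run_task` are hypothetical names, and each inner list stands in for one partition held by a worker.

```python
from itertools import chain

# Hypothetical partitions of a paired RDD, one list per partition.
partitions = [
    [("a", 1), ("b", 2)],
    [("c", 3)],
    [("d", 4), ("e", 5)],
]

def run_task(partition):
    """Stand-in for the task a worker runs: materialize its own partition only."""
    return list(partition)

# Each task result reaches the driver on its own; no task reads another partition.
per_partition_results = [run_task(p) for p in partitions]

# Driver-side concat phase: join the per-partition results in partition order.
collected = list(chain.from_iterable(per_partition_results))

# Driver-side map construction on top of the collected pairs.
as_map = dict(collected)

print(per_partition_results)
print(as_map)
```

In this model `per_partition_results` is what the question asks for: the data of each partition, still separate, on the driver side.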

Note: if you apply transformations to your RDD before collectAsMap that cause a shuffle, there may be an intermediate step in which workers exchange data among themselves. Check your cluster master's application UI for more information on how Spark is executing your application.

answered Oct 18 '22 22:10

Rich

Tags:

distributed-computing

apache-spark

worker