
Which is faster in Spark: collect() or toLocalIterator()?

Tags:

apache-spark

I have a Spark application in which I need to get data from the executors to the driver, and I am using collect(). However, I also came across toLocalIterator(). As far as I have read about toLocalIterator() on the Internet, it returns an iterator rather than sending the whole RDD at once, so it has better memory performance. But what about speed? How do collect() and toLocalIterator() compare in execution/computation time?

Asked Jun 03 '17 by AMANDEEP SINGH

People also ask

What is the difference between collect() and take() in Spark?

collect() returns the full contents of a DataFrame to the driver. df.take(n) can be used instead to retrieve just the first n rows, which is the safer way to inspect a very large dataset.

What does Spark collect() do?

PySpark collect() retrieves data from a DataFrame to the driver. collect() is an action on an RDD or DataFrame that gathers all the elements from every partition and brings them over to the driver node/program.
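For illustration, here is a minimal PySpark sketch of that difference (the SparkSession setup and the 1M-row DataFrame are made up for the example):

from pyspark.sql import SparkSession

# Toy DataFrame for illustration only.
spark = SparkSession.builder.appName("collect-vs-take").getOrCreate()
df = spark.range(1000000)

preview = df.take(5)     # brings only the first 5 rows to the driver
all_rows = df.collect()  # brings all 1,000,000 rows to the driver at once

print(preview)           # [Row(id=0), Row(id=1), ...]
print(len(all_rows))     # 1000000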


1 Answer

The answer to this question depends on what you do after calling df.collect() or df.rdd.toLocalIterator(). For example, suppose you are processing a considerably big file of about 7M rows, and after doing all the required transformations you need to iterate over each record in the dataframe and make service calls in batches of 100.

With df.collect(), the entire set of records is dumped onto the driver, so the driver needs an enormous amount of memory. With toLocalIterator(), you instead get an iterator that pulls only one partition's worth of records at a time, so the driver never has to hold the whole dataset. If you load several such big files in parallel workflows inside the same cluster, df.collect() will cost you a lot in driver memory, whereas toLocalIterator() will not, and it will be faster and more reliable as well.
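Here is a hedged sketch of that batching pattern; call_service is a hypothetical stand-in for the real external call, and df is assumed to be the already-transformed dataframe:

from itertools import islice

def call_service(batch):
    # Hypothetical stand-in for your real service client.
    print("sending %d records" % len(batch))

# toLocalIterator() streams the RDD to the driver one partition at a
# time, so only the current partition and batch sit in driver memory.
rows = df.rdd.toLocalIterator()
while True:
    batch = list(islice(rows, 100))  # at most 100 records per call
    if not batch:
        break
    call_service(batch)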

On the other hand, if you plan on doing further transformations on the result after df.collect() or df.rdd.toLocalIterator(), then df.collect() will be faster, since the data is already fully materialized in driver memory, whereas toLocalIterator() triggers a separate fetch for each partition.

Also, if your file is so small that Spark's default partitioning logic does not break it down into multiple partitions at all, then df.collect() will be faster.
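If you are unsure which case applies, you can check how many partitions Spark actually created (a quick sketch):

# A single partition suggests collect() has little overhead here.
print(df.rdd.getNumPartitions())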

Answered Sep 21 '22 by Kaushik Ghosh