I have a Spark application in which I need to get data from the executors to the driver, and I am using collect(). However, I also came across toLocalIterator(). From what I have read about toLocalIterator() on the Internet, it returns an iterator rather than sending the whole RDD at once, so it has better memory performance, but what about speed? How does the performance of collect() compare with toLocalIterator() in terms of execution/computation time?
collect() returns the full content of the DataFrame (as Row objects, which carry the column structure). For a very large dataset, df.take(n) can be used instead to retrieve the same kind of output for only a limited number of rows.
In PySpark, collect() is an action on an RDD or DataFrame that retrieves the data to the driver. It is useful for retrieving all the elements of every row from each partition of an RDD and bringing them over to the driver node/program.
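A minimal sketch of the difference, assuming a local Spark session and a toy DataFrame (the app name and size are illustrative only):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("collect-vs-take").getOrCreate()
    df = spark.range(1000)        # DataFrame with a single "id" column

    all_rows = df.collect()       # pulls every row from all partitions to the driver
    preview = df.take(5)          # pulls only the first 5 rows -- safer for large data

    print(len(all_rows), preview)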
The answer to this question depends on what you would do after calling df.collect() or df.rdd.toLocalIterator(). For example, suppose you are processing a considerably big file of about 7M rows, and after doing all the required transformations you need to iterate over the records in the DataFrame and make service calls in batches of 100. With df.collect(), the entire set of records is dumped onto the driver, so the driver needs an enormous amount of memory. With toLocalIterator(), on the other hand, only an iterator over one partition of the total records is materialized at a time, so the driver does not need an enormous amount of memory. So if you are going to load such big files in parallel workflows inside the same cluster, df.collect() will cost you a lot, whereas toLocalIterator() will not, and it will be faster and more reliable as well.
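A rough sketch of that batching pattern, where send_batch() stands in for the hypothetical service call (it is not part of Spark, and the batch size of 100 comes from the scenario above):

    from itertools import islice

    it = df.rdd.toLocalIterator()        # driver holds one partition at a time

    while True:
        batch = list(islice(it, 100))    # next batch of up to 100 records
        if not batch:
            break
        send_batch(batch)                # hypothetical service call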
On the other hand, if you plan on doing some transformations after df.collect() or df.rdd.toLocalIterator(), then df.collect() will be faster, because it fetches all partitions in parallel, whereas toLocalIterator() fetches them one at a time.
Also, if your file is so small that Spark's default partitioning logic does not break it down into multiple partitions at all, then df.collect() will be faster.
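If in doubt, a simple way to check the trade-off on your own data is to time both paths; this is only a sketch, assuming df already exists, and wall-clock timing on a shared cluster is approximate:

    import time

    start = time.time()
    rows = df.collect()                              # fetches all partitions in parallel
    print("collect():", time.time() - start, "seconds,", len(rows), "rows")

    start = time.time()
    n = sum(1 for _ in df.rdd.toLocalIterator())     # fetches partitions one by one
    print("toLocalIterator():", time.time() - start, "seconds,", n, "rows")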