While coding Spark programs I came across this toLocalIterator() method; earlier I had only been using the iterator() method.
If anyone has used this method, please shed some light on it.
I came across it while using the foreach and foreachPartition methods in a Spark program.
Can I pass the result of the foreach method to the toLocalIterator method, or vice versa?
toLocalIterator() -> foreachPartition()
iterator() -> foreach()
First of all, the iterator method of an RDD should not be called by users. As you can read in the [Javadocs](https://spark.apache.org/docs/1.0.2/api/java/org/apache/spark/rdd/RDD.html#iterator(org.apache.spark.Partition, org.apache.spark.TaskContext)): "This should *not* be called by users directly, but is available for implementors of custom subclasses of RDD."
As for toLocalIterator, it is used to bring the data of an RDD, scattered across your cluster, back to a single node (the driver, where the program is running) so you can process all of it there. It is similar to the collect method, but instead of returning a List it returns an Iterator, and it fetches the data one partition at a time, so the driver only needs enough memory to hold the largest single partition.
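A minimal sketch of the difference (assuming a local SparkContext; the object and application names here are just illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ToLocalIteratorDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("toLocalIterator-demo").setMaster("local[*]"))

    val rdd = sc.parallelize(1 to 1000000, numSlices = 8)

    // collect() materializes the whole RDD in driver memory at once:
    // val all: Array[Int] = rdd.collect()

    // toLocalIterator streams the RDD to the driver one partition at a time,
    // so the driver only needs memory for the largest single partition.
    val it: Iterator[Int] = rdd.toLocalIterator
    it.take(5).foreach(println)

    sc.stop()
  }
}
```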
foreach applies a function to each element of the RDD, while foreachPartition applies a function to each partition. With the first you handle one element at a time; with the second you receive an iterator over a whole partition, which is useful when some expensive setup (opening a database connection, for example) should happen once per partition rather than once per element.
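A sketch of the two, reusing the `sc` and `rdd` from the snippet above; the connection setup is hypothetical and only there to show why once-per-partition work matters:

```scala
// foreach: the function runs once per element, on the executors.
rdd.foreach { x =>
  // cheap per-element work, e.g. updating an accumulator or logging
  println(x)
}

// foreachPartition: the function runs once per partition and receives an
// iterator over that partition's elements, so expensive setup is amortized.
rdd.foreachPartition { partition =>
  // hypothetical: open one connection per partition instead of per element
  // val conn = createConnection()
  partition.foreach { x =>
    // conn.write(x)
    println(x)
  }
  // conn.close()
}
```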
So yes, after applying a function to an RDD using foreach or foreachPartition you can call toLocalIterator to get an iterator over the RDD's contents and process them on the driver. However, bear in mind that if you accumulate all the data on the driver (or any single partition is very large), you may run into memory issues. If you need the result as an RDD again after the driver-side processing, use the SparkContext to parallelize it again.
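For example, still assuming the same `sc` and `rdd` as above, something along these lines:

```scala
// Pull the data to the driver one partition at a time, process it locally,
// then ship the result back to the cluster with parallelize.
val processedLocally: Vector[Int] = rdd.toLocalIterator
  .map(_ * 2)   // driver-side processing
  .toVector     // note: this materializes everything on the driver

val backOnCluster = sc.parallelize(processedLocally)
println(backOnCluster.count())
```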