While writing Spark programs I came across the toLocalIterator() method; previously I had only been using the iterator() method. If anyone has used it, please shed some light on it. I ran into it while working with the foreach and foreachPartition methods in a Spark program. Can I pass the result of foreach to toLocalIterator, or vice versa? The correspondence I have in mind is:
toLocalIterator() -> foreachPartition()
iterator() -> foreach()
First of all, the iterator method of an RDD should not be called directly. As the [Javadocs](https://spark.apache.org/docs/1.0.2/api/java/org/apache/spark/rdd/RDD.html#iterator(org.apache.spark.Partition, org.apache.spark.TaskContext)) state: "This should *not* be called by users directly, but is available for implementors of custom subclasses of RDD."
As for toLocalIterator, it is used to collect the data of an RDD, which is scattered across your cluster, onto a single node (the one the driver program runs on), so that you can work with all the data on that node. It is similar to the collect method, but instead of returning a List it returns an Iterator.
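Here is a minimal sketch of the difference (the app name, master, and RDD contents are made up for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal local setup, purely for illustration.
val conf = new SparkConf().setAppName("toLocalIteratorSketch").setMaster("local[*]")
val sc   = new SparkContext(conf)

val numbers = sc.parallelize(1 to 100, numSlices = 4)

// collect() returns everything at once as an Array on the driver.
val asArray: Array[Int] = numbers.collect()

// toLocalIterator returns the same data lazily, fetching one partition
// at a time, so the driver only needs to hold one partition in memory.
val asIterator: Iterator[Int] = numbers.toLocalIterator
asIterator.take(5).foreach(println)
```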
foreach applies a function to each element of the RDD, while foreachPartition applies a function to each partition. With the former you get one element at a time (which allows more parallelism); with the latter you get a whole partition at once (useful if you need to perform an operation over all the data in a partition).
So yes, after applying a function to an RDD with foreach or foreachPartition, you can call toLocalIterator to get an iterator over the entire contents of the RDD and process it on the driver. However, bear in mind that if your RDD is very big, you may run into memory issues. If you want to turn the result back into an RDD after doing the operations you need, use the SparkContext to parallelize it again.
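For example, a sketch of that last round trip, again assuming the `sc` from above (the driver-side transformation is made up):

```scala
val source = sc.parallelize(Seq("spark", "scala", "iterator"))

// Pull the data to the driver one partition at a time and process it there.
val lengths: Seq[Int] = source.toLocalIterator.map(_.length).toSeq

// Hand the driver-side result back to the cluster as a new RDD.
val backOnCluster = sc.parallelize(lengths)
```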