I would like a way to return rows from my RDD one at a time (or in small batches) so that I can collect the rows locally as I need them. My RDD is large enough that it cannot fit into memory on the driver node, so running collect() would cause an error.

Is there a way to recreate the collect() operation but with a generator, so that rows from the RDD are passed into a buffer? Another option would be to take() 100000 rows at a time from a cached RDD, but I don't think take() allows you to specify a start position?
The best available option is to use RDD.toLocalIterator, which collects only a single partition at a time. It creates a standard Python generator:
rdd = sc.parallelize(range(100000))  # sc is the active SparkContext

# toLocalIterator() fetches one partition at a time to the driver
iterator = rdd.toLocalIterator()
type(iterator)
## generator

# The result can be consumed lazily like any other Python generator
even = (x for x in iterator if not x % 2)
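If you specifically want fixed-size batches rather than one row at a time, you can wrap the iterator with itertools.islice. Here is a minimal sketch; the batch size of 100000 mirrors the one in the question, and process is a hypothetical placeholder for your own logic:

from itertools import islice

def in_batches(iterator, batch_size):
    """Yield lists of up to batch_size items from any iterator."""
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

for batch in in_batches(rdd.toLocalIterator(), 100000):
    process(batch)  # placeholder: replace with your own handling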
You can adjust the amount of data collected in a single batch by using a specific partitioner and tuning the number of partitions.
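For example, since each fetch pulls one partition, repartitioning the RDD controls roughly how much data arrives per batch. The target of 100 partitions below is an arbitrary choice for illustration (note that repartition triggers a shuffle):

# With 100 partitions over 100000 elements, each fetch brings back
# roughly 1000 elements, assuming the data is evenly distributed
iterator = rdd.repartition(100).toLocalIterator()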
Unfortunately this comes at a price: to collect small batches you have to start multiple Spark jobs, which is quite expensive. So, generally speaking, collecting one element at a time is not a viable option.