collect RDD with buffer in pyspark

I would like a way to return rows from my RDD one at a time (or in small batches) so that I can collect the rows locally as I need them. My RDD is large enough that it cannot fit into memory on the driver node, so running collect() would cause an error.

Is there a way to recreate the collect() operation but with a generator, so that rows from the RDD are passed into a buffer? Another option would be to take() 100000 rows at a time from a cached RDD, but I don't think take() allows you to specify a start position.

asked Nov 19 '15 by mgoldwasser


1 Answer

The best available option is to use RDD.toLocalIterator, which collects only a single partition at a time. It creates a standard Python generator:

# assumes a live SparkContext, e.g. the `sc` provided by the pyspark shell
rdd = sc.parallelize(range(100000))
iterator = rdd.toLocalIterator()
type(iterator)

## generator

even = (x for x in iterator if not x % 2)

You can adjust the amount of data collected in a single batch by using a specific partitioner and adjusting the number of partitions.
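For example, a minimal sketch along these lines (assuming the pyspark shell's `sc` and a placeholder `process` function, neither of which comes from the original answer) controls the batch size by fixing the number of partitions before iterating:

# more partitions -> smaller batches pulled to the driver per Spark job
rdd = sc.parallelize(range(1000000), numSlices=100)  # ~10,000 rows per partition

for row in rdd.toLocalIterator():
    process(row)  # `process` is a placeholder for whatever local work you need

# an existing RDD can be repartitioned first to shrink the batch size
iterator = rdd.repartition(1000).toLocalIterator()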

Unfortunately, it comes at a price: collecting small batches requires starting multiple Spark jobs, which is quite expensive. So, generally speaking, collecting one element at a time is not an option.
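If you still want fixed-size local batches (e.g. the 100000 rows mentioned in the question), one option is to group the local iterator on the driver side. The sketch below reuses the `rdd` from above and a hypothetical `batches` helper built on itertools.islice; Spark still runs one job per partition, the grouping only changes how you consume the rows locally:

from itertools import islice

def batches(iterator, size):
    # yield lists of `size` rows drawn from a local iterator
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            break
        yield batch

for batch in batches(rdd.toLocalIterator(), 100000):
    pass  # work on 100000 rows at a time on the driver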

answered Oct 06 '22 by zero323