I would like a way to return rows from my RDD one at a time (or in small batches) so that I can collect the rows locally as I need them. My RDD is large enough that it cannot fit into memory on the driver node, so running collect() would cause an error.

Is there a way to recreate the collect() operation but with a generator, so that rows from the RDD are passed into a buffer? Another option would be to take() 100000 rows at a time from a cached RDD, but I don't think take() allows you to specify a start position?
The best available option is to use RDD.toLocalIterator, which collects only a single partition at a time. It creates a standard Python generator:
rdd = sc.parallelize(range(100000))  # sc is the active SparkContext

# toLocalIterator() fetches one partition at a time to the driver
iterator = rdd.toLocalIterator()
type(iterator)
## generator

# The result can be consumed lazily like any other Python generator
even = (x for x in iterator if not x % 2)
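If you specifically want fixed-size batches rather than one row at a time, you can wrap the iterator with itertools.islice. Here is a minimal sketch; the batch size of 100000 mirrors the one in the question, and process is a hypothetical placeholder for your own logic:

from itertools import islice

def in_batches(iterator, batch_size):
    """Yield lists of up to batch_size items from any iterator."""
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

for batch in in_batches(rdd.toLocalIterator(), 100000):
    process(batch)  # placeholder: replace with your own handling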
You can adjust the amount of data collected in a single batch by using a specific partitioner and tuning the number of partitions.
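For example, since each fetch pulls one partition, repartitioning the RDD controls roughly how much data arrives per batch. The target of 100 partitions below is an arbitrary choice for illustration (note that repartition triggers a shuffle):

# With 100 partitions over 100000 elements, each fetch brings back
# roughly 1000 elements, assuming the data is evenly distributed
iterator = rdd.repartition(100).toLocalIterator()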
Unfortunately this comes at a price: to collect small batches you have to start multiple Spark jobs, which is quite expensive. So, generally speaking, collecting one element at a time is not a viable option.