
Why does SparkContext.parallelize use memory of the driver?

I need to create a parallelized collection using sc.parallelize() in PySpark (Spark 2.1.0).

The collection in my driver program is big. When I parallelize it, it takes up a lot of memory on the master node.

It seems the collection is still kept in Spark's memory on the master node even after I parallelize it out to the worker nodes. Here's an example of my code:

# my python code
sc = SparkContext()
a = [1.0] * 1000000000          # one billion floats, built in driver memory
rdd_a = sc.parallelize(a, 1000000)
total = rdd_a.reduce(lambda x, y: x + y)   # avoid shadowing the built-in sum()

I've tried

del a

to destroy it, but it didn't work. The Spark Java process is still using a lot of memory.

After I create rdd_a, how can I destroy a to free the master node's memory?

Thanks!

PengNi asked Dec 03 '25

2 Answers

The collection in my driver program is big. When I parallelize it, it takes up a lot of memory on the master node.

That's how it's supposed to be, and that's why SparkContext.parallelize is only meant for demos and learning purposes, i.e. for quite small datasets.

Quoting the scaladoc of parallelize:

parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T] Distribute a local Scala collection to form an RDD.

Note "a local Scala collection" — it means that the collection you want to map to an RDD (or create an RDD from) is already in the driver's memory.

In your case, a is a local Python variable and Spark knows nothing about it. What happens when you use parallelize is that the local variable (which is already in memory) is wrapped in this nice data abstraction called an RDD. It's simply a wrapper around the data that's already in memory on the driver. Spark can't do much about that; it's simply too late. But Spark plays nicely and pretends the data is just as distributed as other datasets you could have processed using Spark.

That's why parallelize is only meant for small datasets to play around (and mainly for demos).
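To make the "it's simply a wrapper" point concrete, here is a toy sketch (a hypothetical ToyRDD class, not Spark's actual implementation) showing why creating the RDD does not free the driver-side list — the wrapper holds a reference to the very same object:

```python
# Toy sketch, assuming a simplified model of what parallelize does:
# it wraps a reference to the driver-side collection, so the list's
# memory cannot be reclaimed just because an RDD now exists.
class ToyRDD:
    def __init__(self, data, num_slices):
        self.data = data            # same list object, still in driver memory
        self.num_slices = num_slices

a = [1.0] * 1000
rdd = ToyRDD(a, 10)
assert rdd.data is a                # no copy was made; the driver holds it all
```

This is also why `del a` in the question doesn't help much: the RDD (and Spark's serialized task data) keeps the values alive on the driver side.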

Jacek Laskowski answered Dec 06 '25


The master's job is to coordinate the workers and to give a worker a new task once it has completed its current one. To do that, the master needs to keep track of all of the tasks that need to be done for a given calculation.

Now, if the input were a file, a task would simply look like "read file F from byte X to byte Y". But because the input was in memory to begin with, each task carries its 1,000 numbers directly. And since the master needs to keep track of all 1,000,000 such tasks, that bookkeeping gets quite large.

Joe C answered Dec 06 '25