
Why does SparkContext.parallelize use memory of the driver?

I need to create a parallelized collection using sc.parallelize() in PySpark (Spark 2.1.0).

The collection in my driver program is big. When I parallelize it, it takes up a lot of memory on the master node.

It seems the collection is still kept in Spark's memory on the master node even after I parallelize it out to the worker nodes. Here's an example of my code:

# my python code
sc = SparkContext()
a = [1.0] * 1000000000          # one billion floats, built in driver memory
rdd_a = sc.parallelize(a, 1000000)
total = rdd_a.reduce(lambda x, y: x + y)   # avoid shadowing the built-in sum()

I've tried

del a

to destroy it, but it didn't work. The Spark Java process is still using a lot of memory.

After I create rdd_a, how can I destroy a to free the master node's memory?

Thanks!

PengNi asked Dec 03 '25

2 Answers

The collection in my driver program is big. When I parallelize it, it takes up a lot of memory on the master node.

That's how it's supposed to be, and that's why SparkContext.parallelize is only meant for demos and learning purposes, i.e. for quite small datasets.

Quoting the scaladoc of parallelize:

parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T] Distribute a local Scala collection to form an RDD.

Note "a local Scala collection" — it means that the collection you want to map to an RDD (or create an RDD from) is already in the driver's memory.

In your case, a is a local Python variable and Spark knows nothing about it. What happens when you use parallelize is that the local variable (which is already in memory) is wrapped in this nice data abstraction called an RDD. It's simply a wrapper around the data that's already in memory on the driver. Spark can't do much about that; it's simply too late. But Spark plays nicely and pretends the data is just as distributed as other datasets you could have processed using Spark.

That's why parallelize is only meant for small datasets to play around (and mainly for demos).
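To make the "it's simply a wrapper" point concrete, here is a toy sketch (a hypothetical ToyRDD class, not Spark's actual implementation) showing why creating the RDD does not free the driver-side list — the wrapper holds a reference to the very same object:

```python
# Toy sketch, assuming a simplified model of what parallelize does:
# it wraps a reference to the driver-side collection, so the list's
# memory cannot be reclaimed just because an RDD now exists.
class ToyRDD:
    def __init__(self, data, num_slices):
        self.data = data            # same list object, still in driver memory
        self.num_slices = num_slices

a = [1.0] * 1000
rdd = ToyRDD(a, 10)
assert rdd.data is a                # no copy was made; the driver holds it all
```

This is also why `del a` in the question doesn't help much: the RDD (and Spark's serialized task data) keeps the values alive on the driver side.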

Jacek Laskowski answered Dec 06 '25


The master's job is to coordinate the workers and to give a worker a new task once it has completed its current one. To do that, the master needs to keep track of all of the tasks that need to be done for a given calculation.

Now, if the input were a file, a task would simply look like "read file F from byte X to byte Y". But because the input was in memory to begin with, each task carries its 1,000 numbers directly. And since the master needs to keep track of all 1,000,000 such tasks, that bookkeeping gets quite large.

Joe C answered Dec 06 '25