Why is it possible to have "serialized results of n tasks (XXXX MB)" be greater than `spark.driver.memory` in pyspark?

I launched a spark job with these settings (among others):

spark.driver.maxResultSize  11GB
spark.driver.memory         12GB

I was debugging my pyspark job, and it kept giving me the error:

serialized results of 16 tasks (17.4 GB) is bigger than spark.driver.maxResultSize (11 GB)

So, I increased spark.driver.maxResultSize to 18 GB in the configuration settings, and it worked!
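
For reference, a setting like spark.driver.maxResultSize can be supplied when the context is built, roughly like this (the app name below is just a placeholder; spark.driver.memory itself normally has to be set before the driver JVM starts, e.g. via spark-submit --driver-memory or spark-defaults.conf):

    from pyspark import SparkConf, SparkContext

    conf = (
        SparkConf()
        .setAppName("result-size-demo")              # placeholder app name
        .set("spark.driver.maxResultSize", "18g")    # cap on total serialized task results
    )
    sc = SparkContext(conf=conf)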

Now, this is interesting because in both cases the spark.driver.memory was SMALLER than the serialized results returned.

Why is this allowed? I would have assumed it to be impossible, since the serialized results while I was debugging (17.4 GB) were larger than the driver memory (12 GB, as shown above).

How is this possible?

asked Jul 17 '16 by makansij

1 Answer

It is possible because spark.driver.memory configures the JVM driver process, not the Python interpreter. Data between the two is transferred over sockets, and the driver process does not have to keep all of the data in memory (it does not have to convert it into a local structure).
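
A minimal sketch of that flow (the toy RDD and app name are made up; the point is where the collected objects end up):

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("collect-demo")  # placeholder app name
    sc = SparkContext(conf=conf)

    # 16 partitions, mirroring the 16 tasks in the error message above.
    rdd = sc.parallelize(range(1000000), numSlices=16)

    # Each task ships its result back serialized; the driver JVM checks the
    # total against spark.driver.maxResultSize and then streams the bytes to
    # the Python process over a socket. The deserialized Python list below
    # lives in the Python interpreter's memory, which spark.driver.memory
    # does not govern.
    local_data = rdd.collect()
    print(len(local_data))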

answered Nov 09 '22 by user6022341