I launched a Spark job with these settings (among others):
spark.driver.maxResultSize 11GB
spark.driver.memory 12GB
I was debugging my PySpark job, and it kept failing with the error:
serialized results of 16 tasks (17.4 GB) is bigger than spark.driver.maxResultSize (11 GB)
So I increased spark.driver.maxResultSize to 18 GB in the configuration settings, and it worked!
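For reference, this is a minimal sketch of how I set these options when building the session (the app name is just a placeholder; values mirror the ones above):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("large-result-job")  # placeholder app name
    # Heap size of the JVM driver process; must be in place before the
    # driver JVM starts (here, or via spark-submit --driver-memory).
    .config("spark.driver.memory", "12g")
    # Upper bound on the total size of serialized task results
    # collected back to the driver.
    .config("spark.driver.maxResultSize", "18g")
    .getOrCreate()
)
```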
Now, this is interesting because in both cases spark.driver.memory was SMALLER than the serialized results returned.

Why is this allowed? I would have assumed it to be impossible: the serialized results were 17.4 GB when I was debugging, which is more than the driver's memory of 12 GB, as shown above. How is this possible?
It is possible because spark.driver.memory configures the JVM driver process, not the Python interpreter. Data between the two is transferred over sockets, so the driver process does not have to keep all of the data in memory (it does not convert it into a local structure).
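To make the distinction concrete, here is a minimal PySpark sketch (the DataFrame is just a stand-in for real data):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("demo").getOrCreate()

# Stand-in for a large distributed result.
df = spark.range(0, 1_000_000)

# collect() is what spark.driver.maxResultSize limits: the serialized
# task results pass through the JVM driver and are then streamed over a
# local socket to this Python process, so the resulting list lives in
# Python memory rather than in the JVM heap sized by spark.driver.memory.
rows = df.collect()

# If the Python side should not hold everything at once either,
# toLocalIterator() fetches one partition at a time.
for row in df.toLocalIterator():
    pass  # process rows incrementally
```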