
Total size of serialized results of tasks is bigger than spark.driver.maxResultSize

Good day.

I am running development code that parses log files. It runs smoothly when I parse a small number of files, but as I increase the number of log files, it starts failing with different errors, such as "too many open files" and "Total size of serialized results of tasks is bigger than spark.driver.maxResultSize".

I tried increasing spark.driver.maxResultSize, but the error persists.

Can you give me any ideas on how to resolve this issue?

Thanks.

Sample Error

Asked Oct 16 '17 by Reijay




1 Answer

"Total size of serialized results of tasks is bigger than spark.driver.maxResultSize" means that when an executor tried to send its results to the driver, they exceeded spark.driver.maxResultSize. One possible solution, as @mayank agrawal mentioned above, is to keep increasing it until it works (not a recommended solution if an executor is genuinely trying to send too much data).
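If you do raise the limit, it can be set when you build the SparkSession or on the spark-submit command line. A minimal sketch, assuming PySpark (the "4g" value, app name, and script name are placeholders; the default is 1g, and "0" means unlimited):

    # Sketch: raising spark.driver.maxResultSize at session startup.
    # "4g" is illustrative -- size it to the driver's available memory.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("log-parser")                       # hypothetical app name
        .config("spark.driver.maxResultSize", "4g")  # default 1g; "0" = unlimited
        .getOrCreate()
    )

    # Command-line equivalent:
    #   spark-submit --conf spark.driver.maxResultSize=4g parse_logs.py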

I would suggest looking into your code to see whether the data is skewed, making one executor do most of the work and produce a large amount of data in/out. If the data is skewed, you could try repartitioning it.
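A minimal repartitioning sketch, assuming the parsed logs are in a DataFrame named logs (the name, the "host" column, and the partition count of 200 are all placeholders):

    # Sketch: spreading skewed data across more, evenly sized partitions.
    logs.rdd.getNumPartitions()            # inspect the current partition count

    logs = logs.repartition(200)           # full shuffle into 200 partitions
    # Or hash-partition on a column if a single key dominates:
    logs = logs.repartition(200, "host")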

As for the "too many open files" issue, a possible cause is that Spark creates a number of intermediate files before a shuffle. This can happen when an executor uses too many cores, with high parallelism, or with many unique keys (a likely cause in your case, given the huge number of input files). One option to look into is consolidating those intermediate files with this flag when you run spark-submit: --conf spark.shuffle.consolidateFiles=true
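For reference, a sketch of setting that flag programmatically; note that spark.shuffle.consolidateFiles only applies to the old hash-based shuffle and was removed in Spark 2.0, where sort-based shuffle is the default:

    # Sketch: consolidating intermediate shuffle files (Spark 1.x only;
    # the flag was removed in Spark 2.0 along with hash-based shuffle).
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .config("spark.shuffle.consolidateFiles", "true")
        .getOrCreate()
    )

    # Command-line equivalent:
    #   spark-submit --conf spark.shuffle.consolidateFiles=true parse_logs.py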

One more thing to check, if it resembles your use case, is this thread: https://issues.apache.org/jira/browse/SPARK-12837

Answered Sep 22 '22 by joshi.n