Good day.
I am running development code that parses some log files. The code runs smoothly when I parse only a few files, but as I increase the number of log files to parse, it returns errors such as too many open files and Total size of serialized results of tasks is bigger than spark.driver.maxResultSize.
I tried increasing spark.driver.maxResultSize, but the error still persists.
Can you give me any ideas on how to resolve this issue?
Thanks.
Total size of serialized results of tasks is bigger than spark.driver.maxResultSize means that when an executor tries to send its result to the driver, the result exceeds spark.driver.maxResultSize. One possible solution, as mentioned above by @mayank agrawal, is to keep increasing it until it works (not a recommended approach if an executor is trying to send too much data).
I would suggest looking into your code to check whether the data is skewed, which would make one executor do most of the work and push a lot of data in and out. If the data is skewed, you could try repartitioning it, as sketched below.
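A minimal PySpark sketch of repartitioning to spread skewed data more evenly; the input path, the DataFrame name logs_df, and the partition count of 200 are assumptions for illustration, not taken from your code:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("log-parser").getOrCreate()

# Hypothetical input path; point this at your actual log files.
logs_df = spark.read.text("hdfs:///logs/*.log")

# Round-robin repartition breaks up oversized partitions caused by skew.
# If one key dominates, repartitioning by a higher-cardinality column
# (e.g. logs_df.repartition(200, "some_column")) can help even more.
balanced_df = logs_df.repartition(200)

# Quick check of how the data is now distributed.
print(balanced_df.rdd.getNumPartitions())
```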
For the too many open files issue, a possible cause is that Spark creates a large number of intermediate files before the shuffle. This can happen when an executor uses too many cores, the parallelism is high, or there are many unique keys (a likely cause in your case: the huge number of input files). One solution to look into is consolidating those intermediate files through this flag when you do spark-submit: --conf spark.shuffle.consolidateFiles=true (an example follows).
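A sketch of how these settings could be applied in code instead of on the spark-submit command line; the 4g value for spark.driver.maxResultSize is an illustrative assumption, and note that spark.shuffle.consolidateFiles is only recognized by older Spark versions that still use the hash-based shuffle manager (it was removed in Spark 2.0):

```python
from pyspark.sql import SparkSession

# Equivalent to passing on spark-submit:
#   --conf spark.shuffle.consolidateFiles=true --conf spark.driver.maxResultSize=4g
spark = (
    SparkSession.builder
    .appName("log-parser")
    # Merge the many small intermediate shuffle files (hash shuffle, Spark 1.x only).
    .config("spark.shuffle.consolidateFiles", "true")
    # Raise the cap on serialized task results sent to the driver (value is illustrative).
    .config("spark.driver.maxResultSize", "4g")
    .getOrCreate()
)
```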
One more thing to check is this thread, if it is similar to your use case: https://issues.apache.org/jira/browse/SPARK-12837