We have a Spark application that continuously processes a large number of incoming jobs. Several jobs are processed in parallel, on multiple threads.
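The submission pattern is roughly the following (a simplified sketch; the object name, thread-pool size and file paths are illustrative, not taken from the actual application):

import java.util.concurrent.Executors
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}
import org.apache.spark.{SparkConf, SparkContext}

object ContinuousProcessor {
  def main(args: Array[String]): Unit = {
    // One shared SparkContext; several driver-side threads submit actions,
    // so many Spark jobs and stages are in flight at the same time.
    val sc = new SparkContext(new SparkConf().setAppName("continuous-processor"))
    implicit val ec: ExecutionContext =
      ExecutionContext.fromExecutor(Executors.newFixedThreadPool(8))

    // Illustrative input paths; in reality these arrive continuously,
    // as a mix of small and large files.
    val incomingPaths = Seq("/data/in/small-0001.txt", "/data/in/large-0001.txt")

    val jobs = incomingPaths.map { path =>
      Future {
        // Each action (count, save, ...) is submitted as a separate Spark job.
        sc.textFile(path).map(_.length).count()
      }
    }
    jobs.foreach(f => Await.ready(f, Duration.Inf))
    sc.stop()
  }
}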
During intensive workloads, at some point, we start to see warnings like these:
16/12/14 21:04:03 WARN JobProgressListener: Task end for unknown stage 147379
16/12/14 21:04:03 WARN JobProgressListener: Job completed for unknown job 64610
16/12/14 21:04:04 WARN JobProgressListener: Task start for unknown stage 147405
16/12/14 21:04:04 WARN JobProgressListener: Task end for unknown stage 147406
16/12/14 21:04:04 WARN JobProgressListener: Job completed for unknown job 64622
From that point on, the performance of the app plummets and most stages and jobs never finish. On the Spark UI, I can see figures like 13000 pending/active jobs.
I can't clearly see any earlier exception with more information. Maybe this one, but it concerns another listener:
16/12/14 21:03:54 ERROR LiveListenerBus: Dropping SparkListenerEvent because no remaining room in event queue. This likely means one of the SparkListeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler.
16/12/14 21:03:54 WARN LiveListenerBus: Dropped 1 SparkListenerEvents since Thu Jan 01 01:00:00 CET 1970
This is a very annoying problem, because there is no clear crash or clear ERROR message we could catch in order to relaunch the app.
UPDATE:
What bugs me most is that I would expect this to happen on large configurations (a large cluster would flood the driver with task results more easily), but that's not the case. Our cluster is rather small; its only particularity is that we tend to process a mix of small and large files, and the small files generate many tasks that finish quickly.
I may have found a workaround: raising the value of spark.scheduler.listenerbus.eventqueue.size (100000 instead of the default 10000) seems to help, but it may only postpone the problem.
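For reference, this is how the setting can be applied when building the SparkConf (minimal sketch; the app name is illustrative, the config key is the one documented for this Spark version):

import org.apache.spark.{SparkConf, SparkContext}

// Raise the listener bus queue from the default 10000 to 100000 events,
// so bursts of task-start/task-end events are less likely to be dropped.
// (Newer Spark versions rename this key to
// spark.scheduler.listenerbus.eventqueue.capacity.)
val conf = new SparkConf()
  .setAppName("continuous-processor")
  .set("spark.scheduler.listenerbus.eventqueue.size", "100000")
val sc = new SparkContext(conf)

The same value can also be passed on the command line with --conf spark.scheduler.listenerbus.eventqueue.size=100000.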