Spark never finishes jobs and stages, JobProgressListener crash

Tags:

apache-spark

We have a Spark application that continuously processes a large number of incoming jobs. Several jobs are processed in parallel, on multiple threads.

During intensive workloads, at some point, we start to see warnings like these:

16/12/14 21:04:03 WARN JobProgressListener: Task end for unknown stage 147379
16/12/14 21:04:03 WARN JobProgressListener: Job completed for unknown job 64610
16/12/14 21:04:04 WARN JobProgressListener: Task start for unknown stage 147405
16/12/14 21:04:04 WARN JobProgressListener: Task end for unknown stage 147406
16/12/14 21:04:04 WARN JobProgressListener: Job completed for unknown job 64622

From that point on, the performance of the app plummets and most stages and jobs never finish. In the Spark UI, I can see figures like 13000 pending/active jobs.

I can't see any clearer exception occurring beforehand with more information. Maybe this one, but it concerns another listener:

16/12/14 21:03:54 ERROR LiveListenerBus: Dropping SparkListenerEvent because no remaining room in event queue. This likely means one of the SparkListeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler.
16/12/14 21:03:54 WARN LiveListenerBus: Dropped 1 SparkListenerEvents since Thu Jan 01 01:00:00 CET 1970

This is a very annoying problem, because there is no clear crash or clear ERROR message we could catch in order to relaunch the app.

UPDATE:

  • Problem occurs with Spark 2.0.2 and Spark 2.1.1
  • Most probably related to SPARK-18838

What bugs me most is that I would expect this to happen on large configurations (a large cluster would DDoS the driver with task results more easily), but that's not the case here. Our cluster is rather small; the only particularity is that we tend to process a mix of small and large files, and the small files generate many tasks that finish quickly.

mathieu asked Dec 14 '16 at 21:12


1 Answer

I may have found a workaround:

Changing the value of spark.scheduler.listenerbus.eventqueue.size (100000 instead of the default 10000) seems to help, but it may only postpone the problem.
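For reference, here is a minimal sketch of how that property can be set when building the driver context (the application name is a placeholder; the setting has to be applied before the SparkContext is created, since the listener bus is set up at startup):

import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: raise the listener bus queue size before creating the context.
// "my-app" is just a placeholder application name.
val conf = new SparkConf()
  .setAppName("my-app")
  .set("spark.scheduler.listenerbus.eventqueue.size", "100000") // default is 10000

val sc = new SparkContext(conf)

The same property can also be passed on the command line with --conf spark.scheduler.listenerbus.eventqueue.size=100000 when using spark-submit.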

mathieu answered Nov 05 '22 at 15:11