I am running a pipeline that processes my data on Spark. It seems that my executors die every now and then when they get close to the Storage Memory limit. The job continues and eventually finishes, but is this normal behaviour? Is there something I should be doing to prevent it from happening? Every time this happens the job hangs for a while until (and I am guessing here) YARN provides new executors for the job to continue.
Executors provide in-memory storage for Spark RDDs that user programs cache through the Block Manager. Under static allocation, executors run for the entire lifespan of an application. How does an Apache Spark executor work? Executors are launched at the start of an application and then execute the tasks they are sent for as long as the application runs.
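The Storage Memory mentioned in the question is the pool that holds these cached blocks. As a minimal sketch (the RDD and its size are purely illustrative), caching with MEMORY_AND_DISK lets blocks spill to disk rather than adding pressure on the storage pool:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder().appName("cache-demo").getOrCreate()
    val sc = spark.sparkContext

    // Illustrative RDD; its cached blocks are tracked by each executor's Block Manager
    val numbers = sc.parallelize(1 to 1000000)
    numbers.persist(StorageLevel.MEMORY_AND_DISK) // partitions that do not fit in memory go to disk

    println(numbers.count()) // the first action materialises and caches the blocks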
Troubleshooting hundreds of Spark jobs in recent times has taught me that a Fetch Failed Exception mainly comes down to the following causes (a configuration sketch follows the list):
1. Out of heap memory on executors
2. Low memory overhead on executors
3. Shuffle block greater than 2 GB
4. Network timeout
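Each of these causes maps to a standard Spark configuration knob. The values below are illustrative only, a hedged starting point rather than a recommendation:

    import org.apache.spark.SparkConf

    // Example settings addressing the four causes above; sizes are placeholders to tune per cluster.
    val conf = new SparkConf()
      .set("spark.executor.memory", "8g")          // 1. more heap on each executor
      .set("spark.executor.memoryOverhead", "2g")  // 2. more off-heap overhead per executor
      .set("spark.sql.shuffle.partitions", "800")  // 3. more partitions keep each shuffle block under 2 GB
      .set("spark.network.timeout", "600s")        // 4. a more generous network timeout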
It is possible to have as many Spark executors as there are data nodes, and each executor can use as many cores as the cluster manager can provide. An executor is described by its id, hostname, environment (SparkEnv), and classpath.
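At runtime you can inspect the executors an application actually holds through the status tracker; a small sketch (the app name is arbitrary):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("executor-listing").getOrCreate()
    val sc = spark.sparkContext

    // One entry per known executor (the driver is included); shows host, port, cached bytes and running tasks
    sc.statusTracker.getExecutorInfos.foreach { e =>
      println(s"executor ${e.host}:${e.port} cached=${e.cacheSize}B running=${e.numRunningTasks}")
    }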
There are a few situations in which an executor is created:
- When CoarseGrainedExecutorBackend receives a RegisteredExecutor message (Spark Standalone and YARN only).
- When Mesos's MesosExecutorBackend registers with Spark.
- When a LocalEndpoint is created for local mode.
I think this turned out to be related to a YARN bug. It doesn't happen anymore after I set the following YARN options, as suggested in section 4 of this blog post:
Best practice 5: Always set the virtual and physical memory check flag to false.
"yarn.nodemanager.vmem-check-enabled":"false",
"yarn.nodemanager.pmem-check-enabled":"false"