I did not configure any timeout value but used the default settings. Where is this 3600-second timeout configured, and how can I solve it?
Error message:
18/01/10 13:51:44 WARN Executor: Issue communicating with driver in heartbeater
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [3600 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:738)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply$mcV$sp(Executor.scala:767)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:767)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:767)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1948)
at org.apache.spark.executor.Executor$$anon$2.run(Executor.scala:767)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [3600 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
... 14 more
Spark uses a master/slave architecture: one central coordinator (the driver) communicates with many distributed workers (executors). The driver and each of the executors run in their own Java processes.
spark.executor.heartbeatInterval is the interval at which each executor reports its heartbeats to the driver. If garbage collection takes a long time on an executor, increasing spark.network.timeout gives the driver more time to wait for a response before it marks the executor as lost and starts a new one.
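For example, both values can be set together in $SPARK_HOME/conf/spark-defaults.conf (the values below are illustrative, not recommendations; the heartbeat interval should stay well below the network timeout):

spark.executor.heartbeatInterval 60s
spark.network.timeout 600s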
To increase the Spark shuffle service memory, modify SPARK_DAEMON_MEMORY in $SPARK_HOME/conf/spark-env.sh (the default value is 2g), then restart the shuffle service for the change to take effect.
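A single line in spark-env.sh is enough (4g is an illustrative value):

export SPARK_DAEMON_MEMORY=4g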
According to the recommendations discussed above (the figures imply a cluster of 10 nodes with 150 cores in total and 64GB of memory per node):
Number of available executors = total cores / cores per executor = 150 / 5 = 30.
Leaving 1 executor for the YARN ApplicationMaster gives --num-executors = 29.
Executors per node = 30 / 10 = 3.
Memory per executor = 64GB / 3 ≈ 21GB.
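Plugged into a spark-submit invocation, these numbers would look like this (a sketch; the application JAR is a hypothetical placeholder, and in practice the executor memory is often reduced slightly to leave room for memory overhead):

spark-submit \
  --num-executors 29 \
  --executor-cores 5 \
  --executor-memory 21g \
  your-app.jar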
In the error message it says:
This timeout is controlled by spark.executor.heartbeatInterval
Hence, the first thing to try is increasing this value. This can be done in several ways, for example by increasing it to 10000 seconds:
When using spark-submit
simply add the flag:
--conf spark.executor.heartbeatInterval=10000s
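A complete command might then look like this (the class name and application JAR are hypothetical placeholders; spark.network.timeout is raised as well, since it should stay larger than the heartbeat interval):

spark-submit \
  --class com.example.MyApp \
  --conf spark.executor.heartbeatInterval=10000s \
  --conf spark.network.timeout=12000s \
  my-app.jar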
When using spark-defaults.conf
add a line to $SPARK_HOME/conf/spark-defaults.conf:
spark.executor.heartbeatInterval 10000s
When creating a new SparkSession
in your program, add a config parameter (Scala):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .config("spark.executor.heartbeatInterval", "10000s")
  .getOrCreate()
If this does not help, it could be a good idea to also increase the value of spark.network.timeout, another common source of problems related to these kinds of timeouts. The Spark documentation recommends keeping spark.executor.heartbeatInterval significantly smaller than spark.network.timeout, so raise the two together.
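For completeness, a minimal sketch setting both values when creating the SparkSession (the values are illustrative, not recommendations):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .config("spark.executor.heartbeatInterval", "10000s")
  // Keep the network timeout larger than the heartbeat interval.
  .config("spark.network.timeout", "12000s")
  .getOrCreate()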