Intermittent Timeout Exception using Spark

I have a Spark cluster with 10 nodes, and I'm getting this exception after using the Spark context for the first time:

14/11/20 11:15:13 ERROR UserGroupInformation: PriviledgedActionException as:iuberdata (auth:SIMPLE) cause:java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1421)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:52)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:113)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:156)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.security.PrivilegedActionException: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
    ... 4 more

Someone here had a similar problem, but I've already tried his solution and it didn't work.

The same exception is also reported here, but the problem isn't the same in my case, as I'm using Spark version 1.1.0 on both the master and the slaves as well as on the client.

I've tried increasing the timeout to 120s, but that still doesn't solve the problem.

I'm deploying the environment through scripts, and I'm using context.addJar to include my code in the classpath (roughly as sketched below). The problem is intermittent, and I have no idea how to track down why it is happening. Has anybody faced this issue when configuring a Spark cluster, and do you know how to solve it?
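
For reference, the setup looks roughly like this; the timeout properties, the 300-second value, the app name, and the jar path are illustrative placeholders rather than my exact configuration:

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative values only: the property names are the usual Spark 1.x timeout knobs,
    // and the jar path is a placeholder.
    val conf = new SparkConf()
      .setAppName("my-app")
      .set("spark.akka.timeout", "300")                      // Akka remoting/ask timeout, in seconds
      .set("spark.core.connection.ack.wait.timeout", "300")  // ack wait timeout, in seconds

    val sc = new SparkContext(conf)

    // Ship the application jar so executors have the code on their classpath.
    sc.addJar("/path/to/my-application.jar")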

asked Nov 20 '14 by dirceusemighini


1 Answer

We had a similar problem which was quite hard to debug and isolate. Long story short: Spark uses Akka, which is very picky about FQDN hostnames resolving to IP addresses. Even if you specify the IP address everywhere, it is not enough. The answer here helped us isolate the problem.

A useful test is to run netcat -l <port> on the master and nc -vz <host> <port> on the worker to check connectivity. Run the test with both the IP address and the FQDN. You can get the name Spark is using from the WARN message in the log snippet below. For us it was host032s4.staging.companynameremoved.info. The IP address test passed for us, and the FQDN test failed, as our DNS was not set up correctly.

INFO 2015-07-24 10:33:45 Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:35455]
INFO 2015-07-24 10:33:45 Remoting: Remoting now listens on addresses: [akka.tcp://[email protected]:35455]
INFO 2015-07-24 10:33:45 org.apache.spark.util.Utils: Successfully started service 'driverPropsFetcher' on port 35455.
WARN 2015-07-24 10:33:45 Remoting: Tried to associate with unreachable remote address [akka.tcp://[email protected]:50855]. Address is now gated for 60000 ms, all messages to this address will be delivered to dead letters.
ERROR 2015-07-24 10:34:15 org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:skumar cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
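
The same check can also be done from inside the JVM. A minimal Scala sketch (the host and port below are just the example values from the log above, substitute your own):

    import java.net.{InetAddress, InetSocketAddress, Socket}

    // Placeholders taken from the log above; use your own FQDN and port.
    val host = "host032s4.staging.companynameremoved.info"
    val port = 35455

    // Step 1: does the name resolve, and to which address?
    println(InetAddress.getByName(host).getHostAddress)

    // Step 2: can we open a TCP connection (the JVM equivalent of `nc -vz <host> <port>`)?
    val socket = new Socket()
    try {
      socket.connect(new InetSocketAddress(host, port), 5000) // 5 second connect timeout
      println(s"Connected to $host:$port")
    } finally {
      socket.close()
    }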

Another thing we had to do was specify the spark.driver.host and spark.driver.port properties in the spark-submit script, because we had machines with two IP addresses and the FQDN resolved to the wrong one.
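
Something along these lines (the address and port below are placeholders for whatever is correct on your machines):

    import org.apache.spark.SparkConf

    // Placeholders: use the address of the interface the executors can actually reach,
    // and any free port you want to pin the driver to.
    val conf = new SparkConf()
      .set("spark.driver.host", "10.1.2.3")
      .set("spark.driver.port", "51000")

    // Equivalent spark-submit flags:
    //   --conf spark.driver.host=10.1.2.3 --conf spark.driver.port=51000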

Make sure your network and DNS entries are correct!!

answered by Saket