Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Spark Cluster, failed to connect to master. (WARN Worker: Failed to connect to master)

Tags:

apache-spark

I have a spark cluster with 2 nodes, master(172.17.0.229) and slave(172.17.0.228). I have edited spark-env.sh, added SPARK_MASTER_IP=127.17.0.229 and slaves, added 172.17.0.228.

I am starting my master node using start-master.sh and slave node using start-slaves.sh.

I can see the webUI with a master node with no worker, but the log of worker node is as:

Spark Command: /usr/lib/jvm/java-7-oracle/jre/bin/java -cp /usr/local/src/spark-1.5.2-bin-hadoop2.6/sbin/../conf/:/usr/local/src/spark-1.5.2-bin-hadoop$
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/12/18 14:17:25 INFO Worker: Registered signal handlers for [TERM, HUP, INT]
15/12/18 14:17:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/12/18 14:17:26 INFO SecurityManager: Changing view acls to: ujjwal
15/12/18 14:17:26 INFO SecurityManager: Changing modify acls to: ujjwal
15/12/18 14:17:26 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ujjwal); users wit$
15/12/18 14:17:27 INFO Slf4jLogger: Slf4jLogger started
15/12/18 14:17:27 INFO Remoting: Starting remoting
15/12/18 14:17:27 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:47599]
15/12/18 14:17:27 INFO Utils: Successfully started service 'sparkWorker' on port 47599.
15/12/18 14:17:27 INFO Worker: Starting Spark worker 172.17.0.228:47599 with 2 cores, 2.7 GB RAM
15/12/18 14:17:27 INFO Worker: Running Spark version 1.5.2
15/12/18 14:17:27 INFO Worker: Spark home: /usr/local/src/spark-1.5.2-bin-hadoop2.6
15/12/18 14:17:27 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
15/12/18 14:17:27 INFO WorkerWebUI: Started WorkerWebUI at http://172.17.0.228:8081
15/12/18 14:17:27 INFO Worker: Connecting to master 127.17.0.229:7077...
15/12/18 14:17:27 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://[email protected]:7077] has failed, address is now$
15/12/18 14:17:27 WARN Worker: Failed to connect to master 127.17.0.229:7077
akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka.tcp://[email protected]:7077/), Path(/user/Master)]
        at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:65)
        at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:63)
        at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
        at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
        at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:73)
        at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74)
        at akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:120)
        at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73)
        at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
        at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
        at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:266)
        at akka.actor.EmptyLocalActorRef.specialHandle(ActorRef.scala:533)
        at akka.actor.DeadLetterActorRef.specialHandle(ActorRef.scala:569)
        at akka.actor.DeadLetterActorRef.$bang(ActorRef.scala:559)
        at akka.remote.RemoteActorRefProvider$RemoteDeadLetterActorRef.$bang(RemoteActorRefProvider.scala:87)
        at akka.remote.EndpointWriter.postStop(Endpoint.scala:557)
        at akka.actor.Actor$class.aroundPostStop(Actor.scala:477)
        at akka.remote.EndpointActor.aroundPostStop(Endpoint.scala:411)
        at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210)
        at akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:172)
        at akka.actor.ActorCell.terminate(ActorCell.scala:369)
        at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:462)

Thanks for your suggestion.

like image 644
user3180835 Avatar asked Dec 25 '22 11:12

user3180835


2 Answers

Generally, checking the IP that your worker is trying to connect to against the reported spark://...:7077 address on the web UI at 172.17.0.229 port 8080 will help identify whether the address is correct.

In this particular case, it looks like you have a typo; change

SPARK_MASTER_IP=127.17.0.229

to read:

SPARK_MASTER_IP=172.17.0.229

(you seem to have 127/172 inverted).

like image 102
Dennis Huo Avatar answered May 24 '23 13:05

Dennis Huo


My issue was a version mismatch between the spark java library I was using (2.0.0) and the version of the spark cluster (2.2.1)

like image 32
Michael Avatar answered May 24 '23 11:05

Michael