I am trying to set up a Spark standalone cluster following the official documentation.
My master is on a local VM running Ubuntu, and I also have one worker running on the same machine. The worker connects, and I can see its status in the master's web UI.
[Screenshot of the master's web UI]
But when I try to connect a slave from another machine, it does not work.
This is the log message I get in the worker when I start it from another machine. I have tried using start-slaves.sh from the master after updating conf/slaves, and also start-slave.sh spark://spark:7077 from the slave.
[Master hostname - spark; Worker hostname - worker]
15/07/01 11:54:16 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkMaster@spark:7077] has failed, address is now gated for [5000] ms. Reason is: [Association failed with [akka.tcp://sparkMaster@spark:7077]].
15/07/01 11:54:59 ERROR Worker: All masters are unresponsive! Giving up.
15/07/01 11:54:59 INFO Utils: Shutdown hook called
When I try to telnet from the slave to the master, this is what I get -
root@worker:~# telnet spark 7077
Trying 10.xx.xx.xx...
Connected to spark.
Escape character is '^]'.
Connection closed by foreign host.
Telnet seems to work, but the connection is closed as soon as it is established. Could this have something to do with the problem?
I have added the master and slave IP addresses to /etc/hosts on both machines. I followed all the solutions given at "SPARK + Standalone Cluster: Cannot start worker from another machine", but they have not worked for me.
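For reference, the /etc/hosts entries on both machines look roughly like this (the addresses below are placeholders, not my actual IPs):

10.0.0.10   spark    # master
10.0.0.11   worker   # slave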
I have the following config set in spark-env.sh on both machines -
export SPARK_MASTER_IP=spark
export SPARK_WORKER_PORT=44444
Any help is greatly appreciated.
sbin/start-master.sh - Starts a master instance on the machine the script is executed on.
sbin/start-slaves.sh - Starts a slave instance on each machine specified in the conf/slaves file.
sbin/start-slave.sh - Starts a slave instance on the machine the script is executed on.
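As a rough sketch of how these scripts fit together (using the hostnames from the question; adjust them to your own setup):

# On the master (hostname: spark)
echo "worker" >> conf/slaves        # conf/slaves lists one worker hostname per line
sbin/start-master.sh                # master comes up as spark://spark:7077
sbin/start-slaves.sh                # SSHes into every host in conf/slaves and starts a worker there

# Or, on the worker machine itself:
sbin/start-slave.sh spark://spark:7077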
So yes, if the master fails, the executors will no longer be able to communicate with it and will stop working. The driver will also be unable to reach the master for job status, so your application will fail.
The Apache Spark framework uses a master-slave architecture that consists of a driver, which coordinates the application, and many executors that run across the worker nodes in the cluster.
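For example, the driver is what you launch with spark-submit against the master URL; a rough sketch (the examples jar path varies by Spark version and install layout):

./bin/spark-submit \
  --master spark://spark:7077 \
  --class org.apache.spark.examples.SparkPi \
  examples/jars/spark-examples_*.jar 100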
I encountered the exact same problem as you and just figured out how to get it to work.
The problem is that your Spark master is listening on the hostname, in your example spark. This lets the worker on the same host register successfully, but registration fails from another machine when you run start-slave.sh spark://spark:7077.
The solution is to make sure the value of SPARK_MASTER_IP is set to an IP address in conf/spark-env.sh:
SPARK_MASTER_IP=<your host ip>
on your master node, and start your Spark master as usual. You can open the web UI to make sure the master appears as spark://YOUR_HOST_IP:7077 after the start. Then, on another machine, the command start-slave.sh spark://<your host ip>:7077 should start and register the worker with the master successfully.
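If you want to double-check what the master is actually bound to, you can look at port 7077 on the master machine; a quick sketch (whether you have netstat or ss depends on your distro):

# On the master machine
sudo netstat -tlnp | grep 7077    # or: sudo ss -tlnp | grep 7077
# If the listening address is 127.0.x.x or something other machines cannot reach,
# remote workers will fail to register, which matches the error above.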
Hope it helps.
It depends on your Spark version; each needs a different configuration. If your Spark version is 1.6, add this line to conf/spark-env.sh so another machine can connect to the master:
SPARK_MASTER_IP=your_host_ip
If your Spark version is 2.x, add these lines to your conf/spark-env.sh:
SPARK_MASTER_HOST=your_host_ip
SPARK_LOCAL_IP=your_host_ip
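Putting it together, a minimal conf/spark-env.sh on the master for the 2.x case might look like this (192.168.1.10 is only a placeholder for your master's IP):

# conf/spark-env.sh (Spark 2.x)
SPARK_MASTER_HOST=192.168.1.10    # IP the master binds to and advertises
SPARK_LOCAL_IP=192.168.1.10       # IP this node uses for itself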
After adding these lines, start Spark:
./sbin/start-all.sh
If you did it right, you can see at <your_host_ip>:8080 that the Spark master URL is spark://<your_host_ip>:7077.
Be careful: your_host_ip should not be localhost; it must be exactly the host IP that you set in conf/spark-env.sh.
After all that, you can connect another machine to the master with the command below:
./sbin/start-slave.sh spark://your_host_ip:7077
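To confirm the worker really registered, you can check the worker log on that machine; on success it prints a "Successfully registered with master ..." line (assuming the default log location under $SPARK_HOME/logs):

# On the worker machine, after start-slave.sh
grep "Successfully registered with master" $SPARK_HOME/logs/spark-*-org.apache.spark.deploy.worker.Worker-*.out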